Docker registry doesn't stay up and keeps restarting

Bug #1879690 reported by Juerg Haefliger
This bug affects 8 people
Affects              Status         Importance   Assigned to   Milestone
linux (Ubuntu)       Fix Released   Undecided    Unassigned
  Eoan               Fix Released   Undecided    Unassigned
  Focal              Fix Released   Undecided    Unassigned

Bug Description

[Impact]
The changes applied for bug 1857257 and its follow-up fix in bug 1876645, which were released to focal-updates and eoan-updates, introduced a regression in overlayfs that breaks the docker snap.

[Test case]
See the original bug report below.

[Fix]
While we don't have a final fix, the interim solution is to revert the following commits:

UBUNTU: SAUCE: overlayfs: fix shitfs special-casing
UBUNTU: SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay

[Regression potential]
Low. Reverting these two commits will reintroduce the issue reported in bug 1857257, but will fix the other use cases that were broken by the latest release.

Original bug report.
-----------------------------------
Tested kernels:
Focal 5.4.0-31.35
Eoan 5.3.0-53.47

To reproduce:
1) Spin up a cloud image
2) snap install docker
3) auth_folder=/var/snap/docker/common/auth
4) mkdir -p $auth_folder
5) docker run --entrypoint htpasswd registry:2 -Bbn user passwd > $auth_folder/htpasswd
6) docker run -d -p 5000:5000 --restart=always --name registry \
  -v $auth_folder:/auth \
  -e "REGISTRY_AUTH=htpasswd" \
  -e "REGISTRY_AUTH_HTPASSWD_REALM=Registry Realm" \
  -e REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
  registry:2

On a good kernel 'docker ps' shows something like:
# docker ps
CONTAINER ID   IMAGE        COMMAND                  CREATED          STATUS          PORTS                    NAMES
a346b65b4509   registry:2   "/entrypoint.sh /etc…"   14 seconds ago   Up 12 seconds   0.0.0.0:5000->5000/tcp   registry

On a bad kernel:
# docker ps
CONTAINER ID   IMAGE        COMMAND                  CREATED         STATUS                         PORTS   NAMES
0322374f1b1d   registry:2   "/entrypoint.sh /etc…"   5 seconds ago   Restarting (2) 1 second ago            registry

Note status 'Restarting' on the bad kernel.
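To check for the restart loop without eyeballing 'docker ps', a small POSIX shell sketch follows. It assumes a running Docker daemon and the container name 'registry' from step 6; the helper names classify_status and check_registry are hypothetical. check_registry reads the RestartCount field from 'docker inspect', which one would expect to stay at 0 on a good kernel and keep growing while --restart=always keeps restarting a crashing container.

```sh
#!/bin/sh
# Classify the STATUS column printed by `docker ps`.
# "Up ..." means the container is running; "Restarting (N) ..." means it is
# in a restart loop, where N is the exit code of the last run.
classify_status() {
  case "$1" in
    "Up "*) echo good ;;
    "Restarting "*) echo bad ;;
    *) echo unknown ;;
  esac
}

# Hypothetical helper: query a live container by name (default: registry).
# Requires a running Docker daemon; returns 2 if `docker inspect` fails,
# e.g. because the container does not exist.
check_registry() {
  name=${1:-registry}
  count=$(docker inspect --format '{{.RestartCount}}' "$name") || return 2
  if [ "$count" -eq 0 ]; then
    echo good
  else
    echo "bad (restarted $count times)"
  fi
}
```

Running check_registry 15 to 30 seconds after step 6 gives the crash loop time to show up; classify_status can be fed the STATUS text shown in the outputs above.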

This seems to have been introduced by one of the following commits:
b3bdda24f1bc UBUNTU: SAUCE: overlayfs: fix shitfs special-casing
6f18a8434050 UBUNTU: SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay
629edd70891c UBUNTU: SAUCE: shiftfs: record correct creator credentials
cfaa482afb97 UBUNTU: SAUCE: shiftfs: fix dentry revalidation

Kernels that don't have these commits seem fine.

Juerg Haefliger (juergh)
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1879690

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Eoan):
status: New → Incomplete
Changed in linux (Ubuntu Focal):
status: New → Incomplete
Changed in linux (Ubuntu Eoan):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Focal):
status: Incomplete → Confirmed
Kleber Sacilotto de Souza (kleber-souza) wrote :

I have reverted the following two commits from both Eoan 5.3.0-53 and Focal 5.4.0-31 and I was not able to reproduce the problem anymore:

Revert "UBUNTU: SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay"
Revert "UBUNTU: SAUCE: overlayfs: fix shitfs special-casing"

Juerg Haefliger (juergh) wrote :

[ 267.532883] audit: type=1400 audit(1589983933.896:66): apparmor="DENIED" operation="open" profile="snap.docker.dockerd" name="/entrypoint.sh" pid=3373 comm="entrypoint.sh" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

This entry in dmesg looks suspicious.
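To pull the relevant fields out of an AppArmor denial like the one above, a POSIX shell sketch could look like this. The function name parse_denial is hypothetical, and it only handles the double-quoted key="value" fields (operation, profile, name) seen in that record.

```sh
#!/bin/sh
# Print the operation, profile and name fields of an AppArmor DENIED record.
# Returns 1 if the line is not an AppArmor denial.
parse_denial() {
  line=$1
  case "$line" in
    *'apparmor="DENIED"'*) ;;
    *) return 1 ;;
  esac
  for key in operation profile name; do
    # Capture the double-quoted value that follows " key=".
    value=$(printf '%s\n' "$line" | sed -n "s/.* $key=\"\([^\"]*\)\".*/\1/p")
    printf '%s=%s\n' "$key" "$value"
  done
}
```

For example, `dmesg | grep 'apparmor="DENIED"' | while IFS= read -r l; do parse_denial "$l"; done` lists every denial (reading the kernel log may require root). For the record above it prints operation=open, profile=snap.docker.dockerd and name=/entrypoint.sh.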

Juerg Haefliger (juergh) wrote :

dmesg from good kernel 5.4.0-29-generic

Juerg Haefliger (juergh) wrote :

dmesg from bad kernel 5.4.0-31-generic

Seth Forshee (sforshee) wrote :

I confirmed that 5.4.0-29 does not show the problem, and -31 does. Then I built -31 with these three patches reverted:

UBUNTU: SAUCE: overlayfs: fix shitfs special-casing
UBUNTU: SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay
UBUNTU: SAUCE: overlayfs: allow with shiftfs as underlay

With those reverted, I still see the problem. It appears that the shiftfs hack for overlayfs was covering up this problem, and restoring the upstream behavior whenever shiftfs is not the underlay revealed it.

Changed in linux (Ubuntu Eoan):
status: Confirmed → In Progress
Changed in linux (Ubuntu Focal):
status: Confirmed → In Progress
description: updated
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Kleber Sacilotto de Souza (kleber-souza) wrote :
Stéphane Graber (stgraber) wrote :

To confirm that this isn't shiftfs-related and that we were just hiding the issue, I ran the same test on openSUSE Tumbleweed.

I chose that distro because it has AppArmor enabled, ships snapd, and runs a 5.4 kernel.

```
localhost:~ # snap install docker
docker 18.09.9 from Canonical* installed
localhost:~ # auth_folder=/var/snap/docker/common/auth
localhost:~ # mkdir -p $auth_folder
localhost:~ # docker run --entrypoint htpasswd registry:2 -Bbn user passwd > $auth_folder/htpasswd
Unable to find image 'registry:2' locally
2: Pulling from library/registry
486039affc0a: Pulling fs layer
ba51a3b098e6: Pulling fs layer
8bb4c43d6c8e: Pulling fs layer
6f5f453e5f2d: Pulling fs layer
42bc10b72f42: Pulling fs layer
6f5f453e5f2d: Waiting
42bc10b72f42: Waiting
ba51a3b098e6: Download complete
486039affc0a: Verifying Checksum
486039affc0a: Download complete
8bb4c43d6c8e: Verifying Checksum
8bb4c43d6c8e: Download complete
6f5f453e5f2d: Verifying Checksum
6f5f453e5f2d: Download complete
42bc10b72f42: Verifying Checksum
42bc10b72f42: Download complete
486039affc0a: Pull complete
ba51a3b098e6: Pull complete
8bb4c43d6c8e: Pull complete
6f5f453e5f2d: Pull complete
42bc10b72f42: Pull complete
Digest: sha256:7d081088e4bfd632a88e3f3bcd9e007ef44a796fddfe3261407a3f9f04abe1e7
Status: Downloaded newer image for registry:2
localhost:~ # docker run -d -p 5000:5000 --restart=always --name registry \
> -v $auth_folder:/auth \
> -e "REGISTRY_AUTH=htpasswd" \
> -e "REGISTRY_AUTH_HTPASSWD_REALM=Registry Realm" \
> -e REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
> registry:2
cba1ec94734a8a198fa0c474d9873233958fad6cdafe93d2ccf4d701ecab55ff
localhost:~ # docker ps
CONTAINER ID   IMAGE        COMMAND                  CREATED         STATUS                                  PORTS   NAMES
cba1ec94734a   registry:2   "/entrypoint.sh /etc…"   5 seconds ago   Restarting (2) Less than a second ago            registry
localhost:~ # uname -a
Linux localhost 5.4.10-1-default #1 SMP Thu Jan 9 15:45:45 UTC 2020 (556a6fe) x86_64 x86_64 x86_64 GNU/Linux
localhost:~ #
```

As you can see, exactly the same thing happens there. So this is either an AppArmor kernel bug or an issue with snapd or the docker snap. It isn't a shiftfs bug, and reverting the change would just expose a different bug rather than actually fix things.

Stéphane Graber (stgraber) wrote :

/var/log/audit.log on SUSE logs the same denial:

    type=AVC msg=audit(1590086639.489:8595): apparmor="DENIED" operation="open" profile="snap.docker.dockerd" name="/entrypoint.sh" pid=5656 comm="entrypoint.sh" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
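The two steps the bot asks testers for (enable -proposed, then set a verification tag) can be sketched in POSIX shell. This assumes the default archive.ubuntu.com mirror and a standard "deb" sources line; the helper names proposed_source_line and verification_tag are hypothetical, and the EnableProposed wiki page remains the authority on the exact setup.

```sh
#!/bin/sh
# Print an apt sources line enabling -proposed for a release codename.
# Assumption: the main archive mirror; adjust for local mirrors or ports.
proposed_source_line() {
  printf 'deb http://archive.ubuntu.com/ubuntu/ %s-proposed main restricted universe multiverse\n' "$1"
}

# Map a test result ("pass" or "fail") to the tag the bot asks for.
verification_tag() {
  result=$1
  series=$2
  case "$result" in
    pass) echo "verification-done-$series" ;;
    fail) echo "verification-failed-$series" ;;
    *) echo "verification-needed-$series" ;;
  esac
}
```

For example, `proposed_source_line eoan | sudo tee /etc/apt/sources.list.d/eoan-proposed.list` followed by `sudo apt update` makes the -proposed kernel visible to apt; after rerunning the reproducer, `verification_tag pass eoan` gives the tag to set.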

tags: added: verification-needed-eoan
Nathan Bryant (nbryant42) wrote :

I noticed something unexpected with the kernel in -proposed: /proc/version_signature reverts the upstream patchlevel to 5.4.34. If there's a mistake and it's really reverting all the upstream SRU patches, I may have a problem.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Juerg Haefliger (juergh) wrote :

It's the same upstream patchlevel as the previous kernel:

$ cat /proc/version_signature
Ubuntu 5.4.0-31.35-generic 5.4.34
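The signature packs the Ubuntu package version, the kernel flavour and the upstream stable release into one line. Assuming the usual Ubuntu <base>-<ABI>.<upload> numbering and a single-word flavour such as "generic", a POSIX shell sketch that splits it apart (the function name is hypothetical):

```sh
#!/bin/sh
# Split /proc/version_signature, e.g. "Ubuntu 5.4.0-31.35-generic 5.4.34".
# Assumes a flavour without hyphens ("generic", not "generic-lpae").
parse_version_signature() {
  set -f            # disable globbing for the unquoted split below
  set -- $1         # $1=Ubuntu  $2=5.4.0-31.35-generic  $3=5.4.34
  set +f
  pkg=${2%-*}       # 5.4.0-31.35 (flavour suffix stripped)
  flavour=${2##*-}  # generic
  base=${pkg%%-*}   # 5.4.0
  abi=${pkg#*-}     # 31.35
  upload=${abi#*.}  # 35
  abi=${abi%%.*}    # 31
  printf 'base=%s abi=%s upload=%s flavour=%s upstream=%s\n' \
    "$base" "$abi" "$upload" "$flavour" "$3"
}
```

Usage: `parse_version_signature "$(cat /proc/version_signature)"`. Because the upstream patchlevel is reported separately from the ABI and upload numbers, a respin can carry higher Ubuntu numbers while still reporting the same upstream version, as described above.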

Nathan Bryant (nbryant42) wrote :

My bad. Looks like 5.4.0-32 never made it out of -proposed.

Kleber Sacilotto de Souza (kleber-souza) wrote :

I could not reproduce the issue with Eoan's kernel currently in -proposed (5.3.0-55.49), setting verification done.

tags: added: verification-done-eoan
removed: verification-needed-eoan
Kleber Sacilotto de Souza (kleber-souza) wrote :

I could not reproduce the issue with Focal's kernel currently in -proposed (5.4.0-33.37), setting verification done.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.3.0-55.49

---------------
linux (5.3.0-55.49) eoan; urgency=medium

  * eoan/linux: 5.3.0-55.49 -proposed tracker (LP: #1879931)

  * Docker registry doesn't stay up and keeps restarting (LP: #1879690)
    - Revert "UBUNTU: SAUCE: overlayfs: fix shitfs special-casing"
    - Revert "UBUNTU: SAUCE: overlayfs: use shiftfs hacks only with shiftfs as
      underlay"

 -- Kleber Sacilotto de Souza <email address hidden> Thu, 21 May 2020 14:20:47 +0200

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-33.37

---------------
linux (5.4.0-33.37) focal; urgency=medium

  * focal/linux: 5.4.0-33.37 -proposed tracker (LP: #1879926)

  * Docker registry doesn't stay up and keeps restarting (LP: #1879690)
    - Revert "UBUNTU: SAUCE: overlayfs: fix shitfs special-casing"
    - Revert "UBUNTU: SAUCE: overlayfs: use shiftfs hacks only with shiftfs as
      underlay"

 -- Kleber Sacilotto de Souza <email address hidden> Thu, 21 May 2020 14:34:26 +0200

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (linux-oracle-5.4/5.4.0-1019.19~18.04.1)

All autopkgtests for the newly accepted linux-oracle-5.4 (5.4.0-1019.19~18.04.1) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

zfs-linux/unknown (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#linux-oracle-5.4

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-42.46

---------------
linux (5.4.0-42.46) focal; urgency=medium

  * focal/linux: 5.4.0-42.46 -proposed tracker (LP: #1887069)

  * linux 4.15.0-109-generic network DoS regression vs -108 (LP: #1886668)
    - SAUCE: Revert "netprio_cgroup: Fix unlimited memory leak of v2 cgroups"

linux (5.4.0-41.45) focal; urgency=medium

  * focal/linux: 5.4.0-41.45 -proposed tracker (LP: #1885855)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * CVE-2019-19642
    - kernel/relay.c: handle alloc_percpu returning NULL in relay_open

  * CVE-2019-16089
    - SAUCE: nbd_genl_status: null check for nla_nest_start

  * CVE-2020-11935
    - aufs: do not call i_readcount_inc()

  * ip_defrag.sh in net from ubuntu_kernel_selftests failed with 5.0 / 5.3 / 5.4
    kernel (LP: #1826848)
    - selftests: net: ip_defrag: ignore EPERM

  * Update lockdown patches (LP: #1884159)
    - SAUCE: acpi: disallow loading configfs acpi tables when locked down

  * seccomp_bpf fails on powerpc (LP: #1885757)
    - SAUCE: selftests/seccomp: fix ptrace tests on powerpc

  * Introduce the new NVIDIA 418-server and 440-server series, and update the
    current NVIDIA drivers (LP: #1881137)
    - [packaging] add signed modules for the 418-server and the 440-server
      flavours

 -- Khalid Elmously <email address hidden> Thu, 09 Jul 2020 19:50:26 -0400

Changed in linux (Ubuntu):
status: Incomplete → Fix Released