Shutdown failure: Assertion 'sd_id128_randomize(&id) >= 0' failed at ../src/core/dbus.c:657, function bus_on_connection(). Aborting.

Bug #1519499 reported by Sven R. Kunze on 2015-11-24
24
This bug affects 4 people
Affects Status Importance Assigned to Milestone
systemd (Ubuntu)
Undecided
Unassigned

Bug Description

This is a follow-up on this issue here: https://github.com/lxc/lxd/issues/1336 I cannot stop a LXD container gently and as it seems the issue lies within ubuntu/systemd which does not handle SIGPWR correctly.

Stéphane Graber (stgraber) wrote :

Can you give some more details on the version of:
 - lxd package on the host
 - liblxc1 package on the host
 - lxcfs package on the host
 - Ubuntu version in the container
 - systemd version in the container

And confirm that:
 - This behavior is reproducible for you with the following steps:
    lxd-images import ubuntu (wily or xenial) --alias ubuntu
    lxc launch ubuntu test
    sleep 10s
    lxc stop test
 - Executing "shutdown" manually inside the container (through lxc exec) does shutdown the container
 - Manually doing "kill -SIGPWR <PID>" where <PID> is the init pid as reported by "lxc info test", also doesn't shutdown the container for you.

With those we should have a better idea of where the bug (seems systemd is most likely) and we'll triage accordingly.

Thanks!

Changed in lxd (Ubuntu):
status: New → Incomplete
Sven R. Kunze (srkunze) wrote :

The requestes versions:

>>> lxc --version
0.22

>>> apt search liblxc1
Sorting... Done
Full Text Search... Done
liblxc1/trusty,now 1.1.5-0ubuntu3~ubuntu14.04.1~ppa1 amd64 [installed,automatic]
  Linux Containers userspace tools (library)

>>> apt search lxcfs
Sorting... Done
Full Text Search... Done
lxcfs/trusty,now 0.12-0ubuntu5~ubuntu14.04.1~ppa1 amd64 [installed]
  FUSE based filesystem for LXC

>>> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 15.10
Release: 15.10
Codename: wily

>>> systemctl --version
systemd 225
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN

Sven R. Kunze (srkunze) wrote :

Confirmations for:

1) lxd-images import ubuntu wily --alias ubuntu; lxc launch ubuntu test; sleep 10s; lxc stop test
Confirmed: It does not shutdown.

2) Executing "shutdown" manually inside the container (through lxc exec) does shutdown the container
Confirmed: It does not shutdown.

3) Manually doing "kill -SIGPWR <PID>" where <PID> is the init pid as reported by "lxc info test", also doesn't shutdown the container for you.
Confirmed: It does not shutdown.

These are the process trees before and after 1), 2) and 3):

BEFORE
22172 ? Ss 0:00 /usr/bin/lxd forkstart test /var/lib/lxd/containers /tmp/lxd_lxc_startconfig_380151542
22191 ? Ss 0:00 \_ /sbin/init
22413 ? Ss 0:00 \_ /lib/systemd/systemd-journald
22857 ? Ss 0:00 \_ dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
22991 ? Ss 0:00 \_ /usr/sbin/irqbalance --pid=/var/run/irqbalance.pid
22992 ? Ss 0:00 \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
23040 ? Ss 0:00 \_ /usr/sbin/cron -f
23092 ? S 0:00 \_ /bin/sh /etc/init.d/ondemand background
23119 ? S 0:00 | \_ sleep 60
23093 ? Ss 0:00 \_ /lib/systemd/systemd-logind
23152 ? Ssl 0:00 \_ /usr/lib/accountsservice/accounts-daemon
23207 ? S 0:00 \_ dnsmasq -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --listen-address 10.0.4.1 --dhcp-range 10.0.4.2,10.0.4.254 --dhcp-lease-max=253 --dhcp-no-override --except-interface=lo --interface=lxcbr0 --dhcp-leasefile=/var/lib/misc/dnsmasq.lxcbr0.leases --dhcp-authoritative
23215 ? Ss 0:00 \_ /sbin/cgproxy
23240 ? Ssl 0:00 \_ /usr/sbin/rsyslogd -n
23288 ? Ss 0:00 \_ /usr/sbin/atd -f
23329 ? Ssl 0:00 \_ /usr/lib/policykit-1/polkitd --no-debug
23633 ? Ss 0:00 \_ /usr/sbin/sshd -D
23751 pts/4 Ss+ 0:00 \_ /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt220

AFTER
22172 ? Ss 0:00 /usr/bin/lxd forkstart test /var/lib/lxd/containers /tmp/lxd_lxc_startconfig_380151542
22191 ? Ss 0:00 \_ /sbin/init
22413 ? Ss 0:00 \_ /lib/systemd/systemd-journald
22992 ? Ss 0:00 \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
23092 ? Z 0:00 \_ [ondemand] <defunct>
24998 ? Zs 0:00 \_ [systemctl] <defunct>

Stéphane Graber (stgraber) wrote :

Thanks, that does indeed seem to point towards a systemd issue, I've added a systemd task, Martin should be able to pick things up from there.

Can you also tell us what version of the kernel you're using? We've seen some weird behavior with systemd on some specific kernel releases in the past.

Also, does "journalctl" in the container after attempting a shutdown print any useful information or error messages?

Thanks!

Changed in lxd (Ubuntu):
status: Incomplete → Triaged
Stéphane Graber (stgraber) wrote :

Just to be clear for any bug triager looking at this. We're keeping a lxd task open just for tracking in case there's something wrong going on with one of our components (cgmanager or lxcfs being the most likely candidate).

Once this is confirmed to be a pure systemd issue, we'll close the lxd task.

Sven R. Kunze (srkunze) wrote :
Download full text (4.7 KiB)

>>> uname -a
Linux mexico 3.13.0-68-generic #111-Ubuntu SMP Fri Nov 6 18:17:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> lxc exec test -- uname -a
Linux test 3.13.0-68-generic #111-Ubuntu SMP Fri Nov 6 18:17:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

journalctl only shows messages related to container startup:

>>> lxc exec test -- journalctl
-- Logs begin at Mon 2015-11-30 21:03:37 UTC, end at Mon 2015-11-30 21:03:53 UTC. --
Nov 30 21:03:37 test systemd-journal[223]: Runtime journal (/run/log/journal/) is currently using 8.0M.
                                           Maximum allowed usage is set to 799.2M.
                                           Leaving at least 1.1G free (of currently available 7.7G of space).
                                           Enforced usage limit is thus 799.2M.
Nov 30 21:03:37 test systemd-journal[223]: Runtime journal (/run/log/journal/) is currently using 8.0M.
                                           Maximum allowed usage is set to 799.2M.
                                           Leaving at least 1.1G free (of currently available 7.7G of space).
                                           Enforced usage limit is thus 799.2M.
Nov 30 21:03:37 test systemd-journal[223]: Journal started
Nov 30 21:03:37 test systemd-remount-fs[108]: /bin/mount for / exited with exit status 1.
Nov 30 21:03:37 test systemd-remount-fs[108]: mount: can't find LABEL=cloudimg-rootfs
Nov 30 21:03:37 test mount[133]: mount: permission denied
Nov 30 21:03:37 test loadkeys[145]: Couldn't get a file descriptor referring to the console
Nov 30 21:03:37 test apparmor[157]: lxc
Nov 30 21:03:37 test apparmor[157]: * Not starting AppArmor in container
Nov 30 21:03:37 test apparmor[157]: ...done.
Nov 30 21:03:37 test systemd[1]: console-setup.service: Main process exited, code=exited, status=1/FAILURE
Nov 30 21:03:37 test systemd[1]: console-setup.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: console-setup.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: Failed to start Set console keymap.
Nov 30 21:03:37 test systemd[1]: console-setup.service: Unit entered failed state.
Nov 30 21:03:37 test systemd-journal[223]: Forwarding to syslog missed 1 messages.
Nov 30 21:03:37 test systemd[1]: console-setup.service: Failed with result 'exit-code'.
Nov 30 21:03:37 test systemd[1]: Started LSB: AppArmor initialization.
Nov 30 21:03:37 test systemd[1]: plymouth-read-write.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: plymouth-read-write.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: plymouth-read-write.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: plymouth-read-write.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: Started Tell Plymouth To Write Out Runtime Data.
Nov 30 21:03:37 test systemd[1]: Started Load/Save Random Seed.
Nov 30 21:03:37 test systemd[1]: dns-clean.service: Failed to kill control group: Invalid argument
Nov 30 21:03:37 test systemd[1]: dns-clean.service: Failed to kill control group...

Read more...

Sven R. Kunze (srkunze) wrote :

> We've seen some weird behavior with systemd on some specific kernel releases in the past.

Not sure if that info helps: I have been installing several different kernels in the mean time; all of them share the same behavior.

Sven R. Kunze (srkunze) wrote :

Also note: confirmed for lxd==0.23

Stéphane Graber (stgraber) wrote :

Hmm, that journal log looks like it could be a lxcfs problem though I don't remember seeing problems on shutdown in the past...
We did have a rather bad bug at startup a little while back though.

Martin Pitt (pitti) wrote :

I'm on current xenial, lxd version is 0.23, which you confirmed had this issue:

$ lxc --version
0.23

I created a wily and xenial container:
$ lxc launch images:ubuntu/wily/amd64 w1
$ lxc launch images:ubuntu/xenial/amd64 x1

"lxc stop" works for both w1 and x1. When I follow (-f) the journal in a container with

$ lxc exec x1 -- journalctl -f

before "stop", it starts with

Dec 01 05:47:43 w1 systemd[1]: Received SIGPWR.
Dec 01 05:47:43 w1 systemd[1]: Failed to reset devices.list on /lxc/w1/system.slice/sigpwr-container-shutdown.service: Permission denied
Dec 01 05:47:43 w1 systemd[1]: Starting Shut down containers on SIGPWR...

and then a lot of shutdown messages. I suppose your journal output in comment #6 was from *after* you did "lxc stop"? If not, please re-send it.

Let's check for you that everything is in place. Can you please copy&paste the output of

$ lxc exec test -- ls -l /lib/systemd/system/sigpwr.target.wants/sigpwr-container-shutdown.service /lib/systemd/system/sigpwr-container-shutdown.service

$ lxc exec test -- systemd-detect-virt

and then *after* running "lxc stop", the output of

$ lxc exec test -- systemctl status -l sigpwr-container-shutdown.service

Thanks!

Changed in systemd (Ubuntu):
status: New → Incomplete
Sven R. Kunze (srkunze) wrote :

Thanks for looking at this, Martin.

Minor correction: I work with Ubuntu 14.04. Containers are Wily.

> I suppose your journal output in comment #6 was from *after* you did "lxc stop"?

Correct, it was *after* "lxc stop" (resp. all the other variants)

>>> lxc exec test -- ls -l /lib/systemd/system/sigpwr.target.wants/sigpwr-container-shutdown.service /lib/systemd/system/sigpwr-container-shutdown.service
-rw-r--r-- 1 root root 153 Oct 15 11:34 /lib/systemd/system/sigpwr-container-shutdown.service
lrwxrwxrwx 1 root root 36 Oct 15 12:02 /lib/systemd/system/sigpwr.target.wants/sigpwr-container-shutdown.service -> ../sigpwr-container-shutdown.service

>>> lxc exec test -- systemd-detect-virt
lxc

>>> lxc stop test
^C>>>lxc exec test -- systemctl status -l sigpwr-container-shutdown.service
^C

The last call basically hangs (also without the -l option).

Martin Pitt (pitti) wrote :

Sven, can you please run it this way around:

lxc exec test -- systemd-analyze set-log-level debug
lxc exec test -- journalctl -f 2>&1 | tee /tmp/journal.txt

and then (in another terminal) "lxc stop test". This should trigger a lot of log output. Please attach /tmp/journal.txt after it settled down.

Sven R. Kunze (srkunze) wrote :

Done.

Martin Pitt (pitti) wrote :

Ah, thanks! This finally revealed something interesting. So pretty much everything is shut down already, but the last step triggers an assertion:

Dec 01 17:23:49 test systemd[1600]: systemd-poweroff.service: Executing: /bin/systemctl --force poweroff
Dec 01 17:23:49 test systemd[1]: Assertion 'sd_id128_randomize(&id) >= 0' failed at ../src/core/dbus.c:657, function bus_on_connection(). Aborting.
Dec 01 17:23:50 test systemd[1]: Caught <ABRT>, dumped core as pid 1612.
Dec 01 17:23:50 test systemd[1]: Freezing execution.

Changed in systemd (Ubuntu):
status: Incomplete → New
summary: - [systemd] SIGPWR doesn't always lead to shutdown
+ Shutdown failure: Assertion 'sd_id128_randomize(&id) >= 0' failed at
+ ../src/core/dbus.c:657, function bus_on_connection(). Aborting.
Sven R. Kunze (srkunze) wrote :

Nice. Is there something else I can do here?

Martin Pitt (pitti) wrote :

I first need to check some background what this is trying to do and why it could fail, as I can't reproduce that. I'll followup here with further questions. For now, if you don't mind, could you try upgrading the container to https://launchpad.net/~pitti/+archive/ubuntu/systemd ? That's the latest (tested) systemd upstream snapshot from today, perhaps it already fixes that.

Sven R. Kunze (srkunze) wrote :

Tested. However, it does not fix the issue.

Martin Pitt (pitti) wrote :

This was discussed some time ago in https://github.com/systemd/systemd/pull/226 . So the issue is that /dev/urandom is a bind mount (from the "outside" lxc/lxd), and systemd stops it before shutdown (can be seen nicely in the log).

So it's clear why the assertion happens. It's less clear why it's not reproducible for me, and how to fix this properly, but I'll discuss that with upstream.

Stéphane Graber (stgraber) wrote :

As a reminder:
 - http://lists.freedesktop.org/archives/systemd-devel/2015-January/027246.html
 - http://lists.freedesktop.org/archives/systemd-devel/2015-January/027323.html

http://cgit.freedesktop.org/systemd/systemd/commit/?id=874d3404cbf2363604106c8f86683db4082691ea

The code change at that point was to ignore all mounts under /dev, /proc and /sys, maybe things changed since and re-introduced the issue somehow, or we missed a code path somewhere.

no longer affects: lxd (Ubuntu)
Stéphane Graber (stgraber) wrote :

Also note that I'm removing the lxd task as this does appear to be a systemd issue, ~ubuntu-lxc will remain subscribed to the issue though.

Sven R. Kunze (srkunze) wrote :

@Martin I'd be glad to assist if you require more information to make it reproducible.

Sven R. Kunze (srkunze) wrote :

@Martin Is there some update on this?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Hans Deragon (deragon) wrote :

I too cannot stop a container, but it is running Ubuntu Xenial Xerus 16.04 LTS. I cannot tell if the root cause is the same as reported by this bug report though; I am not familiar with systemd.

I created a few days ago a new container on an updated Ubuntu Trusty Thar 14.04 LTS host using the following command:

 lxc launch images:ubuntu/xenial/amd64 server1

Now, the following command hangs/freezes:

 lxc stop server1

I could assist in debugging, if required.

Sven R. Kunze (srkunze) wrote :

@Hans

To confirm that it's the same issue, you might replay the instructions Martin gave at #12: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1519499/comments/12

It should reveal the triggered assertion.

Hans Deragon (deragon) wrote :

Looks like it is the same problem.

Sven R. Kunze (srkunze) wrote :

Indeed it does. Glad to see the bug persists even on 16.04 and its not just a mal-configuration on my side.

Hans Deragon (deragon) wrote :

Shouldn't this bug be assigned to some LXC/LXD task? After all, a normal installation of the distributions on a computer works well. 16.04 shutsdown properly on a laptop, for example; systemd simply does not hang. Thus this bug is somewhat related to LXC/LXD. Maybe the bare minimum images are at fault?

Martin Pitt (pitti) wrote :

This is likely fixed in current systemd versions already, but the recent commit https://github.com/systemd/systemd/commit/ad2706db7cce should fix the remaining traces of this.

Current systemd package in https://launchpad.net/~pitti/+archive/ubuntu/systemd contains this patch, if you want to give it a try on Ubuntu 16.10.

Changed in systemd (Ubuntu):
status: Confirmed → Fix Committed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers