systemd-udev fails when processing many logical volumes on boot

Bug #1185394 reported by Stefan Bader on 2013-05-29
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
systemd (Ubuntu)
High
Martin Pitt

Bug Description

With recent systemd/udev updates in Saucy I see problems when /dev gets moved to the real root. It seems to be related to running vgscan/vgchange through watershed by systemd-udevd. The symptoms include a crash of one systemd-udevd process and several messages about systemd-udevd (or childs of it) not finding files in /dev.

Setup:

sda: contains the root fs (no separate /boot) and a swap partition
sdb: contains vg1 with /home
sdc and sdd: contain vg2 with 31 LVs, one of them mounted to /home/isos

Stefan Bader (smb) wrote :

Debugging ideas (from discussion with pitti on IRC):

(1) add a ps | grep ... wait loop into the hook until really all udev processes are gone that would confirm that the child processes actually run into the /dev issue
(2) some ulimit -c unlimited and copying core dumps to /run/udev/
(3) move 60-persistent-storage-dm.rules from IMPORT{program}="/sbin/blkid -o ... to IMPORT{builtin}="blkid --offset..., to see whether it's the external program which causes trouble
and (4) it would be helpful if you could confirm that you have udev processes running after the udevadm control --exit (with http://paste.ubuntu.com/5713055/)

Before I forget everything over the long weekend, some braindump:

I tried to reproduce this with today's server iso install in a VM. I
used two virtio disks the first one (sda) with a 5 GB root and two
PVs, the second one (sdb) with two more PVs. I built two VGs with 8
LVs in total, putting /home on VG1/LV1, /home/iso on VG2/LV2 (trying
to replicate Stefan's setup), and mounted the other LVs to places like
/data1. This has booted fine 10 out of 10 times.

I tried to add a sleep to 85-lvm2.rules to simulate a long-running
vgchange, but it still works fine. I modified udev's initramfs-bottom
script to show processes before (got some 10 udev workers) and after
(no udev processes left) the udevadm control --exit, so this also
works.

The conjecture is that due to our 0024-avoid-exit-deadlock-for-dm_cookie.patch
the workers for the LVs are not stopped by the control --exit and thus
race with the following mount --move of /dev to /root/dev. I could
never reproduce this in my VM, thus this is what I would like to find
out from Stefan with (4) and (1). Our udev 175 initramfs script
behaved in pretty much the same way, but we should compare the outputs
of (1) and (4) between udev 175 and 202.

I'd also like to look at the core dump to see where it actually
crashes. I guess something tries to open /dev/null and does not expect
a failure, and thus runs into a NULL FILE*, but it would be nice to
confirm this.

I do not have a good idea why /dev/null is missing. Right after the
mount --move we add a compat symlink from /dev/ to /root/dev, so that
udev workers and their callouts should in theory even be able to
continue after the mount --move. We did not have this compat symlink
in udev 175; in theory this should make things worse, but just to be
sure to cover all bases, here's:

  (5) comment out the last line (ln -s) in
      /usr/share/initramfs-tools/scripts/init-bottom/udev

(plus "sudo update-initramfs -u"), and see whether that makes a
difference. In theory it should lead to lost events and more errors
like "/dev/null missing", but perhaps we are missing some weird check
in the lvm2/dmsetup scripts/rules.

Thanks,

Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)

Stefan Bader (smb) wrote :

(1) Struggeling a bit with weirdly different behaviour between boots as Xen host and "normal" boots. First there was "debug" added for the Xen boot. This should only cause kernel messages of lower priority to be sent to the console, but it seems to have the effect of not showing any output of the initramfs hooks.

Second, for some reason a normal boot will take quite a while between udevadm control --exit starting and finishing. But when it finishes, no processes with udevd in they name remain. Booting in Xen mode there is one systemd-udevd crashing immediatly after calling udevadm control --exit and that ends the udevadm command (so it does not wait). After that it takes another 25s until the remaining systemd-udevd processes. That would explain that those missing /dev/null messages cannot be reproduced without booting in Xen mode.

In both cases it seems that at least one (in the current case 2) instance(s) of "sh -c vgscan; vgchange -ay" are still running (and the actually stuck part seems to be vgchange). And that seems to trigger (in both cases after udevadm control --exit has returned) the messages about the timeout when trying to kill (one of?) these. That results in some of the LVs not being properly set up. So far the /dev/dm-X entry was present (dmsetup ls also shows them) but the symlinks from /dev/mapper/ and /dev/<vgname>/ pointing back to /dev/dm-X are missing. And that can result in mountall waiting. Which in turn seems to fail showing or accepting anything to skip when I redirect the console to a serial device (if I don't it is hard to look at old messages and secondly I usually prefer to use the serial over lan to interact with that server).

Stefan Bader (smb) wrote :

I think the long report for (1) also covers question (4).

Stefan Bader (smb) wrote :

(1) Addition, seems I managed to see one rare-time segfault on a normal boot (but only one so far on many reboots).

(3) Ok, finally I got it! Updated udev/libudev to 202-0ubuntu10(amd64) to be able to have matching ddebs and disabled mounting /home and /home/isos as otherwise the plymouth bug prevented me to access /run after the failure...

Core was generated by `/lib/systemd/systemd-udevd --daemon --resolve-names=never'.
Program terminated with signal 11, Segmentation fault.
#0 udev_monitor_allow_unicast_sender (
    udev_monitor=udev_monitor@entry=0x6e0110, sender=0x0)
    at src/libudev/libudev-monitor.c:311
311 src/libudev/libudev-monitor.c: No such file or directory.
(gdb) where
#0 udev_monitor_allow_unicast_sender (
    udev_monitor=udev_monitor@entry=0x6e0110, sender=0x0)
    at src/libudev/libudev-monitor.c:311
#1 0x00000000004063fe in worker_new (event=0x72b5e0) at src/udev/udevd.c:210
#2 event_run (event=0x72b5e0) at src/udev/udevd.c:437
#3 event_queue_start (
    udev=<error reading variable: can't compute CFA for this frame>)
    at src/udev/udevd.c:586
#4 main (argc=<optimized out>, argv=<optimized out>) at src/udev/udevd.c:1448

Stefan Bader (smb) wrote :
Stefan Bader (smb) wrote :

Also a braindump it gets late. Tried to figure out what changed between the udev-175 udevd and the new one. Taking away the whitespace changes. Most things are log function renames. Thought there is a hunk missing that would copy dev entries but that might have been old support code for tmpfs /dev.
There is also a related change missing from us which was handling events with timeout as urgently as dm cookie ones. And it looks a bit as the number of workers got reduced from min 128 to 8 + <nr cpus> * 2.

Stefan Bader (smb) wrote :

So summarizing: Maybe two problems:

1. Boot as Xen host, higher chance of having the segfault resulting in udevadm returning before workers have ended and that yields /dev/null not found.

2. Some problem in the sequence leading to missed events. Could be that vgchange running after udevd terminates. Any events created there probably are lost...

Stefan Bader [2013-05-29 17:32 -0000]:
> There is also a related change missing from us which was handling
> events with timeout as urgently as dm cookie ones.

Right, this was our old avoid-exit-deadlock-for-timely-events patch:
That was applied for http://pad.lv/842560. Andy discussed that with
upstream in http://comments.gmane.org/gmane.linux.hotplug.devel/17206
and as a result the whole TIMEOUT legacy stuff has been removed, so
this does not apply any more. Also, firmware handling has changed
quite a bit (e. g. recent kernels load it by themselves). So I dropped
the patch.

Martin Pitt (pitti) wrote :

Thanks for your investigations!

Stefan Bader [2013-05-29 13:42 -0000]:
> Booting in Xen mode there is
> one systemd-udevd crashing immediatly after calling udevadm control
> --exit and that ends the udevadm command (so it does not wait).

That indeed sounds like the most plausible cause for now, and in any
way udevd should not crash; let's fix the crash first, and then see
how much further this gets us.

Thanks for the core file! I can load it fine with current
202-0ubuntu11 version, and get the same bt as you. The actual crash
seems obvious:

  #0 udev_monitor_allow_unicast_sender udev_monitor=udev_monitor@entry=0x6e0110, sender=0x0)

with

  udev_monitor->snl_trusted_sender.nl.nl_pid = sender->snl.nl.nl_pid;

The caller does

        if (worker_monitor == NULL)
                return;
        /* allow the main daemon netlink address to send devices to the worker */
        udev_monitor_allow_unicast_sender(worker_monitor, monitor);

I. e. "monitor" is not checked for being NULL here; but it is (see
frame #0), thus causing the crash.

The global monitor gets initialized in main(), and after the
initialization there is a NULL check. However, further down in main()
it is unreffed and set to NULL if "udev_exit" becomes true, i. e. the
worker got a SIGTERM. As far as I can see, this is incompatible with
our 0024-avoid-exit-deadlock-for-dm_cookie.patch patch which keeps
processing events even after an exit has been requested; in that case,
we must not access monitor any more.

Martin Pitt (pitti) wrote :

Could you please try 202-0ubuntu11pitti1 from https://launchpad.net/~pitti/+archive/ppa ? (It's the only package for saucy, thus safe for dist-upgrading).

Stefan Bader (smb) wrote :

I have done about 5 reboots with the updated packages, and I can confirm the crash on a Xen boot to be gone. Interestingly now the init-bottom is taking a while as expected (maybe 5s) and all of the LVs on the VG with the large number of LVs are correctly set up (having the additional links). Just the other VG seems to be missing completely all the times.

Stefan Bader (smb) wrote :

Now done another 5 reboots with the udev script in init-bottom doing a "udevadm settle --timeout=121" before the exit. The delay before init-bottom finishes seem the same (or even a little bit quicker) than before, a ps after the exit has no udevd or vgscan/vgchange processes running (with only udevadm exit there were multiple of those with the watershed parent gone), and all logical volumes are set up correctly.

Martin Pitt (pitti) wrote :

Stefan and I debugged and discussed that. The new udev has a reduced number of parallel workers now (performance optimization, see Harald Hoyer's measurements), which potentially leads to more serialization for a huge number of events (as we get in initramfs by calling udevadm trigger). This is one of the relevant changes between 175 and 202, as Stefan pointed out in comment 7.

Our 0024-avoid-exit-deadlock-for-dm_cookie.patch only keeps those workers running which process an uevent with DM_COOKIE. But that doesn't catch the majority of the LVM related events (only the kernel-generated change events on LVs after the virtual->physical mapping has been set up). All the add events, as well as the change events for the other block devices would still go to nirvana (i. e. the time between shutting down udev in initramfs and bringing it back up in the real system).

We tried to trigger a change event for the missing LVs, but that does not work as "vgchange -a y" already has all the /dev/dm-* set up, sees no change, and thus does not cause a proper change event for the LV to be generated. Hence the symlinks aren't created by our /etc/init/udevtrigger.conf job. What does work is to "dmsetup remove" the affected LVs and then call vgchange to re-add them, but finding the affected LVs and doing that for all of them would duplicate a lot of the work which is already happening in initramfs anyway, and thus be rather expensive. It seems much more efficient to me to just do that detection once, and just wait in initramfs until all LVs are complete.

http://people.canonical.com/~pitti/tmp/systemd/ (202-0ubuntu11pitti2) does that approach now: It "udevadm settle"s in udev's initramfs-bottom script IF lvm2 is installed (we could also check for /sys/block/dm-*, but that's racy; we rather err on the side of doing an unnecessary settle), and disables 0024-avoid-exit-deadlock-for-dm_cookie.patch as the settle achieves the same goal in a more complete way.

Martin Pitt (pitti) on 2013-06-03
Changed in systemd (Ubuntu):
assignee: Stefan Bader (stefan-bader-canonical) → Martin Pitt (pitti)
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

Updated to 202-0ubuntu11pitti2 (which dropped the dm cookie patch but added a settle before exit in init-bottom/udev) and re-ran the reboot tests. 5-6 times to Xen and once to non-Xen. All runs did set up all LVs. Time waiting in settle/exit seems to be not longer (but maybe even quicker) than before. Not related to this bug, but before there were (even in older releases / Q and R) often messages about udev failing to do an inotify_add_watch, which I cannot see at all with the settle. And on reboot I saw that terminating processes seemed to not produce the "ended after x s" message and the killing remaining processes failed. This seems to be gone, too.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 202-0ubuntu12

---------------
systemd (202-0ubuntu12) saucy; urgency=low

  * Rename debian/tests/localed to locale-locale, as we are going to add more
    tests for localed.
  * Add debian/tests/localed-x11-keymap: Test "localectl set-x11-keymap".
  * Put back 0010-Add-back-support-for-Debian-specific-config-files.patch to
    re-fix timedated, and add Steve's fix for /etc/default/keymap. Re-drop
    0025-Adjust-localed-to-use-the-Debian-locations-of-etc-de.patch.
  * debian/extra/initramfs.bottom: We cannot properly synthesize LVM LV change
    events with udevadm trigger during boot, so if we use LVM, we need to let
    udev finish all the pending events in the initramfs. The
    0024-avoid-exit-deadlock-for-dm_cookie.patch approach was
    inherently racy, as it still left vgchange subprocesses running from
    watershed (which then run into a non-existing /dev), so disable that
    patch. (LP: #1185394)
 -- Martin Pitt <email address hidden> Mon, 03 Jun 2013 14:41:20 +0200

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers