ISST-SAN:KVM:R3-0: LVM udev rules deadlock with large number of PVs

Bug #1560710 reported by bugproxy on 2016-03-22
Affects              Status        Importance  Assigned to    Milestone
lvm2 (Ubuntu)        Fix Released  Undecided   Martin Pitt
watershed (Ubuntu)   Won't Fix     Undecided   Unassigned

Bug Description

Original problem statement:

Today I reinstalled lucky03. The installation went fine, but after the post-install reboot I was unable to get the login prompt.

== Comment: #3 - Kevin W. Rudd - 2016-03-14 18:49:02 ==
It looks like this might be related to bug 124628. I was able to get to a login prompt by adding the following to the boot options:

udev.children-max=500

Lekshmi,

Can you provide additional information on how you did the install for this lpar? It would be nice to replicate the exact install sequence from the beginning in order to try to capture some additional debug information.

== Comment: #18 - Mauricio Faria De Oliveira - 2016-03-22 16:59:40 ==
It's possible to reproduce this on a qemu-kvm guest w/ emulated hard disks (i.e., image/file-backed disks).

Configuration:
- 1 disk w/ 16.04 (no LVM required)
- 50 disks (w/ LVM volumes)

# ps l
...
S 0 7014 145 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7015 144 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7016 140 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7017 139 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7018 142 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7019 7014 3520 1280 0:0 20:53 00:00:00 sh -c /sbin/lvm vgscan; /
S 0 7020 137 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7021 143 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7023 136 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7024 141 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7025 138 3008 1216 0:0 20:53 00:00:00 /lib/udev/watershed sh -c
S 0 7026 7019 10560 9344 0:0 20:53 00:00:00 /sbin/lvm vgchange -a y
...

# cat /proc/7014/stack
[<c0000000034e3b20>] 0xc0000000034e3b20
[<c000000000015ce8>] __switch_to+0x1f8/0x350
[<c0000000000bc3ac>] do_wait+0x22c/0x2d0
[<c0000000000bd998>] SyS_wait4+0xa8/0x140
[<c000000000009204>] system_call+0x38/0xb4

# cat /proc/7019/stack
[<c0000000031cfb20>] 0xc0000000031cfb20
[<c000000000015ce8>] __switch_to+0x1f8/0x350
[<c0000000000bc3ac>] do_wait+0x22c/0x2d0
[<c0000000000bd998>] SyS_wait4+0xa8/0x140
[<c000000000009204>] system_call+0x38/0xb4

# cat /proc/7026/stack
[<c0000000031aba80>] 0xc0000000031aba80
[<c000000000015ce8>] __switch_to+0x1f8/0x350
[<c000000000463160>] SyS_semtimedop+0x810/0x9f0
[<c0000000004661d4>] SyS_ipc+0x154/0x3c0
[<c000000000009204>] system_call+0x38/0xb4

# dmsetup udevcookies
Cookie Semid Value Last semop time Last change time
0xd4d888a 0 1 Tue Mar 22 20:53:55 2016 Tue Mar 22 20:53:55 2016

== Comment: #19 - Mauricio Faria De Oliveira - 2016-03-22 17:00:13 ==
Command to create the LVM volumes on initramfs:

# for sd in /sys/block/sd*; do sd="$(basename $sd)"; [ "$sd" = 'sda' ] && continue; lvm pvcreate /dev/$sd; lvm vgcreate vg-$sd /dev/$sd; lvm lvcreate --size 1000m --name lv-$sd vg-$sd; done

# lvm vgdisplay | grep -c 'VG Name'
50

== Comment: #20 - Mauricio Faria De Oliveira - 2016-03-22 17:57:50 ==
Hm, got a better picture of this:

The problem doesn't seem to be a synchronization issue.
I've learned a bit more about the udev events/cookies for sdX and lvm volumes.

1) The sdX add events happen, for multiple devices.
    Each device consumes 1 udev worker.

2) The sdX add events run 'watershed .. vgchange -a y...'.
    If this detects an LVM volume in sdX, it will try to activate it, and then block waiting for the respective LVM/udev cookie to complete (i.e., wait for the add event of the resulting dm-X device to finish).

3) The dm-X device add event is fired from the kernel.

4) There are no available udev workers to process it.
    The event processing remains queued.
    Thus, the cookie will not be released.

5) No udev workers from the sdX devices will finish, since all are waiting for cookies to complete, which in turn requires available udev workers.

== Comment: #21 - Mauricio Faria De Oliveira - 2016-03-22 18:02:11 ==
Got a confirmation of the previous hypothesis.

Added the --noudevsync argument to the vgchange command in the initramfs's /lib/udev/rules.d/85-lvm2.rules.
This causes vgchange not to wait for a udev cookie.

Things didn't block, actually finished quite fast.
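For reference, a hedged reconstruction of that change (the exact RUN line in the initramfs copy of 85-lvm2.rules may differ; the command itself matches the ps output and the vgchange invocation quoted above):

```
# /lib/udev/rules.d/85-lvm2.rules (reconstructed, not verbatim)
# before:
#   RUN+="watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y'"
# after -- vgchange no longer blocks on the udev cookie:
#   RUN+="watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y --noudevsync'"
```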

== Comment: #23 - Mauricio Faria De Oliveira - 2016-03-22 18:14:41 ==
Hi Canonical,

May you please help in finding a solution to this problem?
If I recall correctly, @pitti works w/ udev and early boot in general.

The problem summary (from previous comments) is:

1) There are more SCSI disks w/ LVM volumes present than the maximum number of udev workers. Each disk consumes one udev worker.

2) When the add uevent from each disk runs 85-lvm2.rules, the call to 'vgchange -a y' will detect LVM volume(s) and activate them. This fires an add uevent for a dm-X device from the kernel. And vgchange blocks waiting for the respective udev cookie to be completed.

3) The add uevent for dm-X has no udev worker to run on (all are taken by the SCSI disks, which are blocked on calls to vgchange, or on watershed, which is in turn waiting for one vgchange to finish), and thus the udev cookie related to dm-X will not be completed.

4) If that cookie is not completed, that vgchange won't finish either.

It's a deadlock, afaik.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-138931 severity-critical targetmilestone-inin1604

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Kevin W. Rudd (kevinr) on 2016-03-22
affects: ubuntu → udev (Ubuntu)

Indeed - Martin, can you have a look at this bug? As I recall, our udev+lvm2 handling diverges still from upstream's model. This doesn't seem like something we want to change the month before LTS, but a workaround might be in order...

Changed in udev (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Martin Pitt (pitti)

Thanks, Steve.

One workaround available is to increase the number of udev workers with the 'udev.children-max=<number of disks + something>' option in the kernel cmdline.
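Concretely, on a GRUB-based system that workaround can be applied as follows (64 is only an example value and "quiet splash" stands in for whatever options are already configured; use the number of disks plus some headroom):

```
# /etc/default/grub -- append the option to the existing kernel cmdline, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash udev.children-max=64"

# then regenerate the boot config and reboot:
#   update-grub
```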

The timeframe for bigger changes is indeed not the best; but beyond a workaround, perhaps there is a more conservative solution available, from someone skilled in this area.

For example, I had the impression the default number of max udev workers decreased from 40 on previous releases to 24 now (or recently). Maybe some other changes got in that might be related too.

Indeed this changed two years ago (https://github.com/systemd/systemd/commit/8cc3f8c0bcd) after some measurements which number of parallel workers gives the best throughput: https://plus.google.com/+HaraldHoyer/posts/eRJFhjLbpta

But regardless of how many you allow, with that many LVs it is still likely that all workers get used up. The root problem is indeed our LVM udev rule, which blocks; that is a big no-no. Perhaps that's the reason why it has never been accepted upstream or in Debian.

I'm not that familiar with LVM, but I think Scott's big selling point back then was that with udev rules you'd get proper LVM detection and buildup for hotpluggable devices. But that was some 10 years ago, maybe this is fixed with upstream's current rules now.

I agree that trying to completely change the structure of this at this point in xenial is rather risky, so I'll first see if we can tone down that udev rule. There is absolutely no point in running that many watershed instances in parallel, as they all do exactly the same thing. I actually thought the whole point of watershed was to prevent this, so maybe it's watershed itself that is broken.

Martin Pitt (pitti) wrote :

Adam, Steve, and I discussed this on IRC, and we agree that it's best to fix watershed for this. It's not helpful that all those instances stay around indefinitely, each blocking a udev worker. It would be better if every new instance just refreshed a time stamp in the /run/watershed/ data and exited immediately, while the first instance keeps on waiting as long as that stamp is being refreshed.
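A minimal shell sketch of that idea, using flock(1) and a stamp file as stand-ins for watershed's real /run/watershed/ bookkeeping (the names STATE_DIR and run_collapsed are illustrative, not watershed's actual interface):

```shell
#!/bin/sh
# Sketch: collapse N concurrent requests into one running instance.
# Each caller refreshes a stamp and tries to become the "master";
# losers exit immediately instead of blocking a udev worker.
STATE_DIR="${TMPDIR:-/tmp}/watershed-sketch.$$"
mkdir -p "$STATE_DIR"

run_collapsed() {
    # Record this request first...
    date +%s%N > "$STATE_DIR/stamp"
    (
        # ...then only the lock winner runs the command.
        flock -n 9 || exit 0
        while :; do
            before=$(cat "$STATE_DIR/stamp")
            "$@"
            # Re-run if another instance refreshed the stamp meanwhile.
            [ "$(cat "$STATE_DIR/stamp")" = "$before" ] && break
        done
    ) 9> "$STATE_DIR/lock"
}

run_collapsed echo "command ran"
```

Note that this sketch still has exactly the race discussed in a later comment: an instance can refresh the stamp after the master's final check yet fail the non-blocking lock before the master exits, so that request is dropped; a semaphore would close this window atomically.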

affects: udev (Ubuntu) → watershed (Ubuntu)
Changed in watershed (Ubuntu):
status: New → Triaged
Steve Langasek (vorlon) wrote :

This would be an API change to watershed, so should be handled as a new non-default option.

The current watershed code does not readily lend itself to being changed in this way, but I still think it's the best solution.

Martin Pitt (pitti) wrote :

For the record: I added some debug logging and an automatic test to bzr.

Martin Pitt (pitti) on 2016-03-29
summary: - ISST-SAN:KVM:R3-0:Unable to get the login prompt after reinstallation of
- Ubuntu16.04
+ ISST-SAN:KVM:R3-0: LVM udev rules deadlock with large number of PVs
Changed in lvm2 (Ubuntu):
status: New → Triaged
Martin Pitt (pitti) on 2016-03-29
Changed in watershed (Ubuntu):
status: Triaged → In Progress
Martin Pitt (pitti) wrote :

I thought about such a "master/slave" implementation in watershed for a while, and it is a bit tricky. The "slaves" need to communicate their desire to run the command to the master, and both the master and the slaves need locking to ensure that only one instance becomes the master and runs the command. However, the acts of "(try to) acquire the lock" and "refresh the desire to run the command" need to happen atomically, otherwise race conditions occur. E.g., in a slave instance some time might pass between failing to acquire the lock and refreshing the stamp; if the master finishes in that window, the last request is lost.

The standard tool for such a race-free, atomic counter that is simultaneously a lock is a semaphore. However, both named and unnamed POSIX semaphores rely on /dev/shm/, and we cannot rely on that in udev rules (there is no /dev/shm in the initrd). We can use semaphores if we can assert that the udev rule is not crucial during early boot. I *think* it should be okay, as the rule gets re-exercised after pivoting, once everything is mounted.

I wanted to compare the hotplug behaviour under Debian and Ubuntu. These commands can be run in a minimal VM:

   apt-get install -y lvm2
   reboot # lvm daemons don't seem to start right after installation; get a clean slate

   modprobe scsi_debug
   pvcreate /dev/sda
   vgcreate testvg /dev/sda
   lvcreate -L 4MB testvg

Now we have one PV, VG, and LV each, and a usable block device:

   lrwxrwxrwx 1 root root 7 Mar 30 08:01 /dev/testvg/lvol0 -> ../dm-0

Let's hot-remove the device. This does not automatically clean up the mapped device, so do this manually:

   echo 1 > /sys/block/sda/device/delete
   dmsetup remove /dev/testvg/lvol0

Now hotplug back the block device:

   echo '0 0 0' > /sys/class/scsi_host/host2/scan

Under *both* Debian and Ubuntu this correctly brings up the PV, VG, and LV, and /dev/testvg/lvol0 exists again. I can even remove our udev rule 85-lvm2.rules, update the initrd, reboot, and run the above test.

Thus it seems our Ubuntu specific udev rule is entirely obsolete. Indeed these days /lib/udev/rules.d/69-lvm-metad.rules (which calls pvscan --cache --activate) and lvmetad seem to be responsible for that, see /usr/share/doc/lvm2/udev_assembly.txt. So it seems we are now just doing extra work for no benefit.

I also noticed this in our Ubuntu delta description:

        - do not install activation systemd generator for lvm2, since udev starts LVM.

The activation generator is relevant if the admin disabled lvmetad, then the generator builds up the VGs at boot time. It's a no-op if lvmetad is enabled. We should put that back to match the current documentation and reduce our delta.

Martin Pitt (pitti) on 2016-03-30
Changed in lvm2 (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
Martin Pitt (pitti) wrote :

This is the lvm2 debdiff that I propose for xenial. I also uploaded it to https://launchpad.net/~pitti/+archive/ubuntu/ppa . I tested this lightly in a VM, but I'll now do a full LVM2 install with that version.

Stefan Bader (smb) wrote :

I did some quick tests with the PPA version on a system with 39 LVs. Rebooted 5 times (3x Xen dom0 mode and 2x normal Linux mode) and did not notice any regression (all 39 LVs had links in /dev/<vgname>/... and no obvious deadlock).

Martin Pitt (pitti) wrote :

I also tested this with a full Ubiquity LVM (on cryptsetup) install and several reboots. Thanks Stefan!

Changed in watershed (Ubuntu):
status: In Progress → Triaged
Changed in lvm2 (Ubuntu):
status: Triaged → In Progress
tags: added: patch
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lvm2 - 2.02.133-1ubuntu8

---------------
lvm2 (2.02.133-1ubuntu8) xenial; urgency=medium

  * Drop debian/85-lvm2.rules. This is redundant now, VGs are already
    auto-assembled via lvmetad and 69-lvm-metad.rules. This gets rid of using
    watershed, which causes deadlocks due to blocking udev rule processing.
    (LP: #1560710)
  * debian/rules: Put back initramfs-tools script to ensure that the root and
    resume devices are activated (lvmetad is not yet running in the initrd).
  * debian/rules: Put back activation systemd generator, to assemble LVs in
    case the admin disabled lvmetad.
  * Make debian/initramfs-tools/lvm2/scripts/init-premount/lvm2 executable and
    remove spurious chmod +x Ubuntu delta in debian/rules.

 -- Martin Pitt <email address hidden> Wed, 30 Mar 2016 10:56:49 +0200

Changed in lvm2 (Ubuntu):
status: In Progress → Fix Released
Martin Pitt (pitti) on 2016-04-01
Changed in watershed (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
Steve Langasek (vorlon) on 2016-04-06
Changed in watershed (Ubuntu):
status: Triaged → Won't Fix