Serialization problem of udev events with DM_COOKIE set

Bug #1017715 reported by Herton R. Krzesinski on 2012-06-25
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
udev (Ubuntu)
Medium
Herton R. Krzesinski
Oneiric
Medium
Unassigned
Precise
Medium
Unassigned

Bug Description

Alex Lyakas reports:

Hi Herton,

we are doing initial testing of our software on stock ubuntu-precise
3.2.0-23 - 36. It looks like we have identified a problem with the udevd
code. Here are some details:
We believe the problem was introduced by a fix to
https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/802626. The fix
introduced code like this:
in event_queue_insert():
/* run all events with a timeout set immediately, or in the case
 * it's a dm_cookie event being processed */
if (udev_device_get_timeout(dev) > 0 ||
    udev_device_get_dm_cookie_set(dev)) {
    event_run(event, true);
    return 0;
}
Basically, what this fix does, if the udev event has a DM_COOKIE, then
this event is dispatched immediately. Previously, this event was going
through the event_queue_start(), which was checking that there are no
in-flight events for the same device (or parent, or child etc.) by
calling is_devpath_busy(). With the fix, this doesn’t happen and event
is dispatched immediately.
What we see that when a new device-mapper device is created in the
system, an “add” event is fired, then “change” event comes in (that has
a DM_COOKIE). As a result of the fix, the “change” processing is not
delayed until the “add” processing completes. As a result, both events
are handled concurrently by two different udev-workers. This causes
several different unwanted effects that we observed.
One such effect is in the attached log: dm-16 device is being created.
“add” and “change” events are fired:
May 24 20:26:37 vc-00-00-A-dev udevd[5859]: seq 100141 queued, 'add'
'block'
May 24 20:26:37 vc-00-00-A-dev udevd[5859]: seq 100142 queued, 'change'
'block'
May 24 20:26:37 vc-00-00-A-dev udevd[26633]: seq 100142 running
The run in parallel and “change” processing starts first. Then “add”
processing starts:
May 24 20:26:37 vc-00-00-A-dev udevd[26544]: seq 100141 running
“change” processing creates symbolic links
May 24 20:26:37 vc-00-00-A-dev udevd[26633]: creating link '/dev/disk
/by-id/dm-name-vpart-5555' to '/dev/dm-16'
May 24 20:26:37 vc-00-00-A-dev udevd[26633]: creating symlink '/dev/disk
/by-id/dm-name-vpart-5555' to '../../dm-16'
May 24 20:26:37 vc-00-00-A-dev udevd[26633]: creating link
'/dev/mapper/vpart-5555' to '/dev/dm-16'
May 24 20:26:37 vc-00-00-A-dev udevd[26633]: creating symlink
'/dev/mapper/vpart-5555' to '../dm-16'
but then “add” processing removes them:
May 24 20:26:37 vc-00-00-A-dev udevd[26544]: update old name, '/dev/disk
/by-id/dm-name-vpart-5555' no longer belonging to
'/devices/virtual/block/dm-16'
May 24 20:26:37 vc-00-00-A-dev udevd[26544]: no reference left, remove
'/dev/disk/by-id/dm-name-vpart-5555'
May 24 20:26:37 vc-00-00-A-dev udevd[26544]: update old name,
'/dev/mapper/vpart-5555' no longer belonging to
'/devices/virtual/block/dm-16'
May 24 20:26:37 vc-00-00-A-dev udevd[26544]: no reference left, remove
'/dev/mapper/vpart-5555'
As a result, there is no /dev/mapper/XXX symbolic link.
We have also seen some other bad effects of this parallel processing
like:
# bad symbolic link is created in /dev/disk/by-path (this happens in
“add” processing, when there is no DM_NAME property):
ll /dev/disk/by-path:
lrwxrwxrwx 1 root root 10 May 23 21:47 dm-name- -> ../../dm-4
lrwxrwxrwx 1 root root 10 May 23 21:45 dm-name-ioerror -> ../../dm-0
lrwxrwxrwx 1 root root 10 May 23 21:47 dm-name-ppart-100 -> ../../dm-5
lrwxrwxrwx 1 root root 10 May 23 21:47 dm-name-ppart-103 -> ../../dm-3
lrwxrwxrwx 1 root root 10 May 23 21:47 dm-name-ppart-105 -> ../../dm-2
lrwxrwxrwx 1 root root 10 May 23 21:47 dm-name-ppart-108 -> ../../dm-1
lrwxrwxrwx 1 root root 10 May 23 21:47 dm-name-ppart-109 -> ../../dm-6
# devnode is created in /dev/mapper, instead of symbolic link:
root@vc-00-00-A-dev:~# ll /dev/mapper/
total 0
drwxr-xr-x 2 root root 140 May 23 22:01 ./
drwxr-xr-x 14 root root 9900 May 23 22:01 ../
lrwxrwxrwx 1 root root 7 May 23 22:01 blabla -> ../dm-1
crw------- 1 root root 10, 236 May 23 19:09 control
lrwxrwxrwx 1 root root 7 May 23 21:53 ioerror -> ../dm-0
lrwxrwxrwx 1 root root 7 May 23 22:01 ppart-4927362 -> ../dm-2
brw-rw---- 1 root disk 252, 3 May 23 22:01 ppart-4927363 // This is a
devnode, not symlink!
and some others.
We see this issue only when doing stress testing of the udev system,
like creating and deleting devices, and then creating them again and
deleting again etc.
To resolve the issue, we disabled the code that handles the DM_COOKIE in
udevd. Then we stopped seeing issues.
All in all, our understanding is that two udev events for the same
devpath should not be executing concurrently, which this fix violates
for Device-Mapper devices. Also, this problem cannot be easily addressed
by playing with udev rules, because the flaw is in the udevd code
itself.

Can you pls share your view on the problems we see?

Thanks,
 Alex.

Related branches

Herton R. Krzesinski (herton) wrote :

Indeed there is a serialization issue regarding the DM_COOKIE events. I was able to reproduce with the test case attached, that must be run as root.

Simply let the test case running, it stress test the creation/removal of a dm device. On the buggy udev, it'll fail at some point, usually leaving a /dev/disk/by-id/dm-name- link, lacking the device name. On a good udev, the test case will run indefinitely without errors.

The problem only happens when stress testing device creation/removal so far, unlikely this to be an issue in practice, where usually no devices are removed.

Herton R. Krzesinski (herton) wrote :

Following this are the attached proposed patches to fix this issue on Oneiric/Precise/Quantal. The fix is to run the event with
DM_COOKIE only if its parent or child event isn't still running. I verified also that bug 802626 remains fixed, and the regression (this bug) doesn't happen anymore.

Herton R. Krzesinski (herton) wrote :
Herton R. Krzesinski (herton) wrote :
Herton R. Krzesinski (herton) wrote :
tags: added: patch
Alex Lyakas (alyakas) wrote :

Herton, thanks for posting this issue. Is this the same patch that you sent me by email? I will try to test it as soon as I can.
Thanks,
   Alex.

Herton R. Krzesinski (herton) wrote :

Yes the patch is similar, from what I remember I just did in addition a small cleanup.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 175-0ubuntu10

---------------
udev (175-0ubuntu10) quantal; urgency=low

  * debian/patches/fix-race-condition-of-serio-driver-module-not-
    loading.patch: Fix a race condition at boot with serio devices.
    Thanks to Eric Miao <email address hidden>. Closes LP: #1014460.
  * Update Vcs-Bzr for the current release branch.

  [ Herton Ronaldo Krzesinski ]
  * Refresh debian/patches/avoid-exit-deadlock-for-dm_cookie.patch:
    avoid serialization issues in handling of events with DM_COOKIE
    set (LP: #1017715).
 -- Steve Langasek <email address hidden> Tue, 03 Jul 2012 00:25:55 -0700

Changed in udev (Ubuntu):
status: In Progress → Fix Released
Alex Lyakas (alyakas) wrote :

Will 175-0ubuntu10 be available also for precise? Currently, precise still updates to 175-0ubuntu9. I want to test the fix.

Steve Langasek (vorlon) on 2012-07-04
Changed in udev (Ubuntu Oneiric):
status: New → Triaged
importance: Undecided → Medium
Changed in udev (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Medium
Martin Pitt (pitti) wrote :

Uploaded to precise-proposed queue, waiting for SRU team to review.

Changed in udev (Ubuntu Precise):
status: Triaged → Fix Committed
Martin Pitt (pitti) wrote :

oneiric update uploaded to -proposed queue, too.

Changed in udev (Ubuntu Oneiric):
status: Triaged → Fix Committed

Hello Herton, or anyone else affected,

Accepted udev into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/udev/175-0ubuntu9.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Herton R. Krzesinski (herton) wrote :

I verified the precise packages with the testcase and they are working fine and fix the issue (marking thus verification-done). There are no oneiric packages in proposed yet.

tags: added: verification-done
removed: verification-needed
Clint Byrum (clint-fewbar) wrote :

Hello Herton, or anyone else affected,

Accepted udev into oneiric-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/udev/173-0ubuntu4.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
Herton R. Krzesinski (herton) wrote :

Checking with the test case, the packages for Oneiric in -proposed fix the issue.

tags: added: verification-done
removed: verification-needed

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 175-0ubuntu9.1

---------------
udev (175-0ubuntu9.1) precise-proposed; urgency=low

  [ Herton Ronaldo Krzesinski ]
  * Refresh debian/patches/avoid-exit-deadlock-for-dm_cookie.patch:
    avoid serialization issues in handling of events with DM_COOKIE
    set (LP: #1017715).

  [ Eric Miao ]
  * fix-race-condition-of-serio-driver-module-not-loading.patch:
    Work around race condition in PS/2 keyboard/mouse handling. This resulted
    in udev sometimes failing to load the psmouse driver, causing touchpads to
    occasionally not work after boot up. (LP: #1014460)
 -- Herton Ronaldo Krzesinski <email address hidden> Mon, 16 Jul 2012 07:17:56 +0200

Changed in udev (Ubuntu Precise):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 173-0ubuntu4.3

---------------
udev (173-0ubuntu4.3) oneiric-proposed; urgency=low

  [ Herton Ronaldo Krzesinski ]
  * Refresh debian/patches/avoid-exit-deadlock-for-dm_cookie.patch:
    avoid serialization issues in handling of events with DM_COOKIE
    set (LP: #1017715).

  [ Eric Miao ]
  * fix-race-condition-of-serio-driver-module-not-loading.patch:
    Work around race condition in PS/2 keyboard/mouse handling. This resulted
    in udev sometimes failing to load the psmouse driver, causing touchpads to
    occasionally not work after boot up. (LP: #1014460)
 -- Herton Ronaldo Krzesinski <email address hidden> Mon, 16 Jul 2012 07:25:22 +0200

Changed in udev (Ubuntu Oneiric):
status: Fix Committed → Fix Released
Alex Lyakas (alyakas) wrote :

Tested with 175-0ubuntu9.1 on Precise; looks like the issue is fixed.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments