udev repeatedly generates "change" events for the same block device(s)

Bug #332270 reported by Michael Flaig on 2009-02-20
174
This bug affects 11 people
Affects Status Importance Assigned to Milestone
lvm2 (Ubuntu)
High
Unassigned
Jaunty
High
Unassigned
udev (Ubuntu)
High
Unassigned
Jaunty
High
Unassigned

Bug Description

SYMPTOMS

Early in the boot process (during initial ram-disk script processing) the PC appears to stop although there may be a lot of repeated disk activity, possibly accompanied by messages to screen of the form:

 UEVENT[1235217664.944992] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda (block)

This happens after an automatic package upgrade to udev to 138-1.

RECOVERY PROCEDURE

If the PC is unable to complete start-up, effectively locking you out, this is how to regain control and install a work-around until a solution is available.

 1. When the PC starts enter the GRUB menu.
 2. Select the Recovery entry and edit it (press "e") - this starts in single-user mode with minimal services.
 3. Select the "kernel" line and edit it (press "e")
 4. Add " break=top" to the end of the line and press Enter.
 5. With the "kernel" line highlighted press "b" to boot using the edited entry.

The boot process will stop when the initial ram-disk scripts begin and you'll be faced with a busybox shell.

Edit the rules that contain the problematic "watch" and remove it:

 for rule in /lib/udev/rules.d/*; do sed -i -e '/watch/s/watch//' $rule; done

Modify the init script so the boot can continue:

 sed -i -e '/run-init/s,${init},/bin/bash,' /init

Restart the boot process by pressing Ctrl+D

The system *should* now boot to a root prompt although there may still be a delay. Once there, edit the non-initrd system rules:

 for rule in /lib/udev/rules.d/*; do sed -i -e '/watch/s/watch//' $rule; done

And rebuild the initial ram-disk image so you don't have to go through this procedure at every boot:

To update just the kernel version that is currently running:

 update-initramfs -u -k `uname -r`

or to update for all installed kernels:

 update-initramfs -u -k all

Now restart the PC and hopefully the regular multiuser start-up should go normally.

-- TJ

-----------------------------
Since todays upgrade of udev jaunty is not able to boot anymore
affects also recovery mode, of course.

I can see messages like
/sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda (9367)
/sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda/sda2 (9366)
/sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda/sda2 (9365)

the messages appear in blocks of quite many, always repeating after 5 minutes or so and the number is counting up.

Harddisk LED is flashing rapidly but no hearable access to the hard disk
I think it has something to do with udev as this is in the last upgrade log

2009-02-20 20:59:24 upgrade libvolume-id1 137-2 138-1
2009-02-20 20:59:24 status unpacked libvolume-id1 138-1
2009-02-20 20:59:24 status unpacked libvolume-id1 138-1
2009-02-20 20:59:25 configure libvolume-id1 138-1 138-1
2009-02-20 20:59:25 status unpacked libvolume-id1 138-1
2009-02-20 20:59:25 status half-configured libvolume-id1 138-1
2009-02-20 20:59:25 status installed libvolume-id1 138-1
2009-02-20 20:59:26 upgrade udev 137-2 138-1
2009-02-20 20:59:27 status unpacked udev 138-1
2009-02-20 20:59:27 status unpacked udev 138-1
2009-02-20 20:59:34 upgrade libudev0 137-2 138-1
2009-02-20 20:59:34 status unpacked libudev0 138-1
2009-02-20 20:59:34 status unpacked libudev0 138-1
2009-02-20 20:59:41 configure udev 138-1 138-1
2009-02-20 20:59:41 status unpacked udev 138-1
2009-02-20 20:59:41 status unpacked udev 138-1
2009-02-20 20:59:41 status unpacked udev 138-1
2009-02-20 20:59:41 status unpacked udev 138-1
2009-02-20 20:59:41 status unpacked udev 138-1
2009-02-20 20:59:41 status unpacked udev 138-1
2009-02-20 20:59:41 status half-configured udev 138-1
2009-02-20 20:59:41 status installed udev 138-1
2009-02-20 20:59:47 configure libudev0 138-1 138-1
2009-02-20 20:59:47 status unpacked libudev0 138-1
2009-02-20 20:59:47 status half-configured libudev0 138-1
2009-02-20 20:59:47 status installed libudev0 138-1

Martin Kossick (hacktick) wrote :

reassigning this bug to package udev

Tomasz Nowak (tomnowak) wrote :

  I've the same problem. After system was upgraded today hard drive is working all the time and system doesn't boot anymore.

Martin Kossick (hacktick) wrote :

Okay, thank you.

Setting status to confirmed.

For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status .

Changed in udev:
status: New → Confirmed
Martin Kossick (hacktick) wrote :

Okay, thank you.

Setting status to confirmed.

For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status .

Confirmed same on Jaunty amd64, Lenovo ThinkPad T500. Jaunty *does* eventually boot, but by a rough wall-clock, it takes 90mins +/- 5, and the laptop's scorching afterward.

Disk check of some kind? I noticed there was a dmsetup upgrade, and I was focused on that-- I'm using lvols for /, /var, and /tmp.

Anders Kaseorg (andersk) wrote :

Me too. The problem goes away after downgrading dmsetup, libdevmapper-event1.02.1, libdevmapper1.02.1 [2:1.02.27-4ubuntu4 → 2:1.02.27-4ubuntu3], and libudev0, libvolume-id1, udev [138-1 → 137-2].

I've tried downloading both the Feb 19 and 20 ISOs, and they appear to contain the 138-1 version of udev as well as the other upgraged packages.

How can I downgrade? Are there other sources?

David McBride (david-mcbride) wrote :

Greetings,

I also have this problem, and I think I might be able to shed some light on what's going wrong.

I'm running a slightly unusual configuration -- my root filesystem on my ThinkPad is on LVM, as it my home filesystem. Only a small /boot is stored on a conventional DOS partition. When I boot the latest Jaunty with the 138-1 udev updates, the boot stalls -- with constant disk activity -- during the initial udev device discovery stage.

I found that pressing CTRL-ALT-DELETE interrupts this process, but rather than rebooting the machine, causes the initial udev device settling process to abort and the boot sequence to continue.

The constant disk activity is caused by the LVM udev rules constantly firing.

I note from the changelog that udev introduces a new mechanism: it uses inotify to track changes to block devices; this is so that it can keep /dev/by-uuid/* and /dev/by-label/* links up to date.

I hypothesize that the constant disk activity on the (root) LVM filesystem is causing the LVM udev rules to constantly fire, thus resulting in more disk activity... which never (or seldom) stops long enough for the loop to be broken. And because of this loop, the udev settling process never terminates, so the machine is unable to finish booting.

Reverting the use-inotify-to-watch-LVM-volumes addition to udev would probably be a good idea until this can be sorted out. In the mean time, I'm going to see if I can find a binary package to downgrade to...

Cheers,
David

Michael Evans (mjevans1983) wrote :

Same here, I used the prior kernel (-7 not -8) and was able to boot BUT now udevd runs at ~50% of a core and spams messages like...

UEVENT[1235217664.735688] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda (block)
UEVENT[1235217664.759302] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759389] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759465] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759540] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759613] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759687] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759762] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759848] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759922] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.759997] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.944992] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda (block)
UEVENT[1235217664.947108] change /devices/virtual/block/dm-0 (block)
UEVENT[1235217664.952729] change /devices/virtual/block/dm-0 (block)
UEVENT[1235217664.957689] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.958017] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 (block)
UEVENT[1235217664.963004] change /devices/virtual/block/dm-1 (block)
UEVENT[1235217664.963441] change /devices/virtual/block/dm-1 (block)
UEVENT[1235217664.991615] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda2 (block)
UEVENT[1235217664.991673] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda2 (block)
UEVENT[1235217664.991720] change /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda2 (block)
UEVENT[1235217665.034513] change /devices/virtual/block/dm-3 (block)

non-stop...

Albert Damen (albrt) wrote :

The problem seems to come from the watch option added in 60-persistent-storage.rules.

1. Booting with break=premount drops me in busybox without problems. With break=mount the disk activity already starts.
2. Booting with break=premount, then starting udevd and calling udevadm trigger --action=change starts the disk activity
3. Same as 2, but first removing the watch options from 60-persistent-storage.rules and 65-dmsetup.rules does not trigger the disk activity. However, the disk activity now starts after / is mounted (/ is on lvm).
4. Same as 3, but also removing the watch option from the same 2 rules files in / makes the boot successful.
5. Setting udev's loglevel to debug, adding the watch option in 65-dmsetup.rules and connecting an external usb drive does not cause any problems.
6. Same as 5, but also adding the watch option to 60-persistent-storage.rules results in the high disk activity. syslog is quickly filled with messages.
7. After removing the watch option from 60-persistent-storage.rules again, I can connect the usb drive without problems again.

Attached is the syslog part from step 6, between connecting the usb drive and disconnecting it again.
The most surprising lines for me are:
Feb 21 13:29:20 jaunty udevd-event[16733]: device will be watched for changes
Feb 21 13:29:20 jaunty udevd-event[16733]: 'watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y''

Why should vgscan start again after the watch is added?

The usb drive has 4 partitions. 2 are ext3, 1 is swap and 1 is LVM:
$ sudo fdisk -l /dev/sdc

Disk /dev/sdc: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000e7a91

   Device Boot Start End Blocks Id System
/dev/sdc1 * 1 6119 49150836 83 Linux
/dev/sdc2 6120 12238 49150867+ 83 Linux
/dev/sdc3 12239 18357 49150867+ 8e Linux LVM
/dev/sdc4 18358 19457 8835750 82 Linux swap / Solaris

Michael Evans (mjevans1983) wrote :

Looking at other bugs, like this one...

https://bugs.launchpad.net/ubuntu/+source/speechd-up/+bug/316511

Could this be a new udev rule?

Um... This doesn't quite look as complicated as my gentoo system or my 8.10 gateway server...

ls -lR /etc/udev/
/etc/udev/:
total 8
drwxr-xr-x 2 root root 4096 2009-02-21 03:37 rules.d
-rw-r--r-- 1 root root 218 2009-01-20 08:14 udev.conf

/etc/udev/rules.d:
total 16
-rw-r--r-- 1 root root 55 2008-07-27 07:37 60-kqemu.rules
-rw-r--r-- 1 root root 1851 2008-10-19 23:11 70-persistent-cd.rules
-rw-r--r-- 1 root root 735 2008-09-11 05:14 70-persistent-net.rules
-rw-r--r-- 1 root root 1398 2009-01-20 08:14 README

Michael Evans (mjevans1983) wrote :

The udev files were moved, and I glanced at them but I'm no expert on udev files and nothing obviously related to /devices/pci0000:00/0000:00:09.0/host0/target0:0:0/0:0:0:0/block/sda/sda1 even stuck out.

Is there a guide to follow for troubleshooting this, logs that should be collected, or tests that are suggested for when I wake up tonight and try to start my laptop again?

TJ (tj) wrote :

I've put a test package in my PPA that may resolve this:

https://edge.launchpad.net/~intuitivenipple/+archive/ppa/
udev (138-2~tj~ppa1j) jaunty; urgency=low

  * Revert "Revert "rules: add "watch" option to dm and md rules" to prevent
    slow down due to inotify on dm devices (LP: #332270).

 -- TJ <email address hidden> Sat, 21 Feb 2009 17:10:00 +0000

I've not had the opportunity to restart the PC to test it and I'm still using udev 137-2 until I get chance to close everything down later today.

Phil Sung (psung) wrote :

Unfortunately, installing libudev0, libvolume-id1, and udev 138-2~tj~ppa1j doesn't fix the problem for me.

Phil Sung (psung) wrote :

Workaround: I can boot to the desktop under kernel -7 (but not -8)-- albeit very slowly. Then, "/etc/init.d/udev restart" seems to stop the disk thrashing.

Michael Evans (mjevans1983) wrote :

Tested, and works

< albert23> IntuitiveNipple: you need to drop the OPTIONS+="watch" line from 60-persistent-storage.rules. That's the one causing trouble with lvm. See comment 10 in the bug.
< IntuitiveNipple> ok, I've found the commit that applies the watch: f24036d
< IntuitiveNipple> "Device nodes can be watched for changes with inotify with OPTIONS="watch". If closed after being opened for writing, a "change" uevent will occur."
So... any time a close event happens a change event will be triggered... and presumably logged?
Which could then re-open and ...

edit /lib/udev/rules.d/60-persistent-storage.rules ; changing

# watch for future changes
KERNEL!="sr*", OPTIONS+="watch"

to

# watch for future changes
#KERNEL!="sr*", OPTIONS+="watch"

Then:

20090221-11:55:29 < IntuitiveNipple> If that patch applies ok, you can restart udev: sudo /etc/init.d/udev restart

Michael Evans (mjevans1983) wrote :

In case the odd wrapping causes confusion, the lines said by others have their nicknames in front, except when word-wrapped.

< IntuitiveNipple> "Device nodes can be watched for changes with inotify with OPTIONS="watch". If closed after being opened for writing, a "change" uevent will occur."
- -
So... any time a close event happens a change event will be triggered... and presumably logged?
Which could then re-open and ...

Michael Evans (mjevans1983) wrote :

20090221-12:13:33 < IntuitiveNipple> Sure... you do that, I'll do the new package and test it in the VM

Changed in udev:
assignee: nobody → intuitivenipple
status: Confirmed → In Progress
TJ (tj) wrote :

The updated fix is now available in my PPA. Instead of reverting selected patches I reset to the commit before the watch rules were added and then cherry-picked the two non-watch-rules:

udev (138-2~tj~ppa2j) jaunty; urgency=low

  * Revert "watch" option rules to prevent slow down due to
    inotify storm on devices (LP: #332270).

 -- TJ <email address hidden> Sat, 21 Feb 2009 17:10:00 +0000

Michael Evans (mjevans1983) wrote :

Tested 138-2~tj~ppa2j on my dual-core laptop with LVM and crypto-root home, etc.

Confirm the package also fixes it and boots with kernel 2.6.28-8.

For everyone still with 138-1; good luck getting inside of a prior kernel or chroot setup to install the update.

This is not a correct fix.

Please do not recommend people install it.

Instead it would be much more helpful if we could understand why you are getting such "event storms"

Daniel T Chen (crimsun) on 2009-02-22
Changed in udev:
importance: Undecided → Critical

 status incomplete

On Sat, 2009-02-21 at 11:21 +0000, David McBride wrote:

> Reverting the use-inotify-to-watch-LVM-volumes addition to udev would
> probably be a good idea until this can be sorted out. In the mean time,
> I'm going to see if I can find a binary package to downgrade to...
>
No.

Jaunty is a development release, and these kinds of problems are to be
expected.

Reverting simply means that we stop receiving bugs reports - that is
very undesirable.

I have not yet been able to reproduce this problem, so we need all the
bug reports we can get - and as much help from people who are
experiencing this problem that they can offer.

For the same reasons, I would ask that people _DO_NOT_ install the
workaround package posted by TJ.

Instead, please assist with debugging this problem so we can fix the
cause instead of the symptom.

Firstly I would like to collect information common to those affected
(since so far it appears limited to a small handful.)

If people could attach the output of "dpkg -l" on their systems, that
would be appreciated. Likewise please run "ls /sys/block", "lvm pvs",
"lvm vgs", and if you have any md partitions, "mdadm -D" on each.

Scott
--
Scott James Remnant
<email address hidden>

Changed in udev:
status: In Progress → Incomplete
Marc Deslauriers (mdeslaur) wrote :

I got hit by this bug also. It rendered my system unbootable.
I had to boot 2.6.28-7, ctrl-alt-del during boot and comment out the line from /lib/udev/rules.d/60-persistent-storage.rules to get the system bootable again (still need to boot with 2.6.28-7 though).

Attached are my files.

Marc Deslauriers (mdeslaur) wrote :
Marc Deslauriers (mdeslaur) wrote :
Marc Deslauriers (mdeslaur) wrote :
Marc Deslauriers (mdeslaur) wrote :

I have an encrypted swap file, if that makes a difference.

Michael Evans (mjevans1983) wrote :

My setup:

Dual core X2, 2gb ram, sata hard disk (sdaX), etc.

A windows vista partition (came with laptop -rarely- used... 'when I have to')
A normal boot partition
An LVM partition

crypttab
rootvolume /dev/ubuntu/encryptedroot none luks,retry=1
homevolume /dev/ubuntu/encryptedhome none luks,retry=1
mediavol /dev/ubuntu/media none luks,retry=1

fstab
/dev/mapper/rootvolume / ext3 relatime,errors=remount-ro 0 0
/dev/mapper/homevolume /home ext3 relatime 0 2
/dev/mapper/mediavol /media/m ext3 relatime 0 2
/dev/scd0 /media/cdrom0 udf,iso9660 user,noauto,exec,utf8 0 0

I am going to postulate where the problem originates by explaining an earlier statement I made.

"< IntuitiveNipple> "Device nodes can be watched for changes with inotify with OPTIONS="watch". If closed after being opened for writing, a "change" uevent will occur."
- -
So... any time a close event happens a change event will be triggered... and presumably logged?
Which could then re-open and ..."

I have three ext3 partitions with internal logging, all of which live inside of encrypted LVM containers. I am unsure where in the chain the problem occurs, but this is what I think may be happening.

1) Something modifies the state of the disk, logging info/etc.
2) That change constitutes a change event.
3) Change events are bad, but even worse, may be logged, feeding back in to step 1.

A recursive notification loop would thus form.

In conclusion, to quote a line I said in IRC "what lesson did this illustrate: Never inotify for normal operations of a special device file..."

Jonathan Hudson (jh+lpd) wrote :

The TJ "fix" did NOT fix anything for me. Same symptoms as most here:

Appears not to boot (within a reasonable time-frame) with 2.6.28-8, spewing out dozens of messages of the form:

/sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda/sda2 (9365)

to the console every five minutes or so.

2.6.28-7 does boot (slowly), lots of disk activity. udevd then takes up c. 50% of the CPU.

Disks are PATA (mobo i/f) and SATA (addon PCI card). In each case, LVM on top of RAID1.

Michael Evans (mjevans1983) wrote :

jh, can you get in or chroot and run:
sudo dpkg -l udev

You should see this:
ii udev 138-2~tj~ppa2j rule-based device node and kernel event manager

If you do, please run this: (I use -c since -u has broken the cryptsetup in the past.)
sudo update-initramfs -c -k all

If you do not, consider adding TJ's ppa OR following the removal of watch instructions I posted in an earlier comment.

https://edge.launchpad.net/~intuitivenipple/+archive/ppa/
http://tjworld.net/wiki/Linux/Ubuntu/Packages/MyPPA

sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0598C74B0FA42F8B9C82CC23727D031129047922
echo "deb http://ppa.launchpad.net/intuitivenipple/ppa/ubuntu $(lsb_release -sc) main" | sudo tee /etc/apt/sources.list.d/intuitivenipple.list
sudo apt-get update
sudo apt-get upgrade

Then check again:

To force that version use:
apt-get install udev=138-2~tj~ppa2j

hunger (hunger) wrote :

I am seeing the same issue.

My setup is:
sda1: windows
sda2: ubuntu root FS
sda3: empty space for other OSes
sda4: EXTENDED
sda5-9: LVM PVs

/var, /var/cache, /usr, /home and swap are all on LVs.

Commenting out the watch-line in udev rules (as suggested in <a href="https://bugs.launchpad.net/ubuntu/+source/udev/+bug/332270/comments/16">comment 16</a> does not fix the issue for me.

Luka Renko (lure) wrote :

Scott, agreed on need to find root cause for this on, particularly as workaround packages does not help me at all. ;-)

I have the same issue with up-to-date Kubuntu/Jaunty on HP n8440 laptop running everything on LVM.
I have small /boot (/dev/sda1), the rest of the disk is LVM pvol (/dev/sda2). I have several logical volumes (swap, /, /home, /misc...)

I have booted my system with Alternate CD rescue mode and have collected requested info. If you need anything let me know (it may take some time to provide as I have internet only in the reception of the hotel).

Regards,
Luka

Michael Evans (mjevans1983) wrote :

Interesting, I don't remember if I tried commenting it out with the first spin of the PPA package or the 138-1 version of the package.

Does installing the PPA version of the package fix it?

Peter Sabaini (peter-sabaini) wrote :

I am also seeing this on my laptop (Thinkpad X60). The system runs on an encrypted filesystem.

Peter Sabaini (peter-sabaini) wrote :
Peter Sabaini (peter-sabaini) wrote :
Peter Sabaini (peter-sabaini) wrote :
Michael Evans (mjevans1983) wrote :

If you've -only- commented it out like in my prior post: https://bugs.edge.launchpad.net/ubuntu/+source/udev/+bug/332270/comments/16

You -ALSO- need to update the initrd. The stale initrd will still have the bad rulesets in it.
sudo update-initramfs -c -k all

gunzip -dc initrd.img-2.6.28-8-generic | cpio -t | grep udev
lib/udev
lib/udev/edd_id
lib/udev/vol_id
lib/udev/ata_id
lib/udev/usb_id
lib/udev/path_id
lib/udev/rules.d
lib/udev/rules.d/85-lvm2.rules
lib/udev/rules.d/80-drivers.rules
lib/udev/rules.d/50-udev-default.rules
lib/udev/rules.d/64-device-mapper.rules
lib/udev/rules.d/60-persistent-storage.rules
lib/udev/rules.d/61-persistent-storage-edd.rules
lib/udev/rules.d/65-dmsetup.rules
lib/udev/rules.d/95-udev-late.rules
lib/udev/watershed
lib/udev/scsi_id
etc/udev
etc/udev/udev.conf
sbin/udevd
sbin/udevadm
scripts/nfs-top/udev

hunger (hunger) wrote :

More information to my earlier comment https://bugs.launchpad.net/ubuntu/+source/udev/+bug/332270/comments/32

I am using neither encrypted partitions nor a raid setup.

Attached you will find the requested info (in one tar ball).

Daniel Hahler (blueyed) on 2009-02-22
Changed in udev:
status: Confirmed → Triaged
Changed in udev:
importance: Critical → High
Changed in udev:
milestone: none → jaunty-alpha-5
43 comments hidden view all 123 comments
Steve Langasek (vorlon) wrote :

I confirm that I'm also seeing this problem here on a system with root+swap on crypt LVM as set up using the Ubuntu alternate 8.10 installer; so it's definitely a configuration that our installer can create. 'high' is fine for urgency, regardless.

chunglam (chunglam) wrote :

I have two machines with this issue, both using LVM2 plus 3ware raid card. One of the machines don't have any alternative way to boot then I install a fresh jaunty to another hard disk from my internal mirror using usb netboot, initially I can boot into the new jaunty because I forgot to enable the raid volume( I have boot manager to hide disk volume), but after I enable the raid volume thus also enable LVM2 the fresh jaunty refused to boot. Finally I have to downgrade the udev(and devmapper) to 137 to get access to my LVM2 volumes. I will send the logs later.

Martin Pool (mbp) wrote :

I'm also seeing this in what was originally 333073. This machine's configuration was made by fairly straightforward use of gutsy (iirc) installer options, nothing too exotic.

Michael Evans (mjevans1983) wrote :

If watch is the problem, what is it supposed to be doing anyway?

TJ (tj) wrote :

Having left the problem alone overnight it occurred to me this morning that maybe the issue here is the *multiple* watches being created by similar devices (block + lvm, block +md) in the same event run that is at the heart of this. This would explain why only systems with some combination of LVM, crypt, md or dmraid seem to be affected.

With my earlier observations about timing being significant (slowed down by udev logging, it can avoid the issue) I wonder if it's a case of the actions in 65-persistent-storage not having completed when 85-lvm2 (for example) is being processed and therefore creating a positive feedback loop?

From what I can gather child processes are forked so we could have a child processes triggering new inotify events unexpectedly.

I did wonder briefly what the effect is when an inotify watch is created on a node that is already covered by another inotify watch.

Michael Evans (mjevans1983) wrote :

TJ if that's the case then this is probably a flaw of the entire block IO system design.

In my mind, the ideal solution would be to have an abstraction which eventually reduces to a list of raw devices and block-ranges of action per device. When a watch or event occurs it would only be within that range. Multiple virtual devices might match against different ranges of raw block devices.

If the device, block-range, and some type of timestamp of the event were combined a unique key would be formed. Generating events only from the raw device and reflecting them up to directly linked virtual block abstractions would place an implicit limit in the data-structures and execution path.

  • trace Edit (1.4 MiB, text/plain; name="trace")

I have only lvm and md no crypt. For testing i removed mdadm and md
device, so it make no difference. After start udev create four instances
, the first one use 100% CPU. I attach strace of this process.

For other test i removed 85-lvm2.rules on booted system. Now udev
starting normally with only one instance, but i steal nead this file to
boot my pc.

Michael Evans (mjevans1983) wrote :

Yeah, Fishor if TJ is correct then what's happening is this.

Udev adds watches on any of these devices which you have...
disk/partition
lvm
crypto
raid
(any other device-mapper things)

Only, unfortunately, two or more of those are the same real device. A race condition (which would probably be aggravated by SMP systems) can then occur where a potentially infinite loop of self-chaining change events starts. I would presume that at some point it either overflows or bogs down to the point where so many instances are spawned that it causes the lag and then lets you past.

Oleksij Rempel (olerem) wrote :

> Only, unfortunately, two or more of those are the same real device. A
> race condition (which would probably be aggravated by SMP systems) can
> then occur where a potentially infinite loop of self-chaining change
> events starts. I would presume that at some point it either overflows
> or bogs down to the point where so many instances are spawned that it
> causes the lag and then lets you past.

hmm... seems to be true. I just disabled one core started udev and
enabled core again. It's working.

  echo 0 > /sys/devices/system/cpu/cpu1/online
  /etc/init.d/udev start
  echo 1 > /sys/devices/system/cpu/cpu1/online

Michael Evans (mjevans1983) wrote :

Another workaround -might- be possible within grub then.

During menu-selection hit e to edit
arrow over the kernel line
e to edit the line
add (after a space) to the end 'maxcpus=1' and it might boot.

I'll see if I can test that.

Michael Evans (mjevans1983) wrote :

Unfortunately that does not seem to work on my test setup.

Kirit Sælensminde (kayess) wrote :

I'm getting what appears to be this problem on my eee running jaunty with an encrypted disk. Using the -7 kernel it will boot, using -8 it won't. If I boot using -7 then restart udev the CPU utilisation reduces for a while, but then goes back up later on. The volume was created using the alternate installer for alpha 2 with the default options.

same problem here. Using raid+cryptsetup+lvm

Could boot kernel 2.6.28-7-server without changing anything. Booting up needs approximately 5min.

After ~10 hours /dev seems to be full?! Checked it using df :

df
Dateisystem 1K‐Blöcke Benutzt Verfügbar Ben% Eingehängt auf
/dev/mapper/vg-root 2064208 519092 1440260 27% /
tmpfs 488552 4 488548 1% /lib/init/rw
varrun 488552 276 488276 1% /var/run
varlock 488552 0 488552 0% /var/lock
udev 488552 488552 0 100% /dev
tmpfs 488552 0 488552 0% /dev/shm
/dev/mapper/vg-home 495844 10524 459720 3% /home
/dev/mapper/vg-opt 495844 10510 459734 3% /opt
/dev/mapper/vg-srv 598631504 478385084 89837716 85% /srv
/dev/mapper/vg-tmp 495844 38197 432047 9% /tmp
/dev/mapper/vg-usr 4128448 545064 3373672 14% /usr
/dev/mapper/vg-var 2064208 210828 1748524 11% /var
/dev/md0 497765 58874 413192 13% /boot

The appended munin graph shows the flooding.

Fabián Rodríguez (magicfab) wrote :

I have repeating messages like this one at boot (after switching to console), after Friday's updates:
  /sys/devices/virtual/block/dm-0 ($NR)

as noted in a previous comment. $NR is a decreasing number.

I was able to perform an upgrade on Saturday after botting into a previous kernel recovery option, which took quite some time. Once the upgrade was done I left the machine trying to boot on for the last two days (until this morning), without noticeable change or progress other than the number decreasing steadily, looping back again after reaching 0.

This install was done with a Jaunty alpha 3 later updated. Straight "use all disk + lvm + encryption" install from Alternate CD, so definitely a situation the installer can reproduce predictably.

I can't provide logs until I get back home (+9 hrs), though.

On Mon, 2009-02-23 at 11:32 +0000, MichaelEvans wrote:

> Only, unfortunately, two or more of those are the same real device. A
> race condition (which would probably be aggravated by SMP systems) can
> then occur where a potentially infinite loop of self-chaining change
> events starts. I would presume that at some point it either overflows
> or bogs down to the point where so many instances are spawned that it
> causes the lag and then lets you past.
>
inotify doesn't work this way.

Actually, it looks like something far simpler is happening:

udev runs "lvm vgscan" when it detects an LVM Physical Volume. It looks
like lvm is opening every single block device on the system FOR WRITING;
that triggers the inotify IN_CLOSE_WRITE event, and thus a udev change
event - which means udev runs "lvm vgscan" (repeat until up to
temperature).

It would be helpful if people could reply if THEY DO NOT have *any* LVM
Physical Volumes on their system, and are still experiencing this bug.

Scott
--
Scott James Remnant
<email address hidden>

Michael Evans (mjevans1983) wrote :

Scott, you have a strongly supported hypothesis, a problem domain of LVM and -any- of the other watch rulesets.

Looking over every comment in this bug and everyone that mentions system setup also seems to be running LVM in one way or another.

The topic in #ubuntu+1 should be updated to help collect the information you requested.

Oleksij Rempel (olerem) wrote :

udev 138 introduced new "watch" option. I just disabled it and every
thing working fine now.

vi /lib/udev/rules.d/60-persistent-storage.rules

comment
#KERNEL!="sr*", OPTIONS+="watch"

and probobly in 65-dmsetup.rules too:
#OPTIONS+="watch"

TJ (tj) on 2009-02-23
description: updated

There are two parts to this bug.

The actual bug is being caused by "lvm vgscan" opening all block devices on the system writable and then closing them again, which triggers the inotify event. There seems to be no reason they cannot be opened readonly, since the only thing it does is call fstat() and read() on them.

Since the devices include any LVM PVs (which is what it's looking for), this gives you a recursive loop where LVM PVs appear to have been changed, so udev calls "lvm vgscan" again to update its tables.

Part #1 of the fix will be a patch to LVM to only open block devices with O_RDONLY

Part #2 of the fix will be a patch to udev to not add the inotify watch until all RUN rules are processed; this will prevent the initial onslaught on boot, and hopefully guard against anything it runs changing the device resulting in reprocessing (which is better handled with udev rules ordering anyway).

It will obviously not protect against a "udevadm trigger" once booted, so it's possible that we'll trip over this with another program later and will have to fix that one in the same manner as LVM.

It's not yet clear why this does not affect _everyone_, there appears to be a race involved as slowing udev down with various log levels - or using only one core - is enough to hide the bug for some people. I need to work out why that is before uploading the fixes

Changed in lvm2:
importance: Undecided → High
milestone: none → jaunty-alpha-5
status: New → Triaged
Alasdair G. Kergon (agk2) wrote :

On Mon, Feb 23, 2009 at 04:45:18PM -0000, Scott James Remnant wrote:
> There are two parts to this bug.
> The actual bug is being caused by "lvm vgscan" opening all block devices
> on the system writable and then closing them again, which triggers the
> inotify event. There seems to be no reason they cannot be opened
> readonly, since the only thing it does is call fstat() and read() on
> them.

Good luck:-)

> Part #1 of the fix will be a patch to LVM to only open block devices
> with O_RDONLY

I don't know which lines are the problem, but it's probably opening
them for write in case it decides it needs to fix up the metadata.
(It can't handle promotion from read->write so if any possible code path
includes a write, it opens with write access.)

> Since the devices include any LVM PVs (which is what it's looking for),
> this gives you a recursive loop where LVM PVs appear to have been
> changed, so udev calls "lvm vgscan" again to update its tables.

That's what needs fixing - vgscan is just the wrong approach here.
Upstream (2.02.45) we made a start with a new class of 'pvs' output
fields, Physical Volume Label Fields, e.g. pvs -o pv_name.
If only these fields are specified, the code is meant not to scan
and only access the devices specified on the command line.

We need to sort out the right interface for adding one device at a
time so you don't need vgscan at all.

Alasdair
--
<email address hidden>

Alasdair G. Kergon (agk2) wrote :

It's also worth trying it without any vgscans at all - it ought to
cope without nowadays provided all commands access VGs by name
i.e. 'vgchange -ay vg1' and never 'vgchange -ay'.
It'll trigger a vgscan internally if it can't find a VG that
was explicitly named.

Alasdair
--
<email address hidden>

On Mon, 2009-02-23 at 17:44 +0000, Alasdair G. Kergon wrote:

> It's also worth trying it without any vgscans at all - it ought to
> cope without nowadays provided all commands access VGs by name
> i.e. 'vgchange -ay vg1' and never 'vgchange -ay'.
> It'll trigger a vgscan internally if it can't find a VG that
> was explicitly named.
>
Which will cause the exact same problem we're seeing :-/

Scott
--
Scott James Remnant
<email address hidden>

Yannis Tsop (ogiannhs) wrote :

I have this one too! I have LVM over RAID (mdadm). I will post my details tomorrow.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 138-2

---------------
udev (138-2) jaunty; urgency=low

  * Fix inotify watch code to remove any existing watch before beginning rule
    processing, and not to add the watch until the rule processing is
    complete. This stops us "chasing our own tail" if one of the programs we
    run writes to the block device, or at least opens it for writing.
    LP: #332270

  * Fix inotify code to respond to all events, rather than just the first in
    the buffer repeatedly. Ironically this bug was causing fewer people to be
    affected by #332270.

 -- Scott James Remnant <email address hidden> Mon, 23 Feb 2009 20:28:58 +0000

Changed in udev:
status: Triaged → Fix Released

Please note that the above upload is a _partial_ fix.

If you have only a single LVM PV, it should fix things for you. After upgrading, if you've copied the persistent storage rules to /etc/udev/rules.d, you should remove that file. If you just edited the file in /lib/udev/rules.d, it will be overwritten anyway.

If you have multiple LVM PVs, you will still see the problem; though the effect should be somewhat lessened.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Scott James Remnant wrote:

> If you have multiple LVM PVs, you will still see the problem; though the
> effect should be somewhat lessened.

Is there a fix for this in sight?

- --
Cheers,
Daniel

http://daniel.hahler.de/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFJoxakfAK/hT/mPgARApdRAKCMK1g8YB3k5MbANu5kGWLtOeWvJgCg/Yqp
VImu/pg116OuxbQSrVZikjw=
=SDXs
-----END PGP SIGNATURE-----

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lvm2 - 2.02.39-0ubuntu7

---------------
lvm2 (2.02.39-0ubuntu7) jaunty; urgency=low

  * debian/patches/open-readonly.patch:
    - Generally we don't need to write to every single block device we open,
      in fact if we do that when scanning we'll make udev think they all
      changed, and it could run "lvm vgscan" again! When the file descriptor
      is needed for writing, it will be closed and opened again. LP: #332270.

 -- Scott James Remnant <email address hidden> Mon, 23 Feb 2009 21:44:51 +0000

Changed in lvm2:
status: Triaged → Fix Released

Alasdair: patch for LVM2 attached - this makes functions that only read from a block device open with O_RDONLY instead of writable-if-they-can. If another function comes along and needs it writable, dev_open_flags() already does the right thing and reopens it anyway.

The LVM2 upload constitutes the complete fix.

udev no longer receives inotify events for the specific block device it's processing the rules for (thus guarding against RUN rules causing them).

LVM no longer opens all block devices for writing when invoked with "vgscan" by udev, so no longer triggers the inotify events for other block devices.

Martin Pool (mbp) wrote :

> This bug was fixed in the package udev - 138-2

For those who may be wondering why they can't install this fix, it's because <https://edge.launchpad.net/ubuntu/jaunty/+source/udev/138-2> shows that it's still waiting to build on all architectures.

Cafuego (cafuego) wrote :

I just built lvm2 and udev from the new sources and indeed, Jaunty boots again.

Thank you! :-)

I can confirm that with only the udev update from jaunty, my laptop
with a single lvm-on-dmcrypt pv works again.

--
Martin <http://launchpad.net/~mbp/>

souplin (klage) wrote :

Can confirm the bug was fixed by installing the package udev-138-2 and the new lvm2, but somehow leaves bug #325690 The conf/conf.d/cryptroot isn't created by update-initramfs anymore.

Don't know, if this is also caused by the last udev updates. Also my other hdds are not mounted - I can't sent them to the sleep modus via hdparm -Y /dev/sdb any longer. The hdd restarts instantly...

Peter Klotz (peter-klotz) wrote :

Installation of package udev-138-2 fixed my problems. Thanks a lot.

Kirit Sælensminde (kayess) wrote :

My eee is now booting with the -8 kernel after the latest updates. Thanks!

Is there a reason why there is no new lvm2 package for amd64?

On Mon, Feb 23, 2009 at 11:30 PM, Launchpad Bug Tracker
<email address hidden> wrote:
> This bug was fixed in the package lvm2 - 2.02.39-0ubuntu7
>
> ---------------
> lvm2 (2.02.39-0ubuntu7) jaunty; urgency=low
>
>  * debian/patches/open-readonly.patch:
>    - Generally we don't need to write to every single block device we open,
>      in fact if we do that when scanning we'll make udev think they all
>      changed, and it could run "lvm vgscan" again!  When the file descriptor
>      is needed for writing, it will be closed and opened again.  LP: #332270.
>
>  -- Scott James Remnant <email address hidden>   Mon, 23 Feb 2009 21:44:51
> +0000
>
> ** Changed in: lvm2 (Ubuntu Jaunty)
>       Status: Triaged => Fix Released
>
> --
> udev repeatedly generates "change" events for the same block device(s)
> https://bugs.launchpad.net/bugs/332270
> You received this bug notification because you are a direct subscriber
> of the bug.
>

--
Francisco

On Tue, Feb 24, 2009 at 07:34:12PM EST, Francisco Borges wrote:
> Is there a reason why there is no new lvm2 package for amd64?

Not that I know of, it built successfully, so should be getting out to mirrors.

Michael Evans (mjevans1983) wrote :

The Updates appear to work on my laptop
Disk
|-Swap
|-Boot
|-LVM
|-lv-Crypto-Root
|-lv-Crypto-Home
|-lv-Crypto-Other

Iain Lane (laney) wrote :

There is no need to confirm that the update works; it just generates email to everyone subscribed to the bug. Please only comment if the update does not fix this specific issue for you. Thanks.

Seth (bugs-sehe) wrote :

I had similar trouble, removing 'watch' didn't change a thing for me (can't seem to get any of these updates while in jaunty_beta?). I realize therefore my problem is not related to this bug, still I realize that more people will find this bug when searching for heir answer later; let me explain how I solved my issue for reference:

FWIW I found out that 'lvchange -a y' was taking ages on my volume group that contained a snaphot of 18Gb that was 50% full. Ages measuring in minutes, it would prevent successfull boot. I found out by doing repeated

/sbin/lvm vgchange -a n
/sbin/lvm vgchange -a y # takes ages

/sbin/lvm lvremove vg/largish_snapshot

/sbin/lvm vgchange -a n
/sbin/lvm vgchange -a y # takes seconds

Of course I made a backup of the data in the snapshot that I actually wanted to keep :)

Anyone seeing prolonged disk activity while failing boot, check this. I'm considering to file this as a bug against lvm2 (unless I find some documentation on lvm2 that I should have read, stating that this behaviour is by design).

Regards,
Seth

Displaying first 40 and last 40 comments. View all 123 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers