lvm2

cannot boot root on lvm2 with (largish) snapshot

Bug #360237 reported by Seth on 2009-04-12

This bug report is a duplicate of: Bug #1396213: LVM VG is not activated during system boot. Edit Remove

This bug affects 9 people

Affects		Status	Importance	Assigned to	Milestone
	lvm2	Invalid	Undecided	Unassigned
	lvm2 (Ubuntu)	Confirmed	Undecided	Unassigned

Bug Description

When running root-on-lvm2 with the root being part of a vg that contains snapshotted volume(s), booting may fail if the snapshot size (or fill rate) grows over a certain size.

The kernel will only wait a certain amount of time until dropping to an initramfs-shell reporting 'Gave up waiting for root device' as the reason. The harddisk drive activity indicator will still show lots of disk access for some more time (minutes in out case).
'ps | grep lvm' reveals that 'lvchange -a y' is taking a long time to complete.
Waiting for the disk activity to die down, and exiting the shell will allow the boot to resume normally.

Can anybody explain *why* it could take that long - each and every time?
==================== DETAILS
More specificly, my volume group contained an Intrepid root partition of 20Gb (15Gb filled). I created a snapshot of this (18Gb) in order to 'sandbox upgrade' to Jaunty beta. This I did. I was kind of surprised by the volume of changes on the snapshot: the snapshot was 53% full after upgrade (so about 9Gb of changed blocks vs. Intrepid). On reboot, I found out that the system would not reboot itself.

I have spent a long time around various seemingly related bugs (#290153, #332270) for a cause until I found out the culprit myself. I have not been able to find any (lvm) documentation warning to the fact that lvm operations might take several minutes (?!) to complete on snapshotted volumes.

At the very least this warrants a CAPITAL RED warning flag in the docs, IMHO: using large snapshots might render a system unbootable (remotely) with root on lvm. Manual intervention at console/serial is required!

1. [untested] it doesn't matter whether the root volume is actually a snapshot or origin, as long as the volume group contains said snapshot (in my case, intrepid was on the origin and jaunty on tje snapshot; both systems failed boot with the same symptoms).
2. [untested] if root volume is in a separate vg things might work ok (assuming activating several volume groups can be done in parallel)
3. [partially tested: freshly snapshotted system booted ok] I suspect nothing is the matter with a snapshot sized 18Gb, unless it fills up ('exception blocks' get used). However there is not much use in a snapshot of 18Gb if you are only allowed to use small parts of it.
4. [tested] as a side note, once booted, both systems were reasonably performant (at least in responsiveness)
5. [tested] other volume groups in the same server did not suffer noticeable performance penalties even when the problematic one performed badly
6. [tested] the performance was back to normal (acceptable), even though my 40Gb Home lv still featured a (largely unaltered) snapshot of 5Gb/6% (in the _same_ volume group). This indicates that nothing special to be wrong/corrupted in the vg metadata.
7. [tested] rebooting/shutting down seemed flawed when initiated from Jaunty (but it seems unrelated to me: Jaunty appears to use kexec for reboot, causing problems with my terminal display and harddisk spin-down as well)

-------------- executed from the initramfs shell:
/sbin/lvm vgchange -a n
/sbin/lvm vgchange -a y # takes several minutes to complete (disk light on continuously)
/sbin/lvm vgchange -a n
/sbin/lvm vgchange -a y # again, same agony (continuous period of solid activity)

/sbin/lvm lvremove vg/largish_snapshot

/sbin/lvm vgchange -a n
/sbin/lvm vgchange -a y # only takes seconds

Of course I made a backup of the data in the snapshot that I actually wanted to keep :)