grub2 fails to boot or install when an LVM snapshot exists

Reported by Alvin on 2010-04-15
This bug affects 18 people
Affects          Status         Importance   Assigned to    Milestone
grub2 (Debian)   Fix Released   Unknown
grub2 (Ubuntu)   Fix Released   High         Unassigned
  Lucid          Fix Released   High         Colin Watson   ubuntu-10.04.4

Declined for Karmic by Martin Pitt

Bug Description

SRU Justification:

Impact: When /boot and / are in an LVM VG and a snapshot exists of any LV in that VG, the system will not boot, and grub cannot be updated or reinstalled until all snapshots are removed.

Testcase:

Binary package hint: grub2

Steps to reproduce:

- (Lucid beta2 installed from CD)
- Take a snapshot of any volume
- on reboot:
  error: fd0 read error.
  error: no such disk.
  grub rescue>
- Use the rescue cd to get a root shell
- Remove the snapshot and reboot
Now, the system boots. Create a new snapshot to repeat.

The system has 2 SATA disks in mdadm RAID1 configuration with 1 lvm volume on top and no 'normal' partitions.

Also see Comment #6 https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/563895/comments/6

Fix: See debdiff patch
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/563895/+attachment/2589269/+files/grub2_1.98-1ubuntu12.1.debdiff

Alvin (alvind) on 2010-04-15
summary: - Disk not found when booting mdadm RAID1 with snapshotted lvm volum
+ Disk not found when booting mdadm RAID1 with snapshotted lvm volume

I tried to use supergrubdisk to debug this, but I'm having difficulty getting logs. It's just too much for a serial console.

Using supergrubdisk:
- insmod raid
- insmod lvm
- detect OS.
No OS will be detected

Alvin (alvind) wrote :

Setting to confirmed because it was easily reproduced on another server.

Changed in grub2 (Ubuntu):
status: New → Confirmed
Nigel Babu (nigelbabu) wrote :

Setting back to New. If another independent source can confirm the bug, it would be great.

Changed in grub2 (Ubuntu):
importance: Undecided → High
status: Confirmed → New
dblade (listmail) wrote :

I believe I have the same issue, although I had hard locked before the reboot and thus interpreted my boot failure as "needing to reinstall the bootloader" and nothing to do with the snapshot I had made earlier. My specific filesystem is full root + boot LVM ext3. I mention that specifically because I have not tested whether this issue also occurs with a separate /boot.

I basically couldn't get grub to (re)install and experienced the symptoms described here -> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/528670

I finally found out that while a snapshot of root was active there was an extra "/dev/mapper/lvgname-lvname-real" device. The key part being "real" here I suppose. I'm pretty new to LVM, but my impression was that with whatever manipulation was going on for copy on write functionality, it could be causing grub to get confused. With a snapshot active, blkid output would now show 2 devices with the exact same UUID.

The moment I killed the snapshot, deleted device.map and ran nothing more than `grub-install /dev/sda`, it completed normally, generated grub.cfg entries loading all the proper modules (raid mdraid lvm ext2) as well as populating a proper /boot/grub/core.img.
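
The duplicate-UUID observation above can be checked mechanically. The following is a minimal sketch and not part of the original report: the function name find_duplicate_uuids is hypothetical, and it assumes blkid's default output format of one `device: KEY="value" ...` line per device.

```shell
#!/bin/sh
# Read blkid-style lines on stdin and print every UUID that appears on
# more than one device, followed by the devices that share it.
# (Hypothetical helper; assumes lines like: /dev/x: UUID="..." TYPE="...")
find_duplicate_uuids() {
    awk '
        # The leading space keeps PARTUUID="..." from matching.
        match($0, / UUID="[^"]*"/) {
            uuid = substr($0, RSTART + 7, RLENGTH - 8)
            dev = $1
            sub(/:$/, "", dev)          # strip the trailing colon
            count[uuid]++
            devs[uuid] = devs[uuid] " " dev
        }
        END {
            for (u in count)
                if (count[u] > 1)
                    print u ":" devs[u]
        }
    '
}

# Usage on a live system (needs root):
#   blkid | find_duplicate_uuids
```

If this prints anything while a snapshot is active and nothing after the snapshot is removed, that would match the behaviour described in this comment.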

dblade (listmail) wrote :

I did not clarify previously, but the LVM physical volume is indeed a mdraid root mirror.

# pvs
  PV VG Fmt Attr PSize PFree
  /dev/md0 mypv lvm2 a- 69.24g 31.24g

Bernhard Schmidt (berni) wrote :

The installation of grub2 fails when _any_ snapshot is present, not only one of the root (or boot) filesystem; the snapshotted volume does not even have to be mounted. In my system it was a snapshot of a WinXP volume used by KVM. Deleting it worked just fine.

root@lxbsc02:/# grub-install /dev/sda
/usr/sbin/grub-probe: error: no mapping exists for `wdc750g-root'.
Auto-detection of a filesystem module failed.
Please specify the module with the option `--modules' explicitly.

root@lxbsc02:~# lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  home wdc750g -wi-ao 40,00g
  kvm-winxp wdc750g owi-a- 20,00g
  kvm-winxp-snap wdc750g swi-a- 8,00g kvm-winxp 10,81
  root wdc750g -wi-ao 20,00g
  swap wdc750g -wi-ao 4,00g
  torrent wdc750g -wi-ao 200,00g

root@lxbsc02:~# lvremove wdc750g/kvm-winxp-snap
Do you really want to remove active logical volume kvm-winxp-snap? [y/n]: y
  Logical volume "kvm-winxp-snap" successfully removed
root@lxbsc02:~# grub-install /dev/sda
Installation finished. No error reported.
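
In the lvs output above, the first character of the Attr column marks the volume type: `s` for a snapshot and `o` for its origin. The following is a minimal sketch for listing snapshots before running grub-install or rebooting an unpatched system. The function name list_snapshots is hypothetical, and the sketch assumes the column order LV, VG, Attr shown in the transcript.

```shell
#!/bin/sh
# Print "VG/LV" for every snapshot found in lvs-style input on stdin.
# Expected columns: LV VG Attr ... (the header line is skipped naturally,
# because "Attr" does not start with s or S).
list_snapshots() {
    awk '$3 ~ /^[sS]/ { print $2 "/" $1 }'
}

# Usage on a live system (needs root):
#   lvs --noheadings -o lv_name,vg_name,lv_attr | list_snapshots
```

Each printed name is in the form that lvremove accepts, as in the `lvremove wdc750g/kvm-winxp-snap` call above. The sketch only lists snapshots and removes nothing.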

This is a clusterf*ck of a bug.

Bernhard Schmidt (berni) on 2010-05-28
Changed in grub2 (Ubuntu):
status: New → Confirmed
summary: - Disk not found when booting mdadm RAID1 with snapshotted lvm volume
+ grub-install fails when LVM snapshot exists

Why change the description? It also fails when grub is successfully installed.

dblade (listmail) wrote :

Perhaps the description should become "grub2 fails to boot or install when an LVM snapshot exists"

Bernhard Schmidt (berni) wrote :

Yes, you are absolutely right. I wanted to stress that it does not seem to be related to mdadm RAID1 as the original description suggested. Changed.

summary: - grub-install fails when LVM snapshot exists
+ grub2 fails to boot or install when an LVM snapshot exists
Alvin (alvind) wrote :

Except for systems using mdadm RAID1, all those I tried with LVM snapshots can boot. You're saying that taking a snapshot on /any/ Lucid system makes it unbootable?

Bernhard Schmidt (berni) wrote :

Yes, I have had this bug on a system without any RAID (neither hardware nor mdraid) running Lucid amd64. And I can reproduce it on that particular box.

root@lxbsc02:~# lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  azureus wdc750g -wi-ao 300,00g
  home wdc750g -wi-ao 40,00g
  kvm-winxp wdc750g -wi-a- 20,00g
  root wdc750g -wi-ao 20,00g
  swap wdc750g -wi-ao 4,00g
  torrent wdc750g -wi-ao 200,00g

root@lxbsc02:~# lvcreate -s -L 2G -n kvm-winxp-fresh wdc750g/kvm-winxp
  Logical volume "kvm-winxp-fresh" created
root@lxbsc02:~# grub-install /dev/sda
/usr/sbin/grub-probe: error: no mapping exists for `wdc750g-root'.
Auto-detection of a filesystem module failed.
Please specify the module with the option `--modules' explicitly.
root@lxbsc02:~# lvremove wdc750g/kvm-winxp-fresh
Do you really want to remove active logical volume kvm-winxp-fresh? [y/n]: y
  Logical volume "kvm-winxp-fresh" successfully removed
root@lxbsc02:~# grub-install /dev/sda
Installation finished. No error reported.

Oddly enough I cannot reproduce it on my system at home, which is also running Lucid.

root@pest:~# lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  btrfs wdc -wi-a- 32,00g
  home wdc -wi-ao 16,00g
  karmic wdc -wi-ao 16,00g
  karmicold wdc -wi-a- 16,00g
  swap wdc -wi-ao 2,00g
  videos wdc -wi-a- 100,00g
  vm-winxp wdc -wi-a- 21,00g
root@pest:~# lvcreate -s -L 2G -n kvm-winxp-fresh wdc750g/vm-winxp
  Volume group "wdc750g" not found
root@pest:~# grub-install /dev/sda
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
Installation finished. No error reported.

Yes, the VG/LV names are a bit different. I tried making the snapshot LV name very long, did not change anything.

Bernhard Schmidt (berni) wrote :

Err that last part obviously showed that I did not create a snapshot, but it seems to work even when I do it right

root@pest:~# lvcreate -s -L 2G -n kvm-winxp-fresh wdc/vm-winxp
  Logical volume "kvm-winxp-fresh" created
root@pest:~# grub-install /dev/sda
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
error: cannot open `/dev/sdb' while attempting to get disk size.
Installation finished. No error reported.

Changed in grub2 (Debian):
status: Unknown → Confirmed
Seth (bugs-sehe) wrote :

Anything about this? Any time I accidentally shut down the system while having a snapshot (e.g. of my /home fs) I get a borked grub boot. _annoying_

I need to boot a rescue CD to lvremove my snapshot before I can boot again

Changed in grub2 (Debian):
status: Confirmed → Fix Released
Colin Watson (cjwatson) wrote :

Fixed in Maverick by merging this upstream change:

grub2 (1.98+20100702-1) unstable; urgency=low

  * New Bazaar snapshot.
[...]
    - Skip LVM snapshots (closes: #574863).
[...]

 -- Colin Watson <email address hidden> Fri, 02 Jul 2010 17:42:56 +0100

Changed in grub2 (Ubuntu):
status: Confirmed → Fix Released
p (p1) wrote :

this issue should be mentioned in the documentation of grub2 until the fix arrives in LTS.

Linus van Geuns (nirkus) wrote :

Upgraded from maverick to natty and had a similar issue:
- bootfs is ext3, rootfs ext4
- both are logical volumes on top of raid1
- did snapshots of root & boot before upgrade and it worked for the first reboot

After that, grub2 just booted the menu entry, did a lot of hdd access and stopped w/o any errors or messages.

I could get the kernel & initramfs loaded by stripping the menu entry down to:
insmod part_msdos, raid, lvm, ext2
root (vg-device)
kernel...
initramfs...

but mounting the rootfs failed within initramfs and it dropped to a shell.
mounting the logical volume (rootfs) within that shell worked.

changing the filesystem UUIDs within my snapshots of rootfs & boot didn't change anything.

after deleting both snapshots, grub2 & initramfs booted w/o any error.

Torsten Landschoff (torsten) wrote :

While installing security updates to my lucid system, this made my system unbootable last week. I spent half an hour today to make it bootable again.

I booted from supergrubdisk which also failed to detect LVM (it usually did). I ended up using Knoppix and noticed the leftover snapshot (created via schroot) on boot, deleting it. I then tried to chroot into my Lucid system which failed because Knoppix is i386 and my Lucid install is amd64.

Rebooting with the Lucid installation medium, I was surprised that the grub installation on my hard drive magically started to work again and booted fine into my Lucid install.

I would deem this a really important problem and I am all for fixing it in Lucid given that a patch exists.
BTW: The snapshot name created by schroot is quite long, as it contains a UUID.

fred (ubuntu-launchpad-lk2) wrote :

not having a fix for this in "LTS" after more than a year makes me sad

wasted 3 hours of my life on this earlier this year - bug is known, confirmed and fixed - why not push a new grub version for lucid?

Benaiah (dougie-hobson) wrote :

I am also wondering why a bug fix is not being pushed to lucid. I know that with LTS versions you do the whole feature freeze thing and only release security updates and I think bug fixes. This is not a feature addition, it is a bug that needs to be fixed. Is this not possible?

The answer to "why didn't this get into Lucid?" is that nobody did the SRU process:

https://wiki.ubuntu.com/StableReleaseUpdates#Procedure

Attached debdiff for Lucid

description: updated
tags: added: testcase

Attached debdiff for Lucid (this time using "lucid-proposed" pocket)

description: updated
description: updated

Confirmed patch from PPA
https://launchpad.net/~nutznboltz/+archive/lucid-grub2-skip-lvm-snapshots
fixes issue.

VM with patch via PPA installed booted despite snapshot:

nutz@lp-563895:~$ dpkg -l |grep grub
ii grub-common 1.98-1ubuntu12.1~ppa1~lucid1 GRand Unified Bootloader, version 2 (common
ii grub-pc 1.98-1ubuntu12.1~ppa1~lucid1 GRand Unified Bootloader, version 2 (PC/BIOS

nutz@lp-563895:~$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 owi-ao 3.81g
  lvol0 vg0 swi-a- 1.71g lv0 0.17
  swap vg0 -wi-ao 488.00m

nutz@lp-563895:~$ df -h /boot
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-lv0 3.9G 901M 3.0G 24% /

No one has responded to my effort to convert this bug report into an SRU request, so I did the only thing I know how to do: I opened LP: #888069

I applied the patched grub-pc and grub-common packages from my PPA
https://launchpad.net/~nutznboltz/+archive/lucid-grub2-skip-lvm-snapshots
to an HP DL-165 G5 today. Prior to applying the patches the system was unbootable.

One thing that I did have to do was:

1. aptitude purge grub-pc grub-common
2. cd /boot/grub
3. rm -r *
4. cd -

Then install the patched packages. For some reason the APT "purge" command leaves many files under /boot/grub

Colin Watson (cjwatson) on 2011-11-15
Changed in grub2 (Ubuntu Lucid):
milestone: none → ubuntu-10.04.4
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grub2 (Ubuntu Lucid):
status: New → Confirmed

I will be on vacation through Jan 5, 2012. Please do not ask for testing until after that date, thanks.

Martin Pitt (pitti) on 2012-01-20
Changed in grub2 (Ubuntu Lucid):
status: Confirmed → Triaged
importance: Undecided → High
assignee: nobody → Colin Watson (cjwatson)

Hello Alvin, or anyone else affected,

Accepted grub2 into lucid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!
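
For testers unfamiliar with -proposed, enabling it comes down to adding one APT source line for the `<release>-proposed` pocket. The following is a minimal sketch only: the function name proposed_source_line is hypothetical, and the mirror URL and component list are assumptions that may not match a given system. The wiki page linked above is the authoritative description.

```shell
#!/bin/sh
# Print an APT source line for the -proposed pocket of the given release.
# Hypothetical helper; mirror URL and component list are assumptions.
proposed_source_line() {
    release=$1
    printf 'deb http://archive.ubuntu.com/ubuntu/ %s-proposed main restricted universe multiverse\n' "$release"
}

# Usage (prints the line only; it does not modify any APT configuration):
#   proposed_source_line lucid
```

The printed line would go into an APT sources file, followed by `apt-get update` and an upgrade of grub-pc and grub-common.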

Changed in grub2 (Ubuntu Lucid):
status: Triaged → Fix Committed
tags: added: verification-needed

Interesting dates in the life of LP: #563895

March 21, 2010 - Reported as Debian #574863
April 15, 2010 (elapsed 25 days) Reported as LP: #563895
June 2, 2010 (elapsed 73 days) Fix committed into Debian and Debian bug closed.
July 5, 2010 (elapsed 106 days) Fix committed in Ubuntu 10.10 (Maverick)
September 10, 2011 (538 days) Complaint (not by me) in LP #563895 using the words "not having a fix for this in "LTS" after more than a year makes me sad"
November 8, 2011 (597 days) I convert bug into SRU request with debdiff of patch and PPA with patched grub2 packages.
November 14, 2011 (603 days) by submitting this question I am able to get someone to acknowledge that this bug needs work.
January 20, 2012 (670 days) day on which I, the only person on Earth who is going to know enough to do the testing, am referred to by Martin Pitt as "anyone else affected".

Colin Watson (cjwatson) wrote :

@nutznboltz: Thanks for your patch, and sorry I took so long to deal with it. It needed to be reformatted into a patch with proper headers and attribution in debian/patches/, to match the rest of the package. Since this has been waiting so long, I just went ahead and did this rather than walking you through it, but I've left your name in the changelog.

And yes, it did take a long time. That's because we have more work than it's humanly possible to do. I'm not sure that recriminations are productive at this point?

Colin Watson (cjwatson) wrote :

(Also, I think I can probably do the testing if necessary, so I think you're exaggerating about "only person on Earth", not to mention that Martin's comment was generated by the sru-accept.py script rather than written by hand; but given said inhuman amounts of work it would probably stand a better chance of happening quickly if somebody else did it.)

Sorry, Colin, it's not about you. I'll explain later if I get the chance.

> Why aren't they testing now? What could be wrong?

Wait! Maybe it's that they are all stupid! Yeah, that's it they're too stupid to test. Don't worry stupid people, I'll do your testing for you.

nutz@lp-563895:~$ apt-cache policy grub-pc
grub-pc:
  Installed: 1.98-1ubuntu13
  Candidate: 1.98-1ubuntu13
  Version table:
 *** 1.98-1ubuntu13 0
        100 /var/lib/dpkg/status
     1.98-1ubuntu12 0
        900 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     1.98-1ubuntu5 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
ksta@lp-563895:~$ apt-cache policy grub-pc grub-common
grub-pc:
  Installed: 1.98-1ubuntu13
  Candidate: 1.98-1ubuntu13
  Version table:
 *** 1.98-1ubuntu13 0
        100 /var/lib/dpkg/status
     1.98-1ubuntu12 0
        900 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     1.98-1ubuntu5 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
grub-common:
  Installed: 1.98-1ubuntu13
  Candidate: 1.98-1ubuntu13
  Version table:
 *** 1.98-1ubuntu13 0
        100 /var/lib/dpkg/status
     1.98-1ubuntu12 0
        900 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     1.98-1ubuntu5 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
nutz@lp-563895:~$ sudo grub-install /dev/vda
Installation finished. No error reported.
nutz@lp-563895:~$ lsb_release -ds ; uname -a
Ubuntu 10.04.3 LTS
Linux lp-563895 2.6.32-37-server #81-Ubuntu SMP Fri Dec 2 20:49:12 UTC 2011 x86_64 GNU/Linux

nutz@lp-563895:~$ sudo lvcreate -s -l 437 /dev/vg0/lv0
  Logical volume "lvol0" created
nutz@lp-563895:~$ sudo vgs
  VG #PV #LV #SN Attr VSize VFree
  vg0 1 3 1 wz--n- 6.00g 0
nutz@lp-563895:~$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 owi-ao 3.81g
  lvol0 vg0 swi-a- 1.71g lv0 0.00
  swap vg0 -wi-ao 488.00m
nutz@lp-563895:~$ df -h / /boot
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-lv0 3.9G 955M 2.9G 25% /
/dev/mapper/vg0-lv0 3.9G 955M 2.9G 25% /
nutz@lp-563895:~$ sudo reboot

The system rebooted with the snapshot. Everything works. Thanks.

tags: added: verification-done
removed: verification-needed
dblade (listmail) wrote :

nutznboltz please post information that pertains to the bug. All the extra stuff you are tossing in serves no purpose.

Thanks.

@dblade it's the price that you pay for not doing the SRU work yourself. The reward for being the one to do the work is that you don't have to listen to the ones who are doing it for you. Why not spend some time learning how to do the work?

I have some information for you:
http://www.youtube.com/watch?v=h1NcFYUS3uA&fmt=18#t=2m52s

Are you an Ubuntu Linux Bug Fool? http://tiny.cc/ubuntu-linux-bug-fool

dblade (listmail) wrote :

Most people applied the fixed package manually and moved on over a year ago. Are you really surprised?

I'm not here to debate whether or not it is right or wrong to report a bug, or provide info regarding a bug, and then not be involved every step of the way. People have the right to contribute or not contribute as they see fit. Your attitude is not going to affect this fact in a positive way.

All you have really achieved here is made me pull my name off the CC list for this bug. I did it because the latest updates you've made are the equivalent of spam.

Maybe you need to take a step back and realize how little control you actually have on this matter.

Torsten Landschoff (torsten) wrote :

I actually applied the fix manually but did not set the fixed package on hold as I had the hope that any update to the grub package in lucid would fix this.

Yesterday my system crashed while working with schroot and snapshots and is now unbootable again :-) So I am interested in a fix once again and hope it goes into Lucid. I will try and install the new proposed fix.

Torsten Landschoff (torsten) wrote :

Okay, I just installed the version from lucid-proposed:

torsten@sharokan:~$ dpkg -l|grep grub
ii grub-common 1.98-1ubuntu13 GRand Unified Bootloader, version 2 (common files)
ii grub-pc 1.98-1ubuntu13 GRand Unified Bootloader, version 2 (PC/BIOS version)

I had to run

# grub-install /dev/sda

to actually update the grub installation (it would be nice if installing the new package would do that automatically, but this is of course a risk wrt. possible regressions).

I created a snapshot again and lo and behold: I can boot just fine even with existing LVM snapshots.
Thanks! I am all for moving this to lucid-updates.

On Tue, Jan 24, 2012 at 10:02:37AM -0000, Torsten Landschoff wrote:
> I had to run
>
> # grub-install /dev/sda
>
> to actually update the grub installation

Run 'dpkg-reconfigure grub-pc' to set this up to run automatically on
future upgrades.

@Colin: that will not work for /dev/vda on Ubuntu 10.04.

Oh, right after I wrote that I remembered LP: #623609
Does 1.98-1ubuntu13 fix that?

Colin Watson (cjwatson) wrote :

Torsten asked for /dev/sda, not /dev/vda, so that is moot anyway.

Yes, as indicated in the changelog, 1.98-1ubuntu13 should also fix bug 623609.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 1.98-1ubuntu13

---------------
grub2 (1.98-1ubuntu13) lucid-proposed; urgency=low

  [ Colin Watson ]
  * Handle partition devices without corresponding disk devices
    (LP: #623609).

  [ Ken Stailey ]
  * Backport upstream patch to skip LVM snapshots (LP: #563895).
 -- Colin Watson <email address hidden> Fri, 20 Jan 2012 12:08:36 +0000

Changed in grub2 (Ubuntu Lucid):
status: Fix Committed → Fix Released

Thanks Colin Watson, you do such great work!
