Ubuntu

install on degraded raid1 does not boot, drops to initramfs shell

Reported by xor on 2011-05-06
This bug affects 5 people
Affects                     Importance   Assigned to
initramfs-tools (Ubuntu)    Critical     Clint Byrum
  Natty                     Undecided    Unassigned
mdadm (Ubuntu)              Critical     Clint Byrum
  Natty                     Critical     Clint Byrum

Bug Description

Yesterday I did a fresh install of Natty amd64. I used the PXE installer so all packages are up to date right after installation.

I partitioned the disk to have two software raid1 devices. One for swap, one for / with xfs.
Installation completed successfully, grub was installed to MBR.

The system has a >95% failure rate in booting.
At first it would hang on a purple screen. I blindly typed "reboot" at this screen, which worked. So I figured out that it would boot to a shell but not show it.
Then, I added "nosplash" to the kernel command line and removed "quiet".

This revealed that the system drops into a BusyBox shell labeled "initramfs" when booting fails.
Unfortunately, I cannot tell whether it shows any errors on screen because the BusyBox shell resets the screen buffer and CTRL+PageUp does not work.

Please help me figure out what is wrong. Either tell me which log file I can search for the very first boot messages, or release a package update which fixes the screen-clearing of the initramfs shell.
If you do that I will take the effort of doing the 10-20 boot attempts to get to the point where I can update the packages...
If the initramfs boot process is logged anywhere, I can check the log files from a bootable USB stick without getting the system to boot...

==== SRU Justification ====

IMPACT: Users who install on a degraded RAID1 or who lose a disk drive will be unable to boot.

TEST CASE:

1. Install a natty system with root on a software RAID1
2. After booting once, shutdown
3. remove one disk entirely
4. boot the system, it may drop to initramfs, or it may boot degraded
5. reboot, this time it should reliably drop to initramfs
6. poweroff and add the disk in again, it should now boot degraded
7. install updated mdadm
8. repeat steps 2-5; it should pass and NOT drop to initramfs
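
To confirm during the test case whether an array is actually running degraded, the member counts in /proc/mdstat can be checked. This is a minimal sketch, assuming the usual [expected/active] status notation such as [2/1]; degraded_arrays is a hypothetical helper name and not part of mdadm.

```shell
#!/bin/sh
# Sketch: list md arrays that mdstat-formatted text reports as degraded.
# A status field such as [2/1] means 2 members expected, 1 active.
# degraded_arrays is a hypothetical helper, not an mdadm command.
degraded_arrays() {
    # Reads mdstat-formatted text on stdin; prints one array name per line.
    awk '
        # Remember the array name from a line like "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # Find the [expected/active] field and compare the two counts.
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (name != "" && n[2] + 0 < n[1] + 0) print name
            name = ""
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

An empty result means no array was reported with fewer active members than expected.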

REGRESSION POTENTIAL: The fix that has been uploaded to oneiric already takes care to err on the side of responding to degraded arrays. Still, mdadm is very sensitive and so probably needs a bit of extra testing to ensure that it works properly, including passing the usual RAID1 ISO install tests (with the additional step of installing the mdadm package from proposed right after install).

xor (xor) wrote :

Note that I also tried "rootdelay=100" and "rootwait" as kernel parameters; neither helps. In fact it seems to ignore them: there is no visible delay before it drops to the shell.

Changed in ubuntu:
importance: Undecided → Critical
xor (xor) wrote :

SpamapS on IRC figured out that the messages which are hidden by the screen clearing of BusyBox are shown on a different terminal which can be accessed with ALT+F7.

A picture of them is attached.

Clint Byrum (clint-fewbar) wrote :

I saw this once in ISO testing but was unable to reproduce it. It looks like xor has a system that produces this far more reliably, so it must be a race between md and the mount of the root fs.

Changed in ubuntu:
status: New → Confirmed
xor (xor) wrote :

rootdelay=100 => Same messages as without it
rootwait => Also the same

Clint Byrum (clint-fewbar) wrote :

I've done some deep investigation of this issue, many thanks to xor for the feedback. It is 100% reproducible by degrading a RAID1 array on a VM and simply rebooting. Booting with 'quiet' removed, however, changes the race, and so, it doesn't manifest.

I believe the problem is that wait-for-root does not wait long enough for the MD devices to actually be ready. There's a second problem: because wait-for-root doesn't detect the degraded array, our "boot degraded" option is ignored. I'll open a second task against mdadm as well.

affects: ubuntu → initramfs-tools (Ubuntu)
Changed in mdadm (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
tags: added: natty regression-release
Changed in initramfs-tools (Ubuntu):
assignee: nobody → Clint Byrum (clint-fewbar)
summary: - Fresh natty install on raid1 does not boot, drops to initramfs shell
+ install on degraded raid1 does not boot, drops to initramfs shell
Changed in initramfs-tools (Ubuntu):
status: Confirmed → In Progress
status: In Progress → Invalid
Changed in mdadm (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Clint Byrum (clint-fewbar)
Clint Byrum (clint-fewbar) wrote :

I went ahead and marked the initramfs task as invalid. Major refactoring is coming to initramfs tools, and I don't think it was actually doing anything "wrong". There's just a lag between being told "that device exists" and that device having all of its error checking enabled, so that a mount attempt would fail and force the degraded check. So since I can't find anything definitive on what the udev event means for MD, I will assume it means "now you can run mdadm checks against it", not "now it's perfect and happy and ready to be mounted".

I'm preparing an upload now to oneiric that will do the degraded array check after wait-for-root but before trying the mount, and provide the same degraded array question on that detection.
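
The ordering described above can be sketched roughly as follows. This is not the script shipped in the mdadm package: the function names are hypothetical, and the only kernel parameter assumed is bootdegraded=true, which is mentioned later in this report.

```shell
#!/bin/sh
# Sketch only: illustrates the decision made after wait-for-root and
# before the root mount. All function names here are hypothetical.

boot_degraded_requested() {
    # $1: kernel command line text (e.g. the contents of /proc/cmdline).
    # $1 is left unquoted on purpose so it is split into separate words.
    for arg in $1; do
        case "$arg" in
            bootdegraded=true) return 0 ;;
        esac
    done
    return 1
}

premount_action() {
    # $1: "yes" if a degraded array was detected, anything else otherwise.
    # $2: kernel command line text.
    if [ "$1" != "yes" ]; then
        echo mount              # array is healthy, mount root as usual
    elif boot_degraded_requested "$2"; then
        echo start-degraded     # start the array degraded without asking
    else
        echo ask-user           # show the degraded array question
    fi
}

# Example: premount_action yes "$(cat /proc/cmdline)"
```

The point of the sketch is only the order of the checks: the degraded test runs before any mount attempt, so the result no longer depends on whether the mount happens to fail.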

Changed in initramfs-tools (Ubuntu Natty):
status: New → Invalid
Changed in mdadm (Ubuntu Natty):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Clint Byrum (clint-fewbar)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mdadm - 3.1.4-1+8efb9d1ubuntu5

---------------
mdadm (3.1.4-1+8efb9d1ubuntu5) oneiric; urgency=low

  * Call checks in local-premount to avoid race condition with udev
    and opening a degraded array. (LP: #778520)
 -- Clint Byrum <email address hidden> Tue, 24 May 2011 13:05:01 -0700

Changed in mdadm (Ubuntu):
status: In Progress → Fix Released
Clint Byrum (clint-fewbar) wrote :

I've uploaded the oneiric package to natty-proposed as well. In my testing the new method is far more reliable at detecting degraded arrays and always either shows the degraded array screen or boots the system, never dropping indiscriminately to the initramfs> prompt.

description: updated

Accepted mdadm into natty-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in mdadm (Ubuntu Natty):
status: Triaged → Fix Committed
tags: added: verification-needed
xor (xor) wrote :

I've done a fresh installation of Ubuntu server natty three days ago, on a degraded raid, and it always boots.
So if the package was deployed already then it fixes the issue, yes.

(The original system with Ubuntu desktop does not exist anymore, had to downgrade to 10.10 due to other issues.)

Excerpts from xor's message of Fri Jun 10 15:09:58 UTC 2011:
> I've done a fresh installation of Ubuntu server natty three days ago, on a degraded raid, and it always boots.
> So if the package was deployed already then it fixes the issue, yes.
>
> (The original system with Ubuntu desktop does not exist anymore, had to
> downgrade to 10.10 due to other issues.)

Hi xor, thanks for trying it out again.

No, the new mdadm package hasn't been moved to natty-updates yet. Since
this is a race condition, it's entirely possible that something else was
just different enough in your new install to avoid the issue. While
it sometimes manifests with an install on degraded RAID1, the test case
is to install on a full RAID1 and then remove one drive. That seems to
cause the problem much more reliably, especially on a KVM virtual machine.

xor (xor) wrote :

Ok. Is it possible for you guys to deploy the fixed package / close this bug without me setting up another testing machine?

You've said that you were able to reproduce it yourself so I guess I'm fine with you closing the bug if you say that you've fixed it according to your reproducing-setup.

Clint Byrum (clint-fewbar) wrote :

Excerpts from xor's message of Fri Jun 10 22:27:06 UTC 2011:
> Ok. Is it possible for you guys to deploy the fixed package / close this
> bug without me setting up another testing machine?
>
> You've said that you were able to reproduce it yourself so I guess I'm
> fine with you closing the bug if you say that you've fixed it according
> to your reproducing-setup.
>

We need somebody to try out the test case with the -proposed package to
verify, independently from myself, that it fixes the problem. It does
not need to be you, just somebody.

Clint Byrum (clint-fewbar) wrote :

bug #820111 notes a regression introduced by this change in oneiric. Will fix in oneiric and upload a new version to natty-proposed.

tags: added: verification-failed
removed: verification-needed
Clint Byrum (clint-fewbar) wrote :

A fix for bug #820111 has been uploaded to natty-proposed (in upload queue now), so marking this as verification-needed again.

tags: added: verification-needed
removed: verification-failed

Hello xor, or anyone else affected,

Accepted mdadm into natty-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

tags: added: testcase
piwacet (davrosmeglos) wrote :

Apologies if I'm misunderstanding something, but I'm not sure this bug is fixed in Oneiric.

I've tested this in virtualbox with both the server 64-bit beta2 and the daily build server 64-bit for Sept. 27th. If I've hit this same problem, it's 100% reproducible.

Two virtual disks, with /dev/sda1 /dev/sdb1 as raid 1 /dev/md0 for /.
/dev/sda2 /dev/sdb2 as raid 1 /dev/md1 for swap.

If I "disconnect" one of the drives, on booting the grub countdown appears, and then the screen goes black. The only response I can get from the machine is if I blindly press 'enter', then type 'reboot', then press 'enter' again, which causes the virtual computer to reboot. It doesn't matter which of the 2 disks I 'disconnect.'

So far, this has happened 100% of boot attempts. If I reattach the disk, it boots fine.

Apologies if I've misunderstood something. I can do some virtualbox testing if that would help.

Excerpts from piwacet's message of Wed Sep 28 01:43:19 UTC 2011:
> Apologies if I'm misunderstanding something, but I'm not sure this bug
> is fixed in Oneiric.
>
> I've tested this in virtualbox with both the server 64-bit beta2 and the
> daily build server 64-bit for Sept. 27th. If I've hit this same
> problem, it's 100% reproducible.
>
> Two virtual disks, with /dev/sda1 /dev/sdb1 as raid 1 /dev/md0 for /.
> /dev/sda2 /dev/sdb2 as raid 1 /dev/md1 for swap.
>
> If I "disconnect" one of the drives, on booting the grub countdown
> appears, and then the screen goes black. The only response I can get
> from the machine is if I blindly press 'enter', then type 'reboot', then
> press 'enter' again, which causes the virtual computer to reboot. It
> doesn't matter which of the 2 disks I 'disconnect.'
>
> So far, this has happened 100% of boot attempts. If I reattach the
> disk, it boots fine.
>
> Apologies if I've misunderstood something. I can do some virtualbox
> testing if that would help.

piwacet, I've done quite a few test runs doing exactly as you say, and
never been dropped to initramfs. Also I've never had the screen go black
on me. I have been testing with KVM though, not virtualbox.

Can you make sure that the kernel command line has no "quiet" argument,
and also has 'nomodeset' so we get a text bootup? It's very odd that it
is all black.

piwacet (davrosmeglos) wrote :

Here's my kernel command line information:

From dmesg:

[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.0.0-12-server root=UUID=b63798ce-2a88-4598-8dad-a9c1cd679330 ro nomodeset

And /etc/default/grub minus the commented lines:

GRUB_DEFAULT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=2
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="nomodeset"

Even with nomodeset, screen stays black.

I'll see if I can test this out in KVM to see if it's different. Maybe the black screen is a virtualbox thing.

piwacet (davrosmeglos) wrote :

OK figured it out.

If I understand this right, this bug remains fixed in Oneiric.

I tested this in a KVM machine, which shows early-boot console output that virtualbox does not. When I remove a drive in KVM, I'm given a choice to boot the degraded raid, or drop to a shell. After a timeout, it automatically drops to the shell; if I choose 'y' before the timeout, it boots the degraded raid.

The problem with virtualbox is that it does not show this early console output. But the process proceeds correctly: if in virtualbox I remove a drive, boot, and wait a bit, and then blindly press 'y' and enter at the still-blank screen, it boots to the degraded raid. If I press enter only (selecting the default to drop to a shell), then 'reboot', then enter, it reboots. None of this is visible on the screen, that's the only problem. If I do boot to the degraded raid, the console output becomes visible as it nears the end of the boot process.

So the problem I was experiencing has nothing to do with this bug. It seems to be simply virtualbox not displaying console output until very late in a successful boot process. Don't know why it does this.

So apologies for the noise, but hopefully this can be helpful if someone else runs into this problem in a virtualbox VM.

Thanks!

Clint Byrum (clint-fewbar) wrote :

Excerpts from piwacet's message of Thu Sep 29 00:10:40 UTC 2011:
> OK figured it out.
>
> If I understand this right, this bug remains fixed in Oneiric.
>
> I tested this in a KVM machine, which shows early-boot console output
> that virtualbox does not. When I remove a drive in KVM, I'm given a
> choice to boot the degraded raid, or drop to a shell. After a timeout,
> it automatically drops to the shell; if I choose 'y' before the timeout,
> it boots the degraded raid.
>
> The problem with virtualbox is that it does not show this early console
> output. But the process proceeds correctly: if in virtualbox I remove a
> drive, boot, and wait a bit, and then blindly press 'y' and enter at the
> still-blank screen, it boots to the degraded raid. If I press enter
> only (selecting the default to drop to a shell), then 'reboot', then
> enter, it reboots. None of this is visible on the screen, that's the
> only problem. If I do boot to the degraded raid, the console output
> becomes visible as it nears the end of the boot process.
>
> So the problem I was experiencing has nothing to do with this bug. It
> seems to be simply virtualbox not displaying console output until very
> late in a successful boot process. Don't know why it does this.
>
> So apologies for the noise, but hopefully this can be helpful if someone
> else runs into this problem in a virtualbox VM.
>

piwacet, thanks for the detailed testing and confirmation that the fix has
indeed worked. I would suggest opening a bug with Virtualbox about this
problem. It's entirely possible that our kernel or their virtualization
env is doing something terribly wrong early in the boot... it's pretty
important that initramfs be visible to users.

jdonner (jeffrey-donner) wrote :

Hi; this is the closest bug I found to my problem. I did a clean install of 11.10 Oneiric Ocelot (Kubuntu) on a machine which has a degraded software RAID disk (but not onto that disk), did not install mdadm, and it booted fine. I then merely installed mdadm but did not attempt to activate the drive, and I get the initramfs / busybox screen. Here are screenshots of dmesg, and of grepping for failures: http://imgur.com/a/NEr12

Let me know if there are any experiments I can do.

Thanks.

Galen Seitz (galens) wrote :

I am also experiencing problems related to booting with a degraded md RAID1 array. I performed a clean install of 64 bit Mythbuntu 11.10 onto a single disk on an i3 2100 system. Note that I have configured this system for UEFI booting. Later I added a second disk, performing the following steps:

Installed Seagate ST31000528AS Barracuda 7200.12 1TB drive.
Partitioned as single GPT partition using parted.
Created RAID1 array with single disk.
  mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/sdb1 missing
Created ext4 filesystem on /dev/md0 using mkfs.ext4.
Used blkid on the device to determine the UUID.
Added entry in /etc/fstab to mount filesystem using UUID. Mount point
of /mythtv/drive0.
Created mount point using mkdir.
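
The steps above could be scripted roughly like this. It is a dry-run sketch that only prints the commands rather than running them; the parted arguments, the blkid flags and the fstab options are assumptions, since the comment gives only the mdadm invocation verbatim. plan_degraded_mirror is a hypothetical name.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for the steps above, runs nothing.
# Device and mount point names are taken from the comment.
DISK=/dev/sdb
MD=/dev/md0
MNT=/mythtv/drive0

plan_degraded_mirror() {
    # Single GPT partition spanning the disk (parted arguments assumed).
    echo "parted -s $DISK mklabel gpt mkpart primary 0% 100%"
    # RAID1 with one real member and one 'missing' slot (from the comment).
    echo "mdadm --create --verbose $MD --level=mirror --raid-devices=2 ${DISK}1 missing"
    echo "mkfs.ext4 $MD"
    # Print only the filesystem UUID (blkid flags assumed).
    echo "blkid -s UUID -o value $MD"
    echo "mkdir -p $MNT"
    # fstab options are an assumption; substitute the UUID blkid reports.
    echo "echo 'UUID=<uuid-from-blkid> $MNT ext4 defaults 0 2' >> /etc/fstab"
}

plan_degraded_mirror
```

Printing the plan first makes it possible to review the device names before anything destructive is run by hand.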

After adding the disk, booting gives me a blank screen and never displays any messages. This is with set gfxpayload=$linux_gfx_mode, quiet, and splash. If I remove those, the screen stays blank for ~20 seconds, then the busybox/initramfs screen is displayed. Once busybox is displayed, a switch to vt7 reveals the degraded RAID messages. In both cases the kernel option vt.handoff=7 is being passed.

Galen Seitz (galens) wrote :

It appears my degraded RAID1 problem was actually due to an instance of bug #699802. After working around this bug, I am able to boot with a degraded RAID1 and get the appropriate prompt, or skip the prompt with BOOT_DEGRADED=true. This is with quiet and splash removed. There is still a problem when using quiet and splash, but I think that this is far more likely to be a UEFI/video related problem. Apologies for the noise.

michael brenden (mike-brenden) wrote :

There's some race condition now that doesn't give RAID5 (and probably other types) mdadm adequate time, so it shows up with faulty/degraded arrays.

Another problem makes the boot process ignore the kernel parameter "bootdegraded=true".

This forces people with mdadm software RAID into a "busybox" initramfs shell, basically a blackhole of doom and downage.

Apparently the only recourse is to totally reinstall the system (again), not tell it about the existing RAID5, then go boot and do some fiddling around with udevadm delay, per https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/872220?comments=all

and then cross my fingers, give it a reboot, and hope it boots without race condition incorrectly marking RAID as failed...
and if not then re-install everything and try the next iteration of screwing around.

GOT SOME NEWS FOR YA -- THIS IS THE FSCKING *DEFINITION* OF MICROSOFT.

michael brenden (mike-brenden) wrote :

I reinstalled, rebooted, and tried the "fix" given here:
http://ubuntuforums.org/showpost.php?p=11388915&postcount=18

The shit doesn't work. Server still comes up wrong. These kinds of cavalier games and changes are life changing and extremely damaging.

After fucked by 10.04 LTS fiasco with Upstart shit, which came after the 6.06 LTS fucked by archive location change, I've been shown by the now several years long-term proof that Ubuntu is actually a joke. Tonight's a pivot, forcing me back to Debian, or -eek- CentOS. Have a nice time, peeps. Peace.

Related

https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/778520

https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/917520

Dimitri John Ledkov (xnox) wrote :

Dear Michael Brenden,
Thank you for your bug comments here and in other bug reports. To maintain a respectful atmosphere, please follow the code of conduct - http://www.ubuntu.com/community/conduct/ . Bug reports are handled by humans, the majority of whom are volunteers, so please bear this in mind.
Regards,
Dmitrijs.

vak (khamenya) wrote :

@Michael Brenden, I have had the same emotion at the big failure of my expectations regarding 12.04; however, we should have thought twice after looking at the name "Pangolin" -- it will be about bugs, bugs and bugs ;). Debian is much too conservative and Ubuntu becomes "too liberal". Whatever.

Try mdadm-3.2.5-1ubuntu from the PPA. It seems to help me.

@Dmitrijs many thanks for your job!

ceg (ceg) wrote :

Vak, if you need more recent versions, look for a howto to set up a debian (desktop) box with testing or sid.

These releases contain recent software with continuous updates, while still being conservative on the finishing of features, and not just dumping them on users and forgetting about them.

After all, ubuntu copies and releases a debian version from before it is "stable", but seems to stop adding updates after releasing that premature version.

dino99 (9d9) wrote :
Changed in mdadm (Ubuntu Natty):
status: Fix Committed → Invalid