mdadm runs into infinite loop and prevents initrd/initramfs phase to finish on boot

Bug #1335642 reported by vak on 2014-06-29
This bug affects 41 people
Affects: mdadm (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi all,

The issue is probably caused by changing the SATA port to which the RAID disks were originally attached. This is the only thing that happened before my Ubuntu 14.04 (kernel-3.13.0-27) became unbootable.

During boot, the OS enters an infinite loop and periodically prints these messages:

incrementally starting raid arrays
mdadm: Create user root not found
mdadm: create group disk not found
incrementally started raid arrays

Since mdadm is in initrd image, I don't even know how to skip mdadm -- workarounds are very welcome, please!

For example, renaming /etc/mdadm/mdadm.conf in initrd image didn't help, mdadm then just repeats infinitely without group/user-related errors:

incrementally starting raid arrays
incrementally started raid arrays
incrementally starting raid arrays
incrementally started raid arrays
...

The disks are successfully assembled into a RAID under an Ubuntu 13 LiveCD (yes, I only have an old CD here).

Why do I consider this a bug? Because my RAID array is needed at the application level, not at the OS level (e.g. it is not mounted as / or /boot). So if one can't boot into the OS, it is a serious bug. Last but not least, I never chose to put mdadm into the initramfs, so a default decision leads to an unbootable system...

UPDATE1:
In scripts/mdadm-functions of the initrd image I see this:

mountroot_fail()
{
    message "Incrementally starting RAID arrays..."
    if mdadm --incremental --run --scan; then
        message "Incrementally started RAID arrays."
        return 0
    else
        if mdadm --assemble --scan --run; then
            message "Assembled and started RAID arrays."
            return 0
        else
            message "Could not start RAID arrays in degraded mode."
        fi
    fi
    return 1
}

I tried 'mdadm --incremental --run --scan -c <path to etc/mdadm/mdadm.conf of initrd image> -v', and it exits silently without creating /dev/md/127, which I had expected it to create.

Whereas

sudo mdadm --assemble --run --scan -c ./etc/mdadm/mdadm.conf -v

does the job. Is this a problem with incremental mode only?

vak (khamenya) on 2014-06-29
description: updated
Aitor Andreu (foreveryo) on 2014-07-14
no longer affects: mdadm
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed

Hello all!

Same symptom here (it loops on the same message).
My system boots from a USB stick and has 4 SATA disks in RAID 5, with 3 active and one hot spare. One of the disks failed and was kicked out of the array (probably a transient failure, needs further investigation). For some reason I then lost all means of access to the machine, although Apache was still responding. The machine did not react to an ACPI reboot, and I hard-rebooted it while it was rebuilding the RAID onto the spare disk ...
It should have come up fine on 2 disks, but instead it was stuck on the four lines beginning with "Incrementally starting RAID arrays".
Some keying after booting in recovery mode gets me down to a root shell in the initrd. There I can see that the RAID is active but forced read-only, on two disks out of three. If I stop it and assemble it again from the working disks, it is simply active, ready to go, and can be mounted.
I have re-added the failed and spare partitions: if the failing one fails again, rebuilding will proceed on the hot spare. Reconstruction of the missing RAID component is ongoing.

I have operated Linux RAID setups for many years and I have had several disk failures, some transient, some final. The boot scripts can handle that in 10.04 LTS and 12.04 LTS. In 14.04 LTS something changed: could it be that mdadm now assembles the array read-only by default?

Hoping to provide useful information.

Harald Staub (staub) wrote :

Without really understanding, I tried to hack something and found the following bits, use at your own risk. This is on a degraded mdadm RAID1 with the root filesystem as a LV.

The following procedure was found to help:

  * Interrupt grub and edit the linux command line.
  * Go to the end of the line that starts with "linux".
  * Append "break" to the line.
  * Press "Ctrl-X" to boot. You will get a busybox prompt.
  * udevadm trigger --action=add
    * After that, "ls /dev/mapper" shows what is needed for the root parameter: HOST--vg-root
  * exit

The following procedure was found to make the above hack persistent:

  * Add new file /usr/share/initramfs-tools/scripts/init-premount/10hack-raid-udev

#!/bin/sh
sleep 5
udevadm trigger --action=add
exit 0

  * update-initramfs -u
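
A slightly more defensive sketch of the same hook, for anyone adapting it. It adds the "prereqs" stanza that initramfs-tools boot scripts conventionally start with: as far as I know the script is also called with the argument "prereqs" while the image is being built, and without the stanza that call would run the sleep and udevadm lines on the live system. The generator name write_raid_udev_hook is hypothetical. The script has to be executable, otherwise it is likely to be skipped.

```shell
# Hypothetical generator: writes the hook script to the path given as $1.
write_raid_udev_hook() {
    cat > "$1" <<'EOF'
#!/bin/sh
# initramfs-tools calls boot scripts with "prereqs" to work out ordering;
# answer that call and exit before doing any real work.
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0 ;;
esac

# Give slow controllers a moment, then replay "add" events so that
# device nodes missing after the first udev trigger get created.
sleep 5
udevadm trigger --action=add
exit 0
EOF
    chmod +x "$1"
}
```

Usage would be something like "write_raid_udev_hook /usr/share/initramfs-tools/scripts/init-premount/10hack-raid-udev" followed by "update-initramfs -u", both as root.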

Mitar (mitar) wrote :

I had the same problem and after searching I found out that this solution (https://bugs.archlinux.org/task/33851#comment106076) worked for me. I just had to rename all `/dev/md/*` devices to `/dev/md*` devices in `/etc/mdadm/mdadm.conf` and run `update-initramfs -u` to update the initramfs.
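
A minimal sketch of that rename, assuming the arrays use numeric names (`/dev/md/0`, `/dev/md/1`, ...). The function name `flatten_md_names` is hypothetical, and it only prints the rewritten file so that the result can be reviewed before it replaces `/etc/mdadm/mdadm.conf`.

```shell
# Hypothetical helper: rewrite "ARRAY /dev/md/N" to "ARRAY /dev/mdN".
# Reads an mdadm.conf on stdin and prints the result on stdout.
# Only active ARRAY lines with numeric names are touched; commented-out
# lines and named arrays such as /dev/md/data are left as they are.
flatten_md_names() {
    sed -E 's#^([[:space:]]*ARRAY[[:space:]]+)/dev/md/([0-9]+)([[:space:]]|$)#\1/dev/md\2\3#'
}
```

For example, `flatten_md_names < /etc/mdadm/mdadm.conf > /tmp/mdadm.conf.new`, then compare the two files, copy the new one into place and run `update-initramfs -u`.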

Marc Mance (mmance) wrote :

It seems like there are many solutions to this error message. I have tried quite a few myself to no avail. Something I installed via apt-get broke my mdadm. There is a question posted at https://answers.launchpad.net/ubuntu/+question/248396 that also pertains to this.

can anyone chime in on:

/dev/md/0 to /dev/md0 change in mdadm.conf? What is messing this up? What is it supposed to be?

I have two raid0 with 3 drives each. One shows /dev/md/0 and the other shows /dev/md1. ???
I have tried both ways to no avail.

or modules like raid1 or dm-mirror for initramfs?

I have tried raid1, dmraid, dm, dm-mirror

These machines worked great for a week; I rebooted for some updated packages and ran into this. The same thing happened on an almost identical machine.

William Law (wlaw) wrote :

Hi,

Slammed into this last night in a bit of an emergency situation. If some poor soul is running ubuntu this might help resolve it:
http://serverfault.com/questions/593734/mdadm-boot-error-incrementally-starting-raid-array-ubuntu-server-14-04

What is the status of the bug? It sounds like it happens depending on your storage controller; I guess I'd be happy to help get this actually fixed.

Wil

@staub
> After that, "ls /dev/mapper" shows what is needed for the root parameter: HOST--vg-root
Would you clarify where this is needed?

Harald Staub (staub) wrote :

@jtm-moon-forum-user+launchpad

So in my case, the problem is a missing device node. Not just any node, but the one that is given to the kernel with the command line parameter "root=...". This is needed quite early during boot. That is why I inject "break", which means "break=premount", which is even earlier. This gives me the opportunity to create the missing device node by calling "udevadm trigger --action=add". The mentioned "ls /dev/mapper" is just a (very short) explanation to show that this call was successful: the device node is indeed there now.

If you want to have a closer look into the initramfs stuff: This looks like a race condition because this "udevadm trigger --action=add" was already called from /initramfs-tools/scripts/init-top/udev.

Harald Staub (staub) wrote :

It is well possible that while I have the same symptom as the original reporter, the underlying cause is different.

Wladimir Mutel (mwg) wrote :

I fixed my issue by the recipe from #4
I think it would be extremely beneficial to integrate some kind of fix to this problem into _LTS_ release which is expected to be extremely reliable and stable
Thank you in advance for your efforts.

Cal Leeming (sleepycal) wrote :

Indeed, the fix from #4 seems to solve the problem for me. I can confirm that the update in #9 is likely right: this is certainly a race condition. Sometimes the system boots without a problem, and other times it gets stuck.

If I run `ls /dev/mapper` from busybox before executing the `udevadm` command, it doesn't show the LVM volumes. Once I run that command, the devices then appear.

My drive configuration is as follows:

RAID 1 on 2 drives
LVM on Software RAID
Root partition on LVM, not separate boot partition
No encryption

This was an out of the box install of 14.04.3 LTS, and should almost certainly be treated as a serious bug

Cal Leeming (sleepycal) wrote :

Fwiw, I can also confirm that my RAID array is degraded, which indeed looks related to #645575.

Rovanion (rovanion-luckey) wrote :

I've gotten this issue after reordering the SATA ports my devices are plugged into. The MD-device assembles fine on a Live CD and using the steps in #4 in a chroot on the Live CD got the system booting again, but only after having the error reported in the original issue repeated three or four times:

>incrementally starting raid arrays
mdadm: Create user root not found
mdadm: create group disk not found
incrementally started raid arrays

So the machine now takes about 5 minutes to boot instead of 10s.

Hey,
I can confirm that #5 alone fixes the problem, which is way nicer than the hackaround in #4.

I solved it by deleting the /etc/default/grub.d/dmraid2mdadm.cfg file, which added the string "nomdmonddf nomdmonisw" to the kernel load options. These in turn were the culprit for the boot loop, at least in my case.
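
To check whether a system is booting with those two options, a small sketch. The helper name find_nomdmon_opts is hypothetical; the option names are taken from the comment above.

```shell
# Hypothetical helper: print each of the two options that appears in the
# kernel command line passed as $1, one per line. Prints nothing when
# neither option is present.
find_nomdmon_opts() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -xE 'nomdmonddf|nomdmonisw' || true
}
```

On a running system this could be called as find_nomdmon_opts "$(cat /proc/cmdline)". After removing the file from /etc/default/grub.d/, grub.cfg presumably has to be regenerated with update-grub before the options actually disappear.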

Diego Morales (dgmorales) wrote :

I came across this while purposely breaking my sw RAID array to test if booting through the mirror disk or degraded array was OK.

Comment #4 worked for me.

Running ls /dev/mapper in the initramfs shell really showed that my root device (which is on LVM) was not available before running that udevadm command.

There was another catch that took me over an hour: I started this by physically removing the sda disk. Later I reinserted it and kept debugging like that most of the time, without resyncing the arrays, because I wanted it to come up fine in that state. So when I ran update-initramfs, only the files on sdb got updated.

*It took me a while to realize that the boot process was reading the initramfs files from the un-updated re-inserted sda disk.* I then re-synced my /boot and the workaround (#4) worked, duh.
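
A quick way to spot this kind of stale copy, as a sketch only. It assumes both copies of /boot can be reached separately, for example mounted read-only at two mount points; the mount points and the helper name compare_initrd are hypothetical.

```shell
# Hypothetical helper: report whether two copies of an initramfs image
# are byte-identical. $1 and $2 are paths to the same image as seen on
# each disk. A missing or unreadable file also counts as "differs".
compare_initrd() {
    if cmp -s "$1" "$2"; then
        echo "in sync"
    else
        echo "differs"
    fi
}
```

For example: compare_initrd /mnt/sda-boot/initrd.img-$(uname -r) /mnt/sdb-boot/initrd.img-$(uname -r). If it prints "differs", the boot loader may be loading an image that never received the workaround.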

Diego Morales (dgmorales) wrote :

Forgot to add that I am using Ubuntu 14.04 on IBM x3630 M4 hardware. 12.04 apparently did not exhibit this problem.

Moe (m-ubuntuone-moschroe) wrote :

Hi,

As I see it, #1334699 is the bug producing the symptoms described in the initial post. Because many causes show the same symptoms, the fixes seem somewhat arbitrary, and the true problems are nastily hidden.

In my case, the underlying issue also was that a device could not be found (see https://serverfault.com/a/594833).

/boot/grub/grub.cfg showed root=/dev/md0p2 in the cmdline (the system was set up in a live environment where a script had /dev/md0 hard-coded as the RAID device). Removing the quiet parameter let me glimpse mentions of md127 just before the boot entered the infinite loop.

After changing that to the proper form, root=/dev/disk/by-id/yadda-yadda, the system came up just fine. Is there an existing issue about setting the root parameter reliably? I would expect it to be most useful to always use UUIDs here.
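
To see which form of root= a system is booting with, a small sketch. The helper names root_param and is_stable_root are hypothetical, and the list of forms treated as stable (UUID=, LABEL=, PARTUUID=, /dev/disk/by-*, /dev/mapper/*) is my assumption, not something the boot scripts define.

```shell
# Hypothetical helper: print the value of the last root= parameter found
# in the kernel command line passed as $1.
root_param() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^root=//p' | tail -n 1
}

# Hypothetical helper: succeed when the root value in $1 uses a form that
# does not depend on device enumeration order (assumed list, see above).
is_stable_root() {
    case "$1" in
        UUID=*|LABEL=*|PARTUUID=*|/dev/disk/by-*|/dev/mapper/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a running system: r=$(root_param "$(cat /proc/cmdline)"); is_stable_root "$r" || echo "root=$r may not survive renumbering". A value such as /dev/md0p2 would be flagged, since the array can come up as md127 instead.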

PS: WTF. Tentative first post. Was unable to find any info on community rules, best practices, etc. Hope this comment helps someone out there.

Thomas Mayer (thomas303) wrote :

I've just seen this infinite loop in 16.04 with kernel 4.15.

For me, this was just a leftover because I was transitioning from software raid to btrfs raid. Therefore, I had commented out the old software raid definition in /etc/mdadm/mdadm.conf. Which in turn brought me to the loop problem.

Can't the boot process just continue as long as no RAID is defined in mdadm.conf? In any case, there should be no reason to loop here, because mdadm.conf is unlikely to change during boot.
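
The check suggested here could look roughly like the sketch below. It only illustrates the idea and is not what the initramfs scripts currently do; the helper name active_array_count is hypothetical.

```shell
# Hypothetical helper: count ARRAY lines that are not commented out in
# the mdadm.conf given as $1. Prints 0 when every definition is
# commented out or none exists.
active_array_count() {
    grep -cE '^[[:space:]]*ARRAY[[:space:]]' "$1" || true
}
```

A boot script could then skip the retry loop when active_array_count /etc/mdadm/mdadm.conf prints 0, on the assumption that no array is expected in that case.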
