RAID goes into degrade mode on every boot 12.04 LTS server

Bug #990913 reported by RamJett on 2012-04-29
This bug affects 28 people
Affects: mdadm (Ubuntu)
Importance: Medium
Assigned to: Unassigned

Bug Description

I have 2 new Dell PowerEdge R515 servers.
Each box has 2 internal SAS drives and 12 hot-swap drives.
I created a 12-drive RAID 6 array on the 12 hot-swap drives
without problems, but on every reboot I get a message saying the
array is either degraded or does not have enough drives for RAID 6.

If I shut down, pull all 12 drives, let the machine boot, and then plug the drives back in,
the array comes up fine almost every time.

This happens on both of these new R515s.

I also loaded 12.04 LTS server on an R710, and there
the mpt2sas driver loads and then times out. I had to
add rootdelay=180 to the boot parameters.
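
A minimal sketch of how a boot parameter such as rootdelay=180 might be made persistent on a GRUB 2 system. The helper name add_kernel_param is hypothetical (it is not part of any package), and the sketch assumes the stock layout of /etc/default/grub with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line and a parameter that contains no "/" character.

```shell
# Hypothetical helper (not from the report): append a kernel parameter to
# GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults file, unless it is already there.
# Assumes a double-quoted value and a parameter without "/" in it.
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}

    # Nothing to do if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi

    # Rewrite the line with the parameter appended inside the quotes.
    sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"/" \
        "$file" > "$file.new" && mv "$file.new" "$file"
}

# Usage (as root), followed by regenerating grub.cfg:
#   add_kernel_param rootdelay=180 && update-grub
```

The second argument exists only so the function can be tried on a copy of the file before touching the real one.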

Description: Ubuntu 12.04 LTS
Release: 12.04

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/990913/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
RamJett (2-rodney) on 2012-04-30
affects: ubuntu → initramfs-tools (Ubuntu)
Joseph Salisbury (jsalisbury) wrote :

Do you know if this issue happened in a previous version of Ubuntu, or is this a new issue?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc4-precise/

Changed in initramfs-tools (Ubuntu):
importance: Undecided → Medium
affects: initramfs-tools (Ubuntu) → linux (Ubuntu)
tags: added: needs-upstream-testing

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 990913

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
RamJett (2-rodney) wrote :

Installed linux-image-3.4.0-030400rc4-generic_3.4.0-030400rc4.201204230908_amd64.deb upstream kernel and still have same issue.

tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
RamJett (2-rodney) wrote :

Today I found a way to bypass this problem, but I know there is still a problem. On the second R515
I changed the partitioning for the RAID drives and root/boot. The old layout was something like this (with the changed layout the box now boots fine every time):

/dev/sda1 /boot ext2
/dev/sda2 / ext4

/dev/sd[b,c,d,e,f,g,h,i,j,k,l,m]1 swap
/dev/sd[b,c,d,e,f,g,h,i,j,k,l,m]2 raid

/dev/md0 lvm

======= changed to scheme below

/dev/sda1 /boot ext2
/dev/sda5 lvm-root /
                    lvm-swap swap

/dev/sd[b,c,d,e,f,g,h,i,j,k,l,m]1 raid

/dev/md0 lvm

===========

So either this gives it more time to initialize, or it does not like swap across all those drives, or (my guess) RAID is not being detected correctly on /dev/sd[b,c,d,e,f,g,h,i,j,k,l,m]2 but is fine on partition 1.

RamJett (2-rodney) on 2012-04-30
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Andrew Thrift (andyonfire) wrote :

I too am experiencing this bug on 12.04 with a SuperMicro LSI2008 card.
This was working perfectly on Ubuntu 10.x and 11.10, it is only with the upgrade to 12.04 that this problem has occurred.

Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report at bugzilla.kernel.org [1]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

If you are comfortable with opening a bug upstream, It would be great if you can report back the upstream bug number in this bug report. That will allow us to link this bug to the upstream report.

[1] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged

Not sure how this would be an upstream (kernel) bug. My tests do not indicate that.

I just now tried a Gentoo live CD and the drives come up fine. I do have to manually run # mdadm -As at the prompt, but that is because the live disk does not autostart the RAID.

The Gentoo live disk is using kernel 3.3.0.

Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu):
status: Triaged → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Also was there a previous kernel version that did not have this issue? Maybe you can test some prior kernels, such as the Oneiric kernel?

I am having the same symptoms, though if the issue is related it is not described sufficiently here.

What is happening is that the kernel initiates the md components, and then the init scripts continue before all controllers and disks are up. This happens about 5 seconds into the initramfs boot process.
In my case the message says 'degraded' because one of the disks in the raid is on an onboard SATA port, and the rest are on two LSI mptsas cards. I suspect it would simply not see any raid devices at that point if I moved that disk to a SAS card port as well.

An extract of the boot messages (copied by hand, so they may contain errors):
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
md: raid10 personality registered for level 10
done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... ** WARNING: There appears to be one or more degraded RAID devices **
<snip text asking for continue yes/no with instructions how to set bootdegraded=true>

Then while dropped to the initramfs shell, the boot process continues:
ioc0: LSISAS1068E B3: Capabilities={Initiator}
scsi6 : ioc1: LSI53C1020A A1, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=16
scsi7 : iosc0: LSISAS1068E B3, FwRev=01210000h, Ports=1, MaxQ=483, IRQ=16
Etcetera.

This continues until 30 seconds into the boot process, when the kernel brings up the md devices:
md/raid:md0 raid level 6 active with 8 out of 8 devices, algorithm 2
<snip>
md/raid:md1 raid level 5 active with 4 out of 4 devices, algorithm 2
<snip>

Doing a cat /proc/mdstat at this point shows the devices are clean and healthy.

tim (acmeinc) wrote :

I am having the same issue: mdadm RAID 5 degraded on almost every reboot (90%). I am free to test anything you would like.

Doug Jones (djsdl) wrote :

I am having similar problems. I am running Oneiric. I am NOT using LUKS or LVM.

Symptoms vary in severity a lot. Sometimes it simply drops a spare, and it's listed in palimpsest as "not attached". One click of the button and it's reattached, and shown as "spare".

But then sometimes it gets really hairy. These nightmares usually start when I shut down the system, and it appears to hang during shutdown.

Now that this has been going on for a while, I *always* check the status of my drives and arrays immediately before shutting down. First I shut down all apps, then I start palimpsest. I check the SMART health of all drives (all are healthy, except for one that has one bad block, and that never changes). Then I check the arrays; all are running and idle. I also drill down and check the array components, to make sure they are all attached. If I find one that isn't attached, I attach it. I don't shut down until everything looks good.

Then I shut down, and cross all my fingers and toes.

About half the time, shutdown never completes. It hangs on the purple screen, with "Ubuntu" and five dots that don't crawl. I watch the drive activity light; nothing. No drive activity at all.

Then I wait and wait and wait, wasting my valuable time (well, valuable to me anyway) until I get fed up.

Then I do what my mommy always told me, and shut down with Alt-SysRq

qe2eqe (qe2eqe) wrote :

I am also having the same issue, mdadm raid 5 (no lvm). Mine is compounded by the fact that the screen is entirely purple... if I reboot while booting and it asks me to choose a kernel, I see the 'boot degraded y/n' screen and then it drops to the busybox shell. Interacting with the purple screen as if it were the busybox screen results in a completion of the boot process.
I just upgraded from 11.10, where I did not have this problem.

Doug Jones (djsdl) wrote :

(Ooops, apparently hit the wrong key... continuing the previous comment)

...shut down with Alt-SysRq REISUB. This has no effect whatsoever. The screen doesn't change; the drive activity light does nothing.

Finally, after stewing for a while longer, I hold down the power switch until I hear all the fans powering down.

Then I boot up. I see no error messages. Everything seems to be working fine, except the part about having to boot it three or four times before it actually gets past the GRUB splash screen and arrives at the Ubuntu splash screen. After that, everything looks great... I log in, and get to Unity, and I never saw any error message going by.

Then, the first thing I do is start up palimpsest and check the drives and arrays. The drives are always fine, but generally about half of the arrays are degraded. Sometimes it will start re-syncing one of the arrays all by itself; usually it starts with an array that I don't care so much about, and I can't do anything about the ones with more important data until later, because apparently palimpsest can only change one RAID-related thing at a time. Which means that sometimes I have to wait for maaaaaaaany hours to start working on the next array.

The worst I've seen was the time it detached two drives from my RAID6 array. Very scary.

I have one RAID6 array, one RAID10 array, and several RAID1 arrays. I think all of them have degraded at one time or another. This bug seems to be an equal opportunity degrader. Usually I find two or three of the larger arrays are degraded, plus several detached spares on other arrays.

This system has six 2TB drives. I think some of them have 512 byte sectors, and some have 2048 byte sectors; how the heck do you tell, anyway? All use GPT partitions, and care has been taken to align all partitions on 1MB boundaries (palimpsest actually reports if it finds alignment issues).

The system has two SATA controllers. I put four drives on one controller, and two on the other, and for the RAID1 and RAID10 arrays I make sure there are no mirrors where both parts are on the same controller, or both parts on drives made by the same company. Except, that isn't really true any more; whenever something gets degraded and I have to re-attach and re-sync, the array members often get rearranged. I think most of my spares are now concentrated on a couple of drives, which isn't really what I had planned. I've given up on rearranging the drives to my liking, for the duration.

In fact, for the duration, I've given up on this system. I've been gradually moving data off it, onto another system, which is running Maverick, and it will continue to run Maverick because it doesn't try to rearrange my data storage every time I look at it sideways. (Verrrrrry gradually, since NFS has been broken for the better part of a year...)

This nice expensive Oneiric system will be dedicated to the task of rebooting, re-attaching, and re-syncing, until Oneiric starts to behave itself. I am planning to also install Precise (multiboot) so I can test that too. Attempting an OS install while partitions are borking themselves on every other reboot sounds like fun.

BTW, ...


Doug Jones (djsdl) wrote :

I have now installed Precise on my system. (I had intended to install as a multiboot, along with the existing Oneiric, but apparently the alternate installer could not recognize my existing /boot RAID1 partition, so now I can't boot Oneiric. But that's another story...)

Note that the title of the original bug report refers to 12.04 Server, but I have a Desktop system, installed with the Alternate disk.

This time I installed / on a non-RAID partition. My pre-existing RAID partitions are now mounted as directories in /media, except for /boot, which is still on the same MD partition as before.

I have now rebooted several times since installing 12.04. The previous behavior of hanging during shutdown has not recurred. Also, pleasantly, the previous behavior of hanging during boot (between the GRUB splash and the Ubuntu splash) has also not recurred.

I am getting error messages on the Ubuntu splash screen (under the crawling dots) about file systems not being found. I have seen these occasionally for many years, and have become quite accustomed to them. It says I can wait, or hit S to skip, or do something manually; I wait for a while, but soon give up on that and hit S because waiting NEVER accomplishes anything. I'm not sure why that option is even mentioned.

Fortunately, this has not been happening with my /, so I can successfully log into Ubuntu.

Once there, I start up palimpsest (Disk Utility) and look at the RAID partitions. Generally, about half of them are degraded or simply not started.

The ones that are not started are the ones mentioned in the error messages on the splash screen. I can start them from palimpsest; sometimes they start degraded, sometimes not.

After about an hour of work, all of the degraded partitions are fully synchronized. I usually have to re-attach some components as well. Haven't lost any data yet.

Sometimes I cannot re-attach a component using palimpsest and have to drop to the command line, zero the superblock, and then add the component. This has always worked so far. I only noticed this particular behavior since installing Precise.

In short: On this system, RAID usually degrades upon reboot. It did this with Oneiric (but only starting a few weeks ago) and it does this with a freshly installed Precise.

Around the time this behavior started with Oneiric, I did a lot of maintenance work on this hardware, including:

1) swapping out one hard drive

2) putting some 1.2 metadata RAID partitions on, where previously all were 0.90 metadata

I have not noticed any correlation between metadata version and degradation. Any of them can get degraded, in an apparently random fashion.

Between reboots, the system runs just fine. Hard drive SMART health appears stable. The newest hard drive is reported as healthy.

Doug Jones (djsdl) wrote :

Since my last comment, an updated kernel arrived via Update Manager. Its changelog included the following:

   * md: fix possible corruption of array metadata on shutdown.
    - LP: #992038

This seems possibly relevant. I updated, and have now rebooted several times. The RAID degradation is still happening, on every reboot. As before, the system runs just fine after I finish fixing up RAID.

I am now keeping detailed notes on which partitions are being degraded. Since it takes me anywhere from fifteen minutes to several hours to accomplish each reboot and ensuing repair, and I have other things to do as well, it will be a while before meaningful statistics are accumulated.

Further details I forgot to mention earlier: This is an AMD64 system with 8GB of ECC RAM. Have attached most recent dmesg.

Same for me.
System: Supermicro X9SCL-f CPU XEON E31220 RAM 16GB ECC 2xAdaptec 1430SA with 7xWD20EARS md-RAID5

Nearly every reboot ends in a degraded RAID with an initramfs prompt. Resuming the boot appears to work fine.
It seems to me to be a timing problem when loading the needed modules.
What helps on my system:
/etc/initramfs-tools/initramfs.conf:
-MODULES=most
+MODULES=dep

or

-MODULES=most
+MODULES=list

with

/etc/initramfs-tools/modules:
async_xor
async_pq
async_memcpy
async_raid6_recov
raid6_pq
async_tx
raid456

Don't forget to run mkinitramfs:
 mkinitramfs -o /boot/initrd.img-{version} {version}
with the actual version (3.2.0-24-generic for me).
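
To confirm that a rebuilt image really contains the modules listed above, the file listing of the image can be checked. The sketch below is an illustration only: lsinitramfs ships with initramfs-tools, but the helper missing_modules is hypothetical, and it assumes each module is stored as a file named <module>.ko with exactly the spelling given.

```shell
# Hypothetical helper: read an initramfs file listing on stdin and print
# every module name given as an argument that does not appear as <name>.ko.
missing_modules() {
    listing=$(cat)
    for mod in "$@"; do
        case "$listing" in
            *"/$mod.ko"*) ;;                 # found, nothing to report
            *) printf '%s\n' "$mod" ;;       # not in the image
        esac
    done
}

# Usage on a real system (no output means everything was found):
#   lsinitramfs /boot/initrd.img-3.2.0-24-generic | missing_modules raid456 raid6_pq async_xor
```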

Hope this helps ^^

Doug Jones (djsdl) wrote :

Precise is using a 3.2.0 kernel. There is a known MD bug that affects some 3.2.x and 3.3.x kernels, that seems like it might be relevant to this problem. See:

http://www.spinics.net/lists/raid/msg39004.html

and the rest of that thread. Note the mention of possible racing in scripts.

Unfortunately for us, the lead MD developer does not test with Ubuntu, or with any other Debian-based distro. (He only uses SUSE.) So if there are any complex race conditions or other problems created by Ubuntu's udev scripts or configs or whatever, he might not uncover them in his testing, and the level of assistance he can provide is limited. (He and the others on the linux-raid list are indeed helpful, but I'm not sure that very many of them use Ubuntu, and the level of the discussion there is fairly technical and probably well beyond what most Ubuntu users could follow.)

Now that Canonical has announced the plan to eliminate the Alternate installer and merge all installer functionality (presumably including RAID) into the regular Desktop installer, it seems likely that the number of users setting up RAID arrays will increase. (I am using Desktop myself, not Server).

For some time now, it has been possible to set up and (to a limited degree) manage software RAID arrays on Ubuntu without any knowledge of the command line. So there are Desktop users who are using RAID arrays, thinking they are safeguarding their data. But when the complex creature known as linux software RAID breaks down, as it has with this bug, they are quickly in over their heads. Given that RAID bugs can destroy the user's data, just about the worst thing that can happen, it would seem prudent to either (1) actively discourage non-expert users from using RAID, or (2) make Ubuntu's implementation of RAID far more reliable.

vak (khamenya) wrote :

The "dep" method from Andreas Heinze (andreas-heinze) didn't help me. As for the "list" method, I'm not sure which modules would be right for my case.

Brendan Lewis (sirblew) wrote :

I'm having this issue too. RAID 5 across 4 disks. All OS partitions are on a separate USB drive. Every boot since upgrading to Precise will hang with a degraded RAID and drop to root shell. I have to rebuild the array to get it to boot again. I don't reboot that often so sometimes forget that this is an issue but it's definitely only started since upgrading to 12.04.

Any ideas on how to work around this until it's fixed?

jonaz__ (jonaz-86) wrote :

I have the same issue.

The problem is that /usr/share/initramfs-tools/scripts/mdadm-functions
is called before all drives have been initialized.
I have 6 drives in the RAID array: 2 are on onboard SATA and 4 are on an mpt2sas (SAS2008) card.

Apparently mdadm tries to initialize the array before all 6 drives have been attached to the system.

If I edit mdadm-functions like this, everything works (ugly fix):

degraded_arrays()
{
    sleep 15
    mdadm --misc --scan --detail --test >/dev/null 2>&1
    return $((! $?))
}

This bug appeared when I upgraded from 10.04 LTS to 12.04.1 LTS today!
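
The fixed sleep above waits the full 15 seconds even when the disks are already present, and may still be too short on slower controllers. A bounded poll is one possible alternative. The sketch below is not part of the mdadm package; the helper name wait_for_disks, the sd* naming and the expected count are assumptions for illustration.

```shell
# Hypothetical helper: wait until at least $1 entries named sd* exist in
# directory $3 (default /sys/block), giving up after $2 seconds.
# Returns 0 when enough disks are present, 1 on timeout.
wait_for_disks() {
    expected=$1
    timeout=$2
    dir=${3:-/sys/block}
    elapsed=0
    while :; do
        count=0
        for entry in "$dir"/sd*; do
            # An unmatched glob stays literal, so confirm the entry exists.
            if [ -e "$entry" ]; then
                count=$((count + 1))
            fi
        done
        if [ "$count" -ge "$expected" ]; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Possible use in place of the fixed sleep, for a 6-disk array:
#   wait_for_disks 6 30
```

Counting whole disks is only a rough proxy: a disk can be visible before its partitions have been scanned, so a short extra delay may still be needed.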

Graeme Christie (gradme) wrote :

+1

graemec@tosser:~$ uname -a
Linux tosser 3.0.0-12-generic-pae #20-Ubuntu SMP Fri Oct 7 16:37:17 UTC 2011 i686 athlon i386 GNU/Linux

Simon Bazley (sibaz) wrote :

I have a similar problem, but suspect the issue I'm having means it must be either down to code in the kernel or options unique to my ubuntu/kernel config. /proc/version reports: 3.2.0-30-generic.

In my case, I have 5 disks in my system, 4 are on a backplane, connected directly to the motherboard, and the 5th is connected where the cd should be.

These come up on linux as /dev/sd[a-d] on the motherboard and /dev/sde on the 5th disk.

I have installed the OS entirely on the 5th disk, and configured grub/fstab to identify all partitions by UUID. fstab does not reference any disks in /dev/sd[a-d].

The intention is to set up software RAID on disks a-d, presented as /dev/md0.

I created a RAID5 array with 3+spare, and one of the disks died. So I have a legitimately degraded array, which the OS should not need in order to boot.

However, it won't boot, with or without 'bootdegraded=true'.

Not sure editing mdadm functions will help as really I don't want any md functions to run at initramfs time. They can all wait until after it's booted.

Any thoughts on how I can turn off mdadm completely from initramfs?

Simon Bazley (sibaz) wrote :

I thought it helpful to link to https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/872220 which seems related, although it pertains specifically to inappropriate boot failures due to irrelevant disks being degraded, rather than race conditions that cause disks to be erroneously marked degraded.

It's stating the obvious but I found I could boot normally with degraded softraid disks by uninstalling mdadm, which removes the mdadm scripts from the /usr/share/initramfs-tools folder and the initrd image.

I'll try to find a fix for my problem, but I think 872220 is a more appropriate place for it to be tracked and fixed.

Dimitri John Ledkov (xnox) wrote :

Have you tested with mdadm from precise-updates / quantal? And can you still reproduce this issue?
mdadm 3.2.5-1ubuntu0.2 was published on 2012-08-13.
It now waits for udev to settle before treating arrays as degraded.

tim (acmeinc) wrote :

For the past 2 months or so I haven't had a degraded array as a result of a reboot. In this time I have rebooted my computer at least 25 times for various reasons. If there was an update, I didn't notice it in the list of updates from apt. Still, for all intents and purposes my issues have gone away: previously nearly 90% of reboots resulted in a degraded array, and none have since.

Tom Mercelis (tom-mercelis) wrote :

What is the status of this bug supposed to be in Ubuntu 12.04 with kernel 3.2.0-32-generic #51-Ubuntu SMP Wed Sep 26 21:33:09 UTC 2012 x86_64?

I updated to this version (coming from 11.10) last weekend, and now my "nested" raid always starts in degraded mode.

What I can reproduce every time:
Power up system, it shows a message saying the device md5 is starting in degraded mode. I can log into Gnome (so I'm not stopped by any busybox).

I have three soft RAID devices reported by /proc/mdstat
md3 : active raid1 sdd1[0] sdc1[1]
      521984 blocks [2/2] [UU]

md5 : active raid5 sdc3[2] sdd3[1]
      3859889152 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]

md4 : active raid0 sdb1[0] sda1[1]
      1931640448 blocks super 1.0 64k chunks

As you can see, md5 claims to be missing a device. This is not true; the device it's missing is md4.
I can add this device:
mdadm --add /dev/md5 /dev/md4
mdadm: added /dev/md4

Which results in this in /proc/mdstat:

Personalities : [raid0] [raid6] [raid5] [raid4] [raid1] [linear] [multipath] [raid10]
md3 : active raid1 sdd1[0] sdc1[1]
      521984 blocks [2/2] [UU]

md5 : active raid5 md4[3] sdc3[2] sdd3[1]
      3859889152 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [>....................] recovery = 0.0% (1840384/1929944576) finish=568.6min speed=56515K/sec

md4 : active raid0 sdb1[0] sda1[1]
      1931640448 blocks super 1.0 64k chunks

unused devices: <none>

After about 10 hours of syncing, the system is running fine.

But when I reboot, md5 is again started without md4.

The mdadm.conf file hasn't changed since my 11.10 installation, and it is included in my initrd.img. (see attachment)
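
The degraded state visible in the /proc/mdstat output above (the "_" in [_UU]) can also be detected from a script, which is handy when comparing many reboots. This is a minimal sketch, assuming the mdstat layout shown in this report; the function name degraded_from_mdstat is hypothetical.

```shell
# Hypothetical helper: print the name of every md array whose member map
# (for example [_UU]) contains "_", which marks a missing member.
# Reads /proc/mdstat by default; another file can be given for testing.
degraded_from_mdstat() {
    awk '
        # Remember the array name from lines such as "md5 : active raid5 ...".
        /^md[0-9]+ :/ { name = $1 }
        # The status line ends with the member map, e.g. [UU] or [_UU].
        name != "" && /\[[U_]+\]/ {
            if ($NF ~ /_/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Usage:
#   degraded_from_mdstat
```

Arrays without redundancy, such as the raid0 md4, have no member map and are therefore never reported.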

Tom Mercelis (tom-mercelis) wrote :

Sorry, previous post wasn't completed. To this post attached: the dmesg output. It seems md finds md4 before it starts with md5. Yet, /proc/mdstat returns another order, and most importantly: md5 is started in degraded mode.

What I tried:
- added "rootdelay=6" as kernel option in grub
- added this in /usr/share/initramfs-tools/scripts/mdadm-functions
degraded_arrays()
{
 echo "snoozing another 10 seconds for RAID"
 sleep 10;
 mdadm --misc --scan --detail --test >/dev/null 2>&1
 return $((! $?))
}
(and update-initramfs afterwards).
- added "containers" in the DEVICE line of mdadm.conf, as the manual suggests: "The word containers will cause mdadm to look for assembled CONTAINER arrays and included them as a source for assembling further arrays."

None of those solved the problem. What should I do next?

Phillip Susi (psusi) wrote :

This would be an issue with the mdadm init scripts, not the kernel. Does adding rootdelay=180 solve the problem for everyone?

affects: linux (Ubuntu) → mdadm (Ubuntu)
Tom Mercelis (tom-mercelis) wrote :

No, adding rootdelay=180 doesn't solve the problem. I have now configured the system not to start with a degraded array, which results in the boot process being interrupted by an initramfs/rescue shell. In that shell I stop the RAID (mdadm --stop /dev/md5) and re-assemble the arrays (mdadm -A --scan), then exit the rescue shell, and the system boots fine with the RAID fully operational and no hours of re-syncing. So it seems to me that something affects the order in which the RAID devices are brought up. I've been trying different settings for the root delay, and even found some BIOS options to give the disks more time to start before grub is started, but only manually stopping the degraded array and reassembling it from the rescue shell works so far.
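
The manual recovery described above can be written as a small function, which also allows a dry run before anything is stopped. The two mdadm invocations are the ones quoted in the comment; the wrapper name reassemble and its dry-run argument are hypothetical additions.

```shell
# Hypothetical wrapper around the two commands used in the rescue shell.
# $1 is the wrongly assembled array; pass "echo" as $2 for a dry run that
# only prints the commands instead of executing them.
reassemble() {
    array=$1
    run=${2:-}
    # $run is intentionally unquoted: when empty it expands to nothing.
    $run mdadm --stop "$array" && $run mdadm --assemble --scan
}

# Dry run:   reassemble /dev/md5 echo
# For real:  reassemble /dev/md5     (then "exit" to continue booting)
```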

Kind regards

Phillip Susi (psusi) wrote :

Can you post the output of blkid and mdadm -D /dev/md5?

Tom Mercelis (tom-mercelis) wrote :

of course, see attachment

Phillip Susi (psusi) wrote :

You appear to have a raid5 built out of two disks and another raid device, which I think is the problem.

Tom Mercelis (tom-mercelis) wrote :

That's correct, one of the 3 devices is another array. But this setup worked in previous Ubuntu releases. If I remember correctly I created the array in Ubuntu 10.04, and it has since worked in 11.04 and 11.10. It's always been the same array, I know that at least once it was a "clean" new install (new root partition), and from 11.10 to 12.04 was an upgrade in the same root partition.

So something must have changed in the startup scripts that start the arrays for the root device, or in md. And the fact that I can still get it working by re-assembling the array in the rescue shell makes me believe it's in the scripts.

fwiw i found a workaround. first i thought it might involve containers but i couldn't get those working at all. so i cheated.

eg: (compressed process, took place more incrementally)

mdadm --create /dev/md/vol0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md/vol1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
mdadm --create /dev/md/vol2 --level=linear --raid-devices=2 /dev/sde /dev/sdf

(thus three 3T arrays created; already have 3 3T single drives too):

mdadm --create /dev/md/Vault --level=5 --raid-devices=6 /dev/sdg /dev/sdh /dev/sdi /dev/md/vol0 /dev/md/vol1 /dev/md/vol2

(and put a filesystem on it):

mkfs.ext4 -L Vault (other-options) /dev/md/Vault

(I think you got that far already. This is the workaround):

echo "AUTO -all" >> /etc/mdadm/mdadm.conf

(don't automatically assemble any arrays, just those listed in mdadm.conf. Specifically):

mdadm --detail --scan | grep vol >> /etc/mdadm/mdadm.conf

(So, we only assemble the sub-arrays automatically during normal mdadm startup)

Then into /etc/rc.local:

mdadm --assemble --config=partitions --scan

(*now* auto-enable any other arrays you can. /dev/md/Vault gets assembled then.)

sleep 10

(probably excessive delay to make sure it's assembled)

mount -L Vault -o rw,nosuid,nodev,relatime,uhelper=udisks2 /media/Vault

so yeah, you can't have Vault mounted in fstab because the array isn't created in mdadm startup; so this *is* a workaround but it *does* work.

couple of minor elaborations to the above:

update-initramfs -u

so the edited mdadm.conf gets into the initramfs

and in /etc/rc.local:

mdadm --assemble --config=partitions --no-degraded --scan

so it doesn't try to start the array if it would be degraded, thus probably requiring a rebuild - as the likely reason for that at this stage is because the update-initramfs step was missed above; or some other failure to start the sub-arrays. :-}

Phillip Susi (psusi) wrote :

Just having the arrays listed in mdadm.conf ( and updating the initramfs ) is not enough without modifying rc.local?

fairly sure I tried that, and no, it didn't work. Either the order of the arrays in mdadm.conf isn't significant, or they initialise in too-quick-succession and the one that depends on others is started too early, and thus fails. That's how I ended up here. :-)

thinking about it, when I tried that before i may have missed two elements, both of which were probably necessary: The "AUTO -all" line to prevent auto-assembly by scanning, and the update-initramfs step, as I learnt about both of those things after I'd given up trying to do it that way and was working on doing it the way I eventually ended up with. :-)

As I understand then, it would merely be sufficient to do (assuming older relevant lines are removed):

echo "AUTO -all" >> /etc/mdadm/mdadm.conf
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u

I'm a bit wary of doing it on my system now, especially as it's in the middle of a reshape, but I may try it later unless someone else reports here that it doesn't work. :-)

Tom Mercelis (tom-mercelis) wrote :

I added the AUTO -all to my mdadm.conf and updated the initramfs. The result is that no array is assembled automatically at startup and I enter the recovery shell. Then I perform an mdadm -A -s and all arrays are assembled correctly. This assembly takes about 2-3 seconds and correctly assembles the array depending on another array. I'm not sure what script causes the wrong assembly, but the mdadm tool itself seems to do it fine.

Andrew Martin (asmartin) wrote :

I have also noticed this bug on an Ubuntu 12.04 server. The workaround I've come up with is:
* install the backported Quantal kernel (3.5.x) by installing the linux-generic-lts-quantal package
* add the following patch to /usr/share/initramfs-tools/scripts/mdadm-functions:
--- /tmp/a/mdadm-functions 2013-07-01 12:28:46.896519157 -0500
+++ /tmp/b/mdadm-functions 2013-07-01 12:28:55.136677837 -0500
@@ -3,6 +3,9 @@

 degraded_arrays()
 {
+ udevadm settle
+ echo "Waiting for RAID arrays to be ready..."
+ sleep 20
        mdadm --misc --scan --detail --test >/dev/null 2>&1
        return $((! $?))
 }

Note that just adding "udevadm settle" was not enough in my case - the sleep was also required.
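
A fixed "sleep 20" costs the full 20 seconds on every boot. One possible alternative, shown here only as a sketch, is to poll for the member devices and stop waiting as soon as they all exist. wait_for_paths is a hypothetical helper, the device names in the usage comment are placeholders, and this has not been tested inside an initramfs.

```shell
#!/bin/sh
# Sketch: poll until every listed path exists, or give up after a timeout.
# wait_for_paths is a hypothetical helper, not part of initramfs-tools.
# Usage: wait_for_paths TIMEOUT_SECONDS PATH...
# Returns 0 once all paths exist, 1 if the timeout expires first.
wait_for_paths() {
    timeout=$1
    shift
    elapsed=0
    while :; do
        missing=0
        for p in "$@"; do
            [ -e "$p" ] || missing=1
        done
        [ "$missing" -eq 0 ] && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Possible use inside degraded_arrays(), replacing the fixed sleep
# (placeholder device names):
#   udevadm settle
#   wait_for_paths 20 /dev/sdc1 /dev/sdd1
```

The worst case is still a 20 second wait, but a boot where the disks appear quickly is not held up.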

Alex Sorokine (sorokine) wrote :

Looks like exactly the same issue affects 13.04 with 3.8 kernels. I have a system with the OS installed on an SSD and a 2-disk 1TB RAID1 holding data. After upgrading to 13.04 I started to get a message about a failed RAID on every boot, as described in this ticket. The RAID had not in fact failed. As a workaround I am forcing grub to boot the 3.5.0-34 kernel, which does not have this problem. I noticed this problem with kernels 3.8.0-26 and 3.8.0-27.

Thomas Maerz (thomasmaerz) wrote :

I was experiencing this issue on 12.04. I rebuilt the system with 12.10 when that came out. The issue persisted throughout all updates during the period between release of 12.10 and 13.04. I did a dist-upgrade to 13.04 and the issue still persists.

AMD64 platform, server dist. I am using the onboard 890GX SATA controller for some of the mdadm drives, and some of them are on a SuperMicro SAS2LP card.

Tom Mercelis (tom-mercelis) wrote :

I started experiencing this problem on 12.04, it persisted after upgrading to 12.10 and even persisted after a clean install (on an empty partition) of 13.04. It also happens when I just boot from a USB stick with 13.04.
This is always with the same partitions... and I'm guessing that might be part of the problem. These partitions were created years ago, and have different metadata versions. The raid5 partition has 0.90 metadata, but the raid0 partition used as part of the raid5 has version 1.0 metadata. Could this cause the problem?
Still... isn't there anyone who can explain why the regular startup of Ubuntu fails to start these raids, while doing mdadm --stop on all devices and then running mdadm -A -s does succeed, even on a USB boot stick without any /etc/mdadm.conf?

Thomas Maerz (thomasmaerz) wrote :

Tom Mercelis,

I have wiped the array all the way down to the individual disk partition tables multiple times and the issue still persisted for me. I have finally solved this problem by switching to CentOS last night. Looks like that's going to be the only solution to this for a while, since this issue has been open for a year and a half and they haven't done much to solve it.

krutemp (krutemp) wrote :

Hi all, I have the same problem with a fresh installation of Ubuntu 12.04.3 server, but I found a workaround that seems to work without reinstallation.

Problem:
I first installed the OS on 2 hard disks configured as raid1 with swap (md0) and / (md1) partitions.
Next I added 2 more disks and, via webmin, created a new raid1 with a single ext4 partition (md2) and had it mounted at boot under /mnt/raid (again, via webmin).
It worked for a week, both md1 and md2. Very likely I updated the kernel image via aptitude during that time (I now have 3.8.0-30-generic) but did not reboot afterwards. Today I needed to reboot and the system did not come up. It didn't even show the boot messages.
At boot time, editing the boot options and removing the line gfx_mode $linux_gfx_mode made it print the boot status messages, which showed that it eventually ended up at an initramfs console.
It did boot when I chose an older kernel version from the boot menu (like 3.8.0-29-generic), although during the boot process it asked whether I wanted to skip the raid setup for md2. With the previous image it showed the boot output.
However, it was not able to mount md2 and showed a new raid device, md127.

Solution:
So after reading this report and other pages, here is the workaround I used:
1. booted with a previous version and logged in
2. changed /etc/mdadm/mdadm.conf
From:
# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=5e60b492:1adc87ee:8e9a341d:df94a8da name=my-server:0
ARRAY /dev/md/1 metadata=1.2 UUID=799d1457:5131d974:1ca3384e:ce9e9e77 name=my-server:1
# This file was auto-generated on Wed, 11 Sep 2013 14:33:45 +0200
# by mkconf $Id$
DEVICE /dev/sdc1 /dev/sdd1
ARRAY /dev/md2 level=raid1 devices=/dev/sdc1,/dev/sdd1

To:
ARRAY /dev/md/0 metadata=1.2 UUID=5e60b492:1adc87ee:8e9a341d:df94a8da name=my-server:0
ARRAY /dev/md/1 metadata=1.2 UUID=799d1457:5131d974:1ca3384e:ce9e9e77 name=my-server:1
ARRAY /dev/md/2 metadata=1.2 UUID=563e2286:2f0115a0:bc92cb19:3383af19 name=my-server:2

That is, I changed the definition of the third raid so that it is identified by the UUID of its disks instead of by device names.
To get the UUID of the disks, I used sudo blkid. sdc1 and sdd1 had the following values:
/dev/sdc1: UUID="563e2286-2f01-15a0-bc92-cb193383af19" UUID_SUB="6ec201e5-eaa4-3309-054e-8ec96c26da49" LABEL="my-server:2" TYPE="linux_raid_member"
/dev/sdd1: UUID="563e2286-2f01-15a0-bc92-cb193383af19" UUID_SUB="722bad2f-37f4-3c8e-5619-ca413c9d0b82" LABEL="my-server:2" TYPE="linux_raid_member"

That is, for the UUID of the raid md2 I used the UUID of sdc1 and sdd1, and for its name their label, my-server:2. Note that the UUID and the label of sdc1 and sdd1 are the same; the only difference is the UUID_SUB.

3. I changed /etc/fstab, removing the entry related to md2 and replacing it with one using a UUID:
UUID=0ae06695-1a89-4c05-af34-68465224a71c /mnt/raid ext4 defaults 0 0

4. I issued a "sudo update-initramfs -u"

5. reboot.
It worked for me. I think there are some issues when not using the UUID values.
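
blkid prints the array UUID in the hyphenated 8-4-4-4-12 form, while the ARRAY lines above use four colon-separated groups of eight hex digits. A small sketch of the conversion, assuming the input is a well-formed UUID (blkid_to_mdadm_uuid is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: convert a blkid-style UUID (hyphenated 8-4-4-4-12) into the
# colon-separated 4x8 form used on ARRAY lines in mdadm.conf.
# blkid_to_mdadm_uuid is a hypothetical helper; input that is not 32 hex
# digits after removing hyphens is passed through without colons.
blkid_to_mdadm_uuid() {
    printf '%s\n' "$1" | sed \
        -e 's/-//g' \
        -e 's/^\(.\{8\}\)\(.\{8\}\)\(.\{8\}\)\(.\{8\}\)$/\1:\2:\3:\4/'
}

# Example with the values from this comment:
#   blkid_to_mdadm_uuid 563e2286-2f01-15a0-bc92-cb193383af19
#   prints 563e2286:2f0115a0:bc92cb19:3383af19
```

Alternatively, mdadm --detail --scan (or mdadm --examine --scan) prints ARRAY lines with the UUID already in this form, which avoids editing it by hand.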

I just upgraded an Ubuntu Server 10.04 LTS to 12.04 LTS and was affected by this bug.

# mdadm --version
mdadm - v3.2.5 - 18th May 2012

# uname -a
Linux srv001 3.2.0-70-generic-pae #105-Ubuntu SMP ...

My system is a Dell PowerEdge 2900 with four 250 GB WD hard drives.

Three of them are configured as a RAID5 using mdadm (metadata version is 0.90, creation time was May 2007, chunk size is 64K, layout left-symmetric). The root filesystem is NOT part of the RAID.

Since I upgraded to 12.04.5 I get the following messages from the initramfs every time I reboot the system:

Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... *** WARNING: There appears to be one or more degraded RAID devices **

The system may have suffered a hardware fault, such as a disk drive
failure. The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdc1[0](S) sdd1[1](S)
           488279296 blocks

unused devices: <none>
You may attempt to start the system anyway, or stop now and attempt
manual recovery operations. To do this automatically in the future,
add "bootdegraded=true" to the kernel boot options.
[...]

I cannot skip dropping into a shell:

Dropping to a shell.
(initramfs)

It is possible to just exit the initramfs shell and go on booting. The system comes up and I can manually repair the RAID by adding missing disks and recovering if necessary.

Maybe it is not related to this issue, but while booting I get the following error messages for each of the four drives:
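
To see which arrays need attention after such a boot, the mdstat output can be filtered for the "inactive" state. A minimal sketch (inactive_arrays is a hypothetical helper; it assumes the "mdN : inactive ..." line format shown in the warning above and an awk being available):

```shell
#!/bin/sh
# Sketch: print the names of arrays that mdstat reports as inactive.
# inactive_arrays is a hypothetical helper; it reads mdstat text on stdin
# and relies on lines of the form "md0 : inactive sdc1[0](S) ...".
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Possible use on the booted system (as root):
#   inactive_arrays < /proc/mdstat
# For each name printed, e.g. md0, the sequence reported to work earlier
# in this thread is:
#   mdadm --stop /dev/md0
#   mdadm -A -s
```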

ata_id[234]: HDIO_GET_IDENTITY failed for '/dev/sdb': Invalid argument
ata_id[236]: HDIO_GET_IDENTITY failed for '/dev/sdd': Invalid argument
ata_id[235]: HDIO_GET_IDENTITY failed for '/dev/sdc': Invalid argument
ata_id[237]: HDIO_GET_IDENTITY failed for '/dev/sde': Invalid argument

I can manually reproduce this by running "hdparm -i /dev/sdc":
/dev/sdc:
 HDIO_GET_IDENTITY failed: Invalid argument

According to bug #1029822, it may help to add an additional "exit 0" in line 2 of
/usr/share/initramfs-tools/scripts/local-premount/mdadm
and run "sudo update-initramfs -u" and reboot. (Haven't tested this solution yet)

I tested adding "sleep 15" but without success.

There is a good summary of RAID-related issues at
https://wiki.ubuntu.com/ReliableRaid
