Lucid: Gave up waiting for root device (mptsas) resolved by rootdelay

Bug #579572 reported by Hussein Abdallah on 2010-05-12
This bug affects 8 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Low
Assigned to: Unassigned

Bug Description

I found this bug after a fresh install of Ubuntu 10.04 LTS server (amd64) on a Dell PowerEdge R410 with hardware RAID (SCSI storage controller SAS1068E PCI-Express Fusion-MPT SAS (LSI Logic / Symbios Logic)). Before installing Ubuntu 10.04, this server booted without a problem under CentOS 5.4 (Linux 2.6.18). After the installation finishes and the machine reboots for the first time, I get this error message:
Gave up waiting for root device.
...
ALERT! /dev/disk/by-uuid/[mydiskUUID here] does not exist. Dropping to a shell!
Busybox v1.13.3 (Ubuntu 1:1.13.3-1ubuntu11) built-in shell (ash)
(initramfs)

If I wait between 10 and 20 seconds, I see:
[ 35.538649] sd 4:1:0:0: [sda] 584843264 512-byte logical blocks: (299 GB/278 GiB)
[ 35.539484] sd 4:1:0:0: [sda] Write Protect is off
[ 35.539555] sd 4:1:0:0: [sda] Mode Sense: 03 00 00 08
[ 35.540034] sd 4:1:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 35.541687] sda: sda1 sda2 sda3 sda4 < sda5 >
[ 35.560048] sd 4:1:0:0: [sda] Attached SCSI disk
[ 36.521418] Adding 2096472k swap on /dev/sda3. Priority:-1 extents:1 across:2096472k
[ 36.761904] EXT3 FS on sda2, internal journal
[ 37.181651] EXT3 FS on sda1, internal journal

Then, I type exit and the boot resumes normally. I could reproduce this bug after every reboot.

WORKAROUND: I solved this issue by adding to /etc/default/grub :
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=50"

and running update-grub

Unlike Bug #576302, I can work around the problem with rootdelay=50, and I don't use the mptspi kernel module.
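
The workaround above can be sketched as a small shell helper. This is only an illustration: set_rootdelay is a hypothetical name, it assumes GNU sed (as shipped with Ubuntu) and a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and it only edits the file it is given; update-grub still has to be run afterwards.

```shell
#!/bin/sh
# Hypothetical helper (not part of grub): set rootdelay=<seconds> in the
# GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file.
# Assumes GNU sed and a value enclosed in double quotes.
set_rootdelay() {
    file=$1
    delay=$2
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file"; then
        if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*rootdelay=' "$file"; then
            # A rootdelay option is already present: replace its value.
            sed -i "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/rootdelay=[0-9]*/rootdelay=$delay/" "$file"
        else
            # No rootdelay yet: append it just before the closing quote.
            sed -i "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\"\$/ rootdelay=$delay\"/" "$file"
        fi
    else
        # The variable is not set at all: add it.
        printf 'GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=%s"\n' "$delay" >> "$file"
    fi
}

# Example (as root):
#   set_rootdelay /etc/default/grub 50 && update-grub
# After the next reboot, /proc/cmdline should contain rootdelay=50.
```

Appending to the existing value rather than overwriting it keeps any other options already on the line.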

Hussein Abdallah (abdallah98) wrote :
description: updated
Hussein Abdallah (abdallah98) wrote :

I didn't have this bug on a Dell PowerEdge 1950 (SAS1068 PCI-X Fusion-MPT SAS), which also uses the mptsas module (however, it was running 32-bit Ubuntu 10.04 LTS server, not 64-bit).

ScottMarlowe (scott-marlowe) wrote :

I am having the same problem; after waiting one or two minutes and then typing exit, the machine boots right up. I'm adding the GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=120" line to my /etc/default/grub; we'll see whether that fixes it.

Paul Parisi (pparisi) wrote :

Same problem here on two Dell R300 units (configured with three disks in a RAID 5). We had three machines (one R610 and two R300s) with virtually identical 64-bit Ubuntu software configurations. When they rebooted one day due to a power outage, the R300s never came back up, but the R610 (running a newer, better SAS controller) recovered fine. Our Dell R300s have subsequently been removed from production due to this issue and are being replaced with HP units.

I discussed this with Dell, and they say it's the poor-quality SAS controller in the lower-cost hardware, like the R410 and R300.
They do something dodgy with the controllers to get the cost down.

Basically my understanding of the problem is this:
1. System starts boot process
2. Grub is loaded via BIOS routines (no problems)
3. Ubuntu kernel is bootstrapped into memory via BIOS routines and then run (no problems)
4. Ubuntu kernel loads and uses its proper RAID driver to continue accessing the drives; however, as it has now switched from BIOS routines to the actual driver, the RAID is not ready for access, and hence everything dies.

According to Dell, CentOS probably caters for this issue in its driver. Ubuntu will need to do the same to cater for this fault in the Dell hardware.

The root delay trick works for us too; however, we don't know what the root delay value should be to be confident with it in a production environment. We note the RAID slowed down over a period of a couple of months, so a root delay of 120 might not be enough in the future... We just don't know enough about the poor-quality Dell gear to be sure. Dell had nothing further to offer on this, again with the usual comments about an unsupported OS.

Anyway, the fix, if one can be found, needs to be tested on the dodgy SAS controllers installed in the R300s and other low-cost series. Let me know if you need further technical details from our machines and I will post them.
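
On the question of which rootdelay value is large enough, one rough starting point is the kernel timestamp at which the disk actually attaches, as in the log from the original report. The sketch below is only an illustration: last_disk_attach is a hypothetical helper that reads kernel log lines on stdin and prints the latest "Attached SCSI disk" timestamp rounded up to whole seconds. It says nothing about how much margin is safe, and the attach time may drift between boots.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the latest
# timestamp (whole seconds, rounded up) of an "Attached SCSI disk" message.
# Prints 0 if no such message is found.
last_disk_attach() {
    awk '
        /Attached SCSI disk/ {
            # The timestamp is the text between the first "[" and first "]".
            start = index($0, "[")
            end = index($0, "]")
            t = substr($0, start + 1, end - start - 1) + 0
            if (t > max) max = t
        }
        END {
            secs = int(max)
            if (secs < max) secs++
            print secs
        }
    '
}

# Example:
#   dmesg | last_disk_attach
```

A rootdelay at or below the printed value would presumably still time out, so any value chosen this way needs generous headroom on top.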

ScottMarlowe (scott-marlowe) wrote :

FYI, my machine is using an LSI 8888ELP controller with 32 Seagate 15K6 SAS drives connected.

After I get the error, it takes about 150 seconds before the drives are ready and typing exit lets the machine come up. I set the timeout to 180; I hope that's enough.

As a side note, it takes 44 seconds to run "sudo lshw" on this machine, and almost all of that time is used to look up the SCSI subsystem.

Hussein Abdallah (abdallah98) wrote :

On the server where I had this issue, lshw takes only 3 seconds, so the two don't always seem to be related.

Maarten Boot (mboot-nospam) wrote :

We are currently testing a way around this by using device-independent paths, as found in SUSE:

parts of /boot/grub/menu.lst

## ## Start Default Options ##
## default kernel options
## default kernel options for automagic boot options
## If you want special options for specific kernels use kopt_x_y_z
## where x.y.z is kernel version. Minor versions can be omitted.
## e.g. kopt=root=/dev/hda1 ro
## kopt_2_6_8=root=/dev/hdc1 ro
## kopt_2_6_8_2_686=root=/dev/hdc2 ro

<<< NOTE THE CHANGE HERE
# kopt=root=/dev/disk/by-id/scsi-3600508e00000000046e7e17454e23c05-part1 ro vga=791

## ## End Default Options ##

title Debian GNU/Linux, kernel 2.6.26-2-amd64
root (hd0,0)
<<< NOTE THE CHANGE HERE
kernel /boot/vmlinuz-2.6.26-2-amd64 root=/dev/disk/by-id/scsi-3600508e00000000046e7e17454e23c05-part1 ro vga=791 quiet
initrd /boot/initrd.img-2.6.26-2-amd64

title Debian GNU/Linux, kernel 2.6.26-2-amd64 (single-user mode)
root (hd0,0)
<<< NOTE THE CHANGE HERE
kernel /boot/vmlinuz-2.6.26-2-amd64 root=/dev/disk/by-id/scsi-3600508e00000000046e7e17454e23c05-part1 ro vga=791 single
initrd /boot/initrd.img-2.6.26-2-amd64

### END DEBIAN AUTOMAGIC KERNELS LIST

And accordingly /etc/fstab

We have been rebooting since this morning; so far it works.
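
To find the by-id name that corresponds to a given partition, something along these lines may help. It is a sketch only: by_id_for is a hypothetical helper, it assumes readlink -f is available (GNU coreutils or busybox), and the second argument exists purely so the lookup directory can be overridden.

```shell
#!/bin/sh
# Hypothetical helper: print every symlink in a by-id style directory
# (default /dev/disk/by-id) that resolves to the given device node.
by_id_for() {
    dev=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example:
#   by_id_for /dev/sda1
```

A printed path can then be used for root= in the boot entries and for the matching line in /etc/fstab, as in the menu.lst excerpt above.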

Fabio Marconi (fabiomarconi) wrote :

Hello
Is this problem still present with the latest updates?
Thanks
Fabio

Changed in ubuntu:
status: New → Incomplete
Hussein Abdallah (abdallah98) wrote :

It is a production server so I can't remove the rootdelay and reboot it without being physically beside the server. I will do it whenever I have an opportunity.

Fabio Marconi (fabiomarconi) wrote :

Hello
Can someone run apport-collect 579572?
Thanks
Fabio

affects: ubuntu → linux (Ubuntu)
tags: removed: 10.04 busybox dell initramfs lynx mptsas poweredge r410 rootdelay ubuntu
Hussein Abdallah (abdallah98) wrote :

Do I need to install the latest updates before running apport-collect 579572?

Fabio Marconi (fabiomarconi) wrote :

Hello Hussein
Yes, if you can, update. If the issue is fixed with the latest kernel, there's no need to run apport-collect; just report here.
If the issue is still present, then run apport-collect 579572.
Thanks
Fabio

Marc Powell (marc-powell) wrote :

Hi Fabio!

I can confirm that this is happening on the latest kernel release (linux-image-2.6.32-27-server 2.6.32-27.49) with an R410. I've tried running apport-collect but I am not able to proceed past the Launchpad Login Service. I can enter my username and password but the 'Continue' option is not a live link. I can only cancel out. Can I get this information to you another way?

Daniel Gary (dgary1980-yahoo) wrote :

Hey Fabio,

I'm seeing the same problem with a new install of Maverick.

The only way to get around it is to set rootdelay.

tags: added: maverick
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Hussein Abdallah, thank you for taking the time to report this bug and helping to make Ubuntu better. Please execute the following command, as it will automatically gather debugging information, in a terminal:
apport-collect 579572
When reporting bugs in the future please use apport by using 'ubuntu-bug' and the name of the package affected. You can learn more about this functionality at https://wiki.ubuntu.com/ReportingBugs.

description: updated
tags: added: needs-kernel-logs needs-upstream-testing
removed: server
Changed in linux (Ubuntu):
importance: Undecided → Low
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired