disk detection is real slow with some hardware (timout shell drops)

Bug #278176 reported by Michael Hipp on 2008-10-04
This bug affects 8 people
Affects                     Importance   Assigned to
initramfs-tools (Ubuntu)    Undecided    Unassigned
linux (Ubuntu)              Undecided    Unassigned
mdadm (Ubuntu)              Undecided    Unassigned
mountall (Ubuntu)           Wishlist     Unassigned

Bug Description

I have installed Intrepid server i386 beta on a Dell PowerEdge 600SC. Everything seemed to install fine but upon boot it always drops into the BusyBox shell. The RAID is *not* degraded. At the BusyBox prompt if I type 'exit' it will proceed to boot normally and both RAID1 drives show healthy with all members. I have tried the install twice.

This has been shown to be a timeout issue with slow hardware.

* initramfs: The default rootdelay may need to be larger, and event-driven upstart should be used.

* linux: The long delay may have its cause here.

* mdadm: Some RAID arrays may take minutes to come up while regular ones are quick, so the wait should be handled gracefully:
* This functionality is similar to what mountall (the temporary tool) provides, and could most easily be added with upstart events:

      "NOTICE: /dev/mdX, required for the root filesystem, did not come up within the last 10 seconds.

      We will continue to wait up to a total of xxx seconds, in compliance with the ATA spec,
      before attempting to start the array degraded.
      (You can lower this timeout by setting the rootdelay= parameter.)

      <countdown> seconds to go.

      Press [ESC] to stop waiting and to enter a rescue shell."

* see https://wiki.ubuntu.com/ReliableRaid
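
The proposed behaviour could be sketched roughly as the polling loop below. This is only an illustration: `wait_for_root`, its arguments and the message wording are hypothetical, it is not the actual initramfs or mountall code, and the [ESC] handling is left out.

```shell
# Sketch only: poll for a device node, printing a countdown notice once an
# initial grace period has passed. wait_for_root is a hypothetical helper.
wait_for_root() {
    dev=$1          # device node to wait for, e.g. /dev/md0
    timeout=$2      # total number of seconds to wait
    grace=${3:-10}  # seconds to wait silently before the first notice
    waited=0
    # -e is used so the sketch can be tried on any path; a real script
    # would more likely test for a block device with -b.
    while [ ! -e "$dev" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Gave up waiting for $dev after $timeout seconds." >&2
            return 1
        fi
        if [ "$waited" -ge "$grace" ]; then
            echo "NOTICE: $dev is not up yet, $((timeout - waited)) seconds to go." >&2
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (not run here): wait_for_root /dev/md0 120 || echo "would drop to a shell"
```

The point of the design is that a fast system never pays the full delay, because the loop returns as soon as the device appears, while slow hardware gets a generous upper bound.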

Dustin Kirkland  (kirkland) wrote :

Michael-

Can you also:
 $ cat /etc/initramfs-tools/conf.d/mdadm
and post here.

:-Dustin

Dustin Kirkland  (kirkland) wrote :

Also, would it be possible to post any messages on the console before dropping to a busybox shell? (Worst case, break out a digital camera and snap a shot of the monitor and post here.)

I'm curious if it's dropping to a busybox shell erroneously because it thinks the RAID is degraded, or if there's some other reason.

:-Dustin

Michael Hipp (michael-hipp) wrote :

"Also, would it be possible to post any messages on the console before dropping to a busybox shell?"

I think that's what these are, from above. Or did I misunderstand?
------------------------------------------
The console messages just before dropping into BusyBox are like this (transcribed):
    MP-BIOS bug: 8254 timer not connected to IO-APIC
    Loading, please wait...
    uvesafb: failed to execute /sbin/v86d
    uvesafb: make sure that the v86d helper is installed and executable
    uvesafb: Getting VBE info block failed (eax=0x4f00, err=-2)
      <pause for several seconds>
    Gave up waiting for root device. Common problems:
      <snip>
    ALERT! /dev/md0 does not exist. Dropping to a shell!

Michael Hipp (michael-hipp) wrote :

# cat /etc/initramfs-tools/conf.d/mdadm
# mdadm boot_degraded configuration
#
# You can run 'dpkg-reconfigure mdadm' to modify the values in this file, if
# you want. You can also change the values here and changes will be preserved.
# Do note that only the values are preserved; the rest of the file is
# rewritten.
#
# BOOT_DEGRADED:
# Do you want to boot your system if a RAID providing your root filesystem
# becomes degraded?
#
# Running a system with a degraded RAID could result in permanent data loss
# if it suffers another hardware fault.
#
# However, you might answer "yes" if this system is a server, expected to
# tolerate hardware faults and boot unattended.

BOOT_DEGRADED=true

Michael Hipp (michael-hipp) wrote :

I just tried the same install on an ASUS desktop board with an nVidia chipset, one IDE drive, and one SATA drive. It loaded and booted fine. So the problem may be related to the specific hardware in the Dell PowerEdge.

Sandro Mani (sandromani) wrote :

I get the same problem as mentioned above, using Intrepid Beta x64, hardware:
ASUS NCCH-DL, Adaptec 29320-R U320 SCSI Controller, RAID0 of two HDDs.
Note: the boot partition is on a separate drive, not in the RAID array.

Michael Hipp (michael-hipp) wrote :

Another data point: I just re-loaded the aforementioned Dell PowerEdge 600SC using Intrepid server beta but *no* RAID drives. The drive layout was a simple:

  /dev/sda1 / (root)
  /dev/sda2 <swap>

The problem still persists. It boots directly into BusyBox saying it can't find root on /dev/disk/by-uuid/hex-stuff.

Note that it has an Adaptec 39160 controller.

With the info by Sandro Mani above it looks to me like this might be related to the Adaptec controllers. A kernel module perhaps?

Michael Hipp (michael-hipp) wrote :

Another data point: I just installed on a different system with an Adaptec 29160 and no problems. Boots fine. I note that lspci identifies it as

01:04.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)

Which is different than the 39160.

Michael Hipp (michael-hipp) wrote :

Well, I seem to have solved the issue, sort of.

Going from a clue given by the drop to BusyBox, I added rootdelay=40 to my grub kernel line. It now boots reliably.

So my kernel line in /boot/grub/menu.lst now looks something like this:

kernel /boot/vmlinuz-2.6.27-4-server root=UUID=401eb0e1-f624-4d86-a1a4-47374ba9a556 rootdelay=40 ro quiet splash

I determined this magic number by trial and error. It worked with 60; 30 wasn't enough. I think I'll raise it to a much higher number just to make sure it will always boot when I'm not here to babysit.

But this strikes me as a hack. Is there something bad going on that it should take so very long to gain access to a disk when, in fact, it is already reading from the disk (since grub is running)?
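
To confirm that the parameter actually reached the kernel after a reboot, the command line can be inspected. The following is a minimal sketch; `get_rootdelay` is a hypothetical helper name, not an existing tool.

```shell
# Sketch only: print the value of rootdelay= from a kernel command line.
# get_rootdelay is a hypothetical helper, not part of Ubuntu.
get_rootdelay() {
    # $1: a kernel command line, e.g. the contents of /proc/cmdline
    for word in $1; do
        case "$word" in
            rootdelay=*)
                # Strip the "rootdelay=" prefix and print the number.
                echo "${word#rootdelay=}"
                return 0
                ;;
        esac
    done
    return 1  # no rootdelay= parameter present
}

# On a running system (not run here):
#   get_rootdelay "$(cat /proc/cmdline)" || echo "rootdelay not set"
```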

Divan (divan-santana) wrote :

I had the same problem.

Adding rootdelay=40 fixed it for me too.

This occurred for me when I upgraded from Hardy to Intrepid.

Sandro Mani (sandromani) wrote :

I'd suggest it is a general issue with Adaptec (SCSI?) controllers: I get the same issue with an Adaptec 29160N U160 SCSI controller (non-RAID). Can anyone confirm this?

Coz (cosimo321) wrote :

Hey all,
 I had the same issue here with my Adaptec 39160 controller and the Ubuntu Intrepid release. This has happened several times over the course of new Ubuntu releases; however, it's interesting that it rarely occurred during the pre-release periods!
  The rootdelay=40 "fixed" the issue...
 coz

Mike Hicks (hick0088) wrote :

I saw this as well. lspci reports that I have an Adaptec AHA-3960D controller. Since the device got detected a few seconds after the "(initramfs)" prompt appeared, I was able to boot the system by just typing "exit". I guess the startup scripts will re-scan for the boot device after the busybox shell exits (the first time, anyway).

Isn't the "correct" fix to add this to the commented "kopt=..." line like this?

## ## Start Default Options ##
## default kernel options
## default kernel options for automagic boot options
## If you want special options for specific kernels use kopt_x_y_z
## where x.y.z is kernel version. Minor versions can be omitted.
## e.g. kopt=root=/dev/hda1 ro
## kopt_2_6_8=root=/dev/hdc1 ro
## kopt_2_6_8_2_686=root=/dev/hdc2 ro
# kopt=root=UUID=d5373d24-cbe4-46be-a4cd-b3457985915a ro rootdelay=40

...and then, run the update-grub tool? This way, the changes you've made won't get overwritten the next time the update-grub script gets run during a kernel install/upgrade.
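
As a rough illustration of that approach, the edit to the commented kopt line could be scripted as below. This is a sketch that assumes a GRUB legacy menu.lst laid out like the excerpt above; `add_rootdelay_to_kopt` is a hypothetical helper name, and it keeps a `.bak` copy of the file.

```shell
# Sketch only: append rootdelay=N to the "# kopt=" line of a GRUB legacy
# menu.lst, unless that line already carries a rootdelay= setting.
# add_rootdelay_to_kopt is a hypothetical helper.
add_rootdelay_to_kopt() {
    file=$1   # path to menu.lst
    delay=$2  # delay in seconds
    # Only the line starting with "# kopt=" (one hash) is touched; the
    # "## ..." example lines above it are left alone.
    sed -i.bak "/^# kopt=/{/rootdelay=/!s/\$/ rootdelay=${delay}/;}" "$file"
}

# Example, as root (not run here):
#   add_rootdelay_to_kopt /boot/grub/menu.lst 40 && update-grub
```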

husfeldt (thomas-husfeldt) wrote :

Yes, on a Compaq nc8430, rootdelay=130 does it as well.
It can be set in the boot menu (for not-so-technical users):

Press 'e', choose the "kernel" line, add rootdelay=130, press 'esc', then press 'b'.

Pawel Tecza (ptecza) wrote :

I have an Adaptec SCSI controller too, on my old Dell PowerEdge 1550 server, and the same
problem booting Intrepid. But it's an Adaptec AIC-7899P U160/m (rev 01).

I have the / partition on RAID1 and LVM. More details are in my comment on
bug LP #290153 (https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/290153).

The rootdelay=90 workaround works for me, although I don't need to wait 90 seconds.
About 20-30 seconds is enough.

Trebacz (david-trebacz) wrote :

I'm seeing the same issue on Intel Server Board SE7501WV2 which has an Adaptec AIC-7902 SCSI controller on board. rootdelay=40 worked fine for my setup.

Jeff Kowalczyk (jfkw) wrote :

Data point: I see the same behavior on an old Celeron 633 MHz Compaq Presario with a single IDE drive ;)

rootdelay=40 fixes it; 20 did not.

Reece (reece) wrote :

I've got the same problem on an old Tyan S2463UNG w/dual Athlons running Intrepid. Although this mobo has Adaptec SCSI, no SCSI drives are installed. The box has two IDE DVD/CD drives and a Promise TX4 w/3 SATA drives. I also upgraded to Intrepid from Hardy.

rootdelay=40 apparently solved the problem.

I have "block devices found" on mdadm RAID 10 and mdadm RAID 1.
rootdelay=40 helps.

Very helpful topic.

metastable (info-metastable) wrote :

I don't have a RAID array on my server, but adding rootdelay works:
kernel /boot/vmlinuz-2.6.28-13-server root=UUID=d1f6f883-efff-43cd-af15-110b79b02bce rootdelay=90 ro quiet splash

SYSTEM:
Ubuntu 9.04
Linux srv########### 2.6.28-13-server #45-Ubuntu SMP Tue Jun 30 20:51:10 UTC 2009 i686 GNU/Linux
01:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

Best regards,
Stijn Verholen

ceg (ceg) wrote :

Maybe this is actually a kernel / module issue. Are those controllers/disks also this slow with other operating systems?

Still valid with current releases?

summary: - Intrepid ubuntu server won't boot RAID1
+ disk detection is real slow with some hardware (timout shell drops)
ceg (ceg) wrote :

Concerning RAID degradation: some RAID arrays may take minutes to come up while regular ones are quick, so the wait should be handled gracefully:

      "NOTICE: /dev/mdX, required for the root filesystem, did not come up within the last 10 seconds.

      We will continue to wait up to a total of xxx seconds, in compliance with the ATA spec,
      before attempting to start the array degraded.
      (You can lower this timeout by setting the rootdelay= parameter.)

      <countdown> seconds to go.

      Press [ESC] to stop waiting and to enter a rescue shell."

* This functionality is similar to what mountall (the temporary tool) provides, and could most easily be added with upstart events.

* see https://wiki.ubuntu.com/ReliableRaid

ceg (ceg) on 2010-03-29
description: updated
Changed in mountall (Ubuntu):
importance: Undecided → Wishlist
status: New → Triaged
tags: added: review-request
Thomas Orgis (thomas-forum) wrote :

Just want to chime in to confirm what has been said so far. I experienced this some Ubuntu releases ago (8.04, perhaps) with my Adaptec 39160. It fails to find root; a big rootdelay helps.
I initially blamed this on Ubuntu because my self-built kernels did not show this behavior. I always built the SCSI driver into the kernel and also used root=/dev/sda2 instead of root=UUID=... .

Just recently I observed that the issue vanishes when replacing the dual-channel 39160 with a single-channel 29160 (I only used one channel anyway). So, is there some confusion with the initial scanning of two SCSI channels?
Perhaps the kernel reports that it's done after only one channel? But then, it apparently works when the driver is built into the kernel.

Unsure why there is a bug open on mountall here, since mountall already does precisely this.

If the filesystem doesn't show up, it shows a message that allows the user to skip it, get a shell to fix it up, or just wait.

Changed in mountall (Ubuntu):
status: Triaged → Fix Released
status: Fix Released → Invalid
Coz (cosimo321) wrote :

Hey guys,
Apparently, this is an ongoing problem ONLY with Ubuntu.
I commented on this above back in 2008. I have tried almost every distribution... not one of them has this issue with SCSI drives and the Adaptec 39160 controller.
This IS specific to Ubuntu and at this point, since I have been using Ubuntu almost exclusively since its initial release, a real pain in the butt! :)
I do *not* want to have rootdelay=40... it is painful to wait nearly that long.
This is a serious bug, and it needs to be addressed!
Why only Ubuntu, particularly since Ubuntu also has a server edition, and many servers are still using SCSI, more likely than not with an Adaptec SCSI controller card?
This has been reported on this bug report since 2008!
In 5 months the next Ubuntu release will be out!
I hope someone makes this a priority for the next release. I am not interested in fancy plymouth splashes or grub2 advances, etc. Fix this bug, please! It has been around for too long.

coz

ceg (ceg) wrote :

Coz: So other distros don't take ages to detect your disks?

Maybe the boot option "raid=noautodetect" helps in your case? Bug #551719

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 278176

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in initramfs-tools (Ubuntu):
status: New → Confirmed
Changed in mdadm (Ubuntu):
status: New → Confirmed
Jarmo (jjarven) wrote :

I have exactly the same problem with different hardware.

The setup:
Two RAID 5 arrays, on 4 USB disks and 3 internal disks, with one lvm2 volume on top of one of the RAID 5 arrays.

The bootup goes to the recovery shell every time, but just by typing exit, the booting continues successfully.
I believe the Buffalo USB disk is the slowest one to be recognized, thus causing this problem. No solution though.

Stephan Sperber (quinte17) wrote :

Same problem here with Ubuntu 12.04.
Just the note "dropping to a shell", then BusyBox; after Ctrl+D the boot finishes fine.

Setup:
Intel Core i3, onboard, 2 SATA drives, RAID1 -> lvm2
PCI Promise IDE controller, 4 PATA drives, RAID5 -> lvm2

Phillip Susi (psusi) wrote :

You can change the timeout if you have unusual hardware that needs longer, so I don't think we're going to change the default since it works for the vast majority of machines. There are also other bugs already about how mdadm handles degraded activation, so I'm closing that task too. If there is particular hardware that seems to take longer than it should to scan that may be a driver problem, we might look into that. Please attach /var/log/kern.log.

Changed in initramfs-tools (Ubuntu):
status: Confirmed → Won't Fix
Changed in mdadm (Ubuntu):
status: Confirmed → Invalid
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired