kernel 2.6.32-32 makes machine hang during boot

Bug #576001 reported by CharlesA
This bug affects 11 people
Affects: linux (Ubuntu)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

After updating from kernel 2.6.32-31-server to 2.6.32-32-server, my machine just hangs at "fsck" and doesn't load any further.

I had to revert to 2.6.32-31-server so the machine would get past the first part of the boot process.

System specs:

Gigabyte EP45-UD3R
Pentium E6500
4GB DDR2 800 RAM

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-21-server 2.6.32-21.32
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-21.32-server 2.6.32.11+drm33.2
Uname: Linux 2.6.32-21-server x86_64
NonfreeKernelModules: rr26xx
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D2', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/pcmC0D1c', '/dev/snd/pcmC0D1p', '/dev/snd/pcmC0D2c', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info: Error: [Errno 2] No such file or directory
Card0.Amixer.values: Error: [Errno 2] No such file or directory
Date: Wed May 5 13:01:01 2010
HibernationDevice: RESUME=UUID=7a3e3184-da98-43fc-83be-a257fb727b50
InstallationMedia: Ubuntu-Server 10.04 LTS "Lucid Lynx" - Release amd64 (20100427)
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.

 vboxnet0 no wireless extensions.
MachineType: Gigabyte Technology Co., Ltd. EP45-UD3R
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-21-server root=UUID=078d51fd-f451-4c9d-94af-98d67453f0d2 ro quiet
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
RelatedPackageVersions: linux-firmware 1.34
RfKill:

SourcePackage: linux
dmi.bios.date: 08/31/2009
dmi.bios.vendor: Award Software International, Inc.
dmi.bios.version: F11
dmi.board.name: EP45-UD3R
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.type: 3
dmi.chassis.vendor: Gigabyte Technology Co., Ltd.
dmi.modalias: dmi:bvnAwardSoftwareInternational,Inc.:bvrF11:bd08/31/2009:svnGigabyteTechnologyCo.,Ltd.:pnEP45-UD3R:pvr:rvnGigabyteTechnologyCo.,Ltd.:rnEP45-UD3R:rvrx.x:cvnGigabyteTechnologyCo.,Ltd.:ct3:cvr:
dmi.product.name: EP45-UD3R
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Jeremy Foshee (jeremyfoshee) wrote :

Hi Charles,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
CharlesA (charlesa) wrote :

I tried using the kernel located here: http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33.1-lucid/linux-image-2.6.33-02063301-generic_2.6.33-02063301_amd64.deb

Tried installing the headers as well, but it failed with "missing dependencies."

As for right now, I am unable to even access GRUB, sitting at a blank screen with a blinking cursor.

CharlesA (charlesa) wrote :

After getting back into a workable system I backed up everything and then did a clean reinstall.

I ran apt-get dist-upgrade to get the new kernel and I now have a logon prompt. I guess it might have been a fluke, or maybe caused by one of the packages I had installed.

Any way to narrow it down?

tags: removed: needs-upstream-testing
Andreas Ntaflos (daff) wrote :

I don't think this was a fluke.

Reading the description and the symptoms, this seems to be exactly what we have been experiencing on three of our HP DL380 G6 servers. These three servers are identical (modulo disk capacity) and serve as our test beds for Ubuntu Server 10.04 (freshly installed as of yesterday). After a standard upgrade (using aptitude) to the new linux-image-2.6.32-22-server package, which was the only package upgraded, the system would almost always hang during the boot sequence after displaying the messages about fsck/filesystem health:

fsck from util-linux-ng 2.17.2
fsck from util-linux-ng 2.17.2
/dev/mapper/TEST02-root: clean, 52298/2924544 files, 458703/11696128 blocks
/dev/cciss/c0d0p1: clean, 205/124496 files, 32770/248832 blocks

After that it just hangs. We can reboot the server by means of ctrl-alt-delete as well as connect through SSH, interestingly. No getty processes are spawned, so no console login. Also interestingly this doesn't happen on an HP DL120 G5, which has an almost completely different hardware configuration. Virtual machines (KVM-based, using the -virtual flavour) also don't seem affected by this.

Reverting back to 2.6.32-21-server (aptitude remove linux-image-2.6.32-22-server) seems to fix the problem. I am pretty certain that the problem lies with the -22-server package as I have tested and reproduced this on three identical servers, as stated above.

How can I help debug this further? We still have about one week to run experiments and tests on these servers so I'll gladly help where I can.

CharlesA (charlesa) wrote :

Those are the same symptoms that I experienced. Only difference is that I wasn't able to connect via SSH (the box is running Samba, BIND9, DHCP3 and SSH, and a few other things, virtualbox for one).

Ctrl+Alt+Del to force a reboot was recognized and halted all the processes and rebooted the machine.

Do those HP servers run RAID?

The only thing I can think of is that on my server, I've got a 3 disk RAID-5 array (using a RocketRAID 2640X1 controller: http://www.highpoint-tech.com/usa/bios_rr2640.htm) for which I need to build the drivers from source after each kernel upgrade, and which had an entry in fstab. The main OS drive was a 320GB WD Caviar Blue, which wasn't set up for RAID.

What about Virtualbox? I had Virtualbox installed from the karmic repos over at Virtualbox.org.

Just grasping at straws here.

Also: I ended up doing a clean install of Lucid Server x64 and running apt-get dist-upgrade to get the new kernel before installing all the services. The upgrade went fine and I was able to get to a prompt after rebooting. After installing all the services, and rebooting, the RAID controller was recognized and everything loaded up without hanging.

I am quite puzzled.

Steven Schiebel (spschiebel) wrote :

I have had this same issue as well.

After it happened the first time, I did a clean install. I ran apt-get safe-upgrade to get the kernel updates. Installed mdadm and set up my raid 6 array without problems. Edited my /etc/mdadm/mdadm.conf and /etc/fstab files to automatically assemble and mount the array and did a reboot. The system booted up fine and the array was running. Next installed openssh-server and lvm2. Made no changes to ssh but created a PV on LVM2. I did not go any further than that as I wanted to continue my work via ssh from my desktop (server is headless). Did a reboot and it hung exactly as described above and in Andreas's post #5. I will try to revert to the previous kernel now and see if that brings me joy.

Andreas Ntaflos (daff) wrote :

This is becoming a bit scary. I have continued experimenting with the HP DL380 machines I have here and now even booting with the kernel image 2.6.32-21-server makes the machine hang.

I can still SSH into it and look at the process list, see attachment. Does anything in there look suspicious? I notice that

/sbin/plymouthd --mode=boot --attach-to-session

is running, which does not run when the machine has booted completely. We also don't run VirtualBox but the basic Libvirt/KVM setup to host virtual machines. All machines have LVM configured and have their root and swap partitions as logical volumes.

Some more background info: the machines are HP ProLiant DL380 G6 (apparently the most popular ProLiant machines by HP ever). All three have Smart Array P410 RAID controllers running RAID1 on two disks. The drivers for the controller are in the kernel, no need for any manual compilation. Obviously there is no need for mdadm in this case. That said, I don't think this bug has anything to do with RAID setups.

What else can we provide? I am afraid that if this problem isn't solved soon we have to revert back to Ubuntu 9.10, which I would really like to avoid. The improvements made in AppArmor and Libvirt in 10.04 are very important for us, plus we would like to take advantage of the LTS.

I still have about a week to play with these machines, after that we need to get them production ready. I hope this bug gets the attention it needs. Please do tell if we can provide any more information.
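
The state described in this comment (plymouthd still running, no getty spawned) could be checked over SSH with something like the Python sketch below. It is only an illustration, not something anyone in this thread ran: the function names are made up, and it assumes a Linux /proc layout and that a fully booted machine has at least one getty process.

```python
import os


def running_cmdlines(proc_root="/proc"):
    """Yield the command line of every process visible under proc_root."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # process exited or is not readable
        if raw:  # kernel threads have an empty cmdline and are skipped
            yield raw.replace(b"\0", b" ").decode(errors="replace").strip()


def program_name(cmdline):
    """Return the basename of the first word of a command line."""
    words = cmdline.split()
    return os.path.basename(words[0]) if words else ""


def boot_looks_stuck(cmdlines):
    """True if plymouthd is still running and no getty has been spawned."""
    names = {program_name(cmdline) for cmdline in cmdlines}
    return "plymouthd" in names and "getty" not in names
```

On an affected machine, boot_looks_stuck(running_cmdlines()) would be expected to return True while the console sits at the fsck messages. It only confirms the symptom; it says nothing about the cause.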

CharlesA (charlesa) wrote :

Andreas Ntaflos:

Does the same hanging happen if you do a clean install, then upgrade to the -22 kernel?

Trying to see if the same workaround that worked for me, works for you.

Andreas Ntaflos (daff) wrote :

Charles, this is exactly what I did in the first place. Clean install of Ubuntu 10.04 Server, then upgrade to -22-server kernel. More often than not the system would not come up correctly, as described above.

CharlesA (charlesa) wrote :

Wish we had some direction as to what information was needed to help narrow this down. I was able to get mine back up and running after a clean install, but I am a bit reluctant to do any kernel upgrades for the time being, as I don't want to have to go thru that whole mess again.

Andreas Ntaflos (daff) wrote :

I just went through that whole mess again and reinstalled one of the machines. Unfortunately to no avail. Installation went speedy and fine, aptitude update and aptitude safe-upgrade as well (this includes installation of the new kernel image), rebooted three or four times and now it hangs again. It always looks like this: <https://daff.pseudoterminal.org/misc/hang01.jpg> and <https://daff.pseudoterminal.org/misc/hang02.jpg>. Both servers are identical in configuration and equipment.

I am running out of ideas. This seems to be extremely nondeterministic and is becoming a royal pain in the ass. I really don't want to have to go back to 9.10.

jgreenso (james-green-mjog) wrote :

I think I just reported this bug too - if someone feels confident enough that we are all in the same boat please mark mine as a duplicate of this.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/588799

jgreenso (james-green-mjog) wrote :

Jeremy: You're asking those affected to test the latest upstream kernel build. I've looked, and I can only spot -generic builds.

Are you asking us to test this specifically, given we seem to be working fine with the current released version of -generic anyway? It would seem more reasonable to have a -server upstream build to test instead, unless I'm missing something..?

Andreas Ntaflos (daff) wrote :

Interesting, I hadn't tested using non-server kernel images. Thanks for the suggestion jgreenso, I'll be trying that on our problematic machines.

jgreenso (james-green-mjog) wrote :

Almost two weeks since the last comment, the status of this bug remains 'Incomplete'.

It would be good to know what tests are needed to make it 'Complete'. Presumably someone maintaining the packages may hold a clue as to what to look at?

Unfortunately our machine has now had to go into production mode using the -generic kernel package so we cannot use it for testing, and we're certainly not going to switch to the -server package without good reason.

Jeremy Foshee (jeremyfoshee) wrote :

jgreenso,
   Apologies for the delay. Our process is to verify if there is a difference in behavior using a generic upstream build of the kernel with only minimal Ubuntu configs added. This enables us to see if there is a commit from stable that resolves an issue and can be included in kernel updates. Hence the reason for my request. I'll have to inquire about -server built upstream kernels as I do not think we currently build them in the same manner as our mainline kernels.

~JFo

CharlesA (charlesa) wrote :

Thanks for replying Jeremy.

I stopped messing with the machine after I found that doing an apt-get dist-upgrade (after a clean install) and before installing any other packages caused the problem to not occur. So far, I haven't had any other issues since then.

Jeremy Foshee (jeremyfoshee) wrote :

Charles,
    I'm glad you were able to get it to a known good state. Are you still updating regularly, and is the machine still booting normally?

Thanks!

~JFo

CharlesA (charlesa) wrote :

Jeremy,

Yeah, it's completely up-to-date and has been booting fine. I just don't know why it had a problem when I went to do the kernel update when I had packages already installed.

It's been up for around 48 days total (ever since May 5th, when I did the clean install and dist-upgrade) with periodic reboots due to updates but it hasn't had any problems as of yet.

Charles

Andreas Ntaflos (daff) wrote :

I just want to add that I unfortunately have not been able to test this further (not even using -generic instead of -server kernels) since the machines in question needed to go into production. We went back to Ubuntu 9.10 for that. If I'm lucky I can maybe get one or two of these machines back in probably a few weeks to try again. Depends on how the project goes.

In the meantime I see that 2.6.32-22 is still the current version of the kernel image so I don't suppose anything changed here.

Charles, are you possibly able to reboot your machine a few times (maybe five or more) in a row and see if the problems really don't reappear? When I had been testing I found that the occasional reboot (a day or two apart) generally did not trigger the hang but a few consecutive reboots one immediately after the other would. This could of course be total coincidence or BS but I really haven't been able to find any patterns here. Grasping at straws, as it were.

Maybe of note: I also did pretty much exactly what you describe in comment #18, more than a few times: a clean install and then aptitude safe-upgrade (not dist-upgrade). Alas, a few reboots later the machine would hang again.

Unfortunately I cannot really contribute much more at this point.

Changed in linux (Ubuntu):
status: Incomplete → Fix Released
CharlesA (charlesa) wrote :

Andreas:

I rebooted a few times a while ago, within 30 minutes of each other, then 8 hours later without any problems.

Here's my uprecords so far:

charles@thor:~$ uprecords
     #               Uptime | System                                     Boot up
----------------------------+---------------------------------------------------
     1    19 days, 23:57:52 | Linux 2.6.32-22-server    Wed May  5 21:31:57 2010
->   2    19 days, 00:46:21 | Linux 2.6.32-22-server    Fri Jun  4 11:31:50 2010
     3     8 days, 06:56:28 | Linux 2.6.32-22-server    Wed May 26 06:01:24 2010
     4     0 days, 22:32:16 | Linux 2.6.32-22-server    Thu Jun  3 12:58:43 2010
     5     0 days, 08:29:56 | Linux 2.6.32-22-server    Tue May 25 21:30:39 2010
     6     0 days, 00:32:34 | Linux 2.6.32-22-server    Wed May  5 20:00:36 2010
     7     0 days, 00:06:52 | Linux 2.6.32-22-server    Wed May  5 20:57:21 2010
     8     0 days, 00:06:48 | Linux 2.6.32-22-server    Wed May  5 20:33:49 2010
     9     0 days, 00:06:15 | Linux 2.6.32-22-server    Wed May  5 21:12:12 2010
    10     0 days, 00:01:20 | Linux 2.6.32-22-server    Wed May  5 20:45:42 2010
----------------------------+---------------------------------------------------
no1 in     0 days, 23:11:32 | at                        Thu Jun 24 11:29:41 2010
    up    48 days, 15:37:35 | since                     Wed May  5 20:00:36 2010
  down     0 days, 00:40:00 | since                     Wed May  5 20:00:36 2010
   %up               99.943 | since                     Wed May  5 20:00:36 2010

Have you tried using apt-get dist-upgrade instead of aptitude safe-upgrade? I don't know if there is any real difference between the two. Judging from the manual pages, it seems like apt-get dist-upgrade does the same thing as aptitude full-upgrade, but I'm not 100% sure of that.
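
As a sanity check on the uprecords output in this comment, the %up figure can be reproduced from the "up" and "down" rows. The sketch below assumes that %up is simply up / (up + down); that formula is an assumption that happens to match the numbers shown, and the helper names are made up for illustration.

```python
import re


def to_seconds(span):
    """Convert an uprecords-style span such as '48 days, 15:37:35' to seconds."""
    match = re.fullmatch(r"\s*(\d+) days?, (\d+):(\d+):(\d+)\s*", span)
    if match is None:
        raise ValueError(f"unrecognised span: {span!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def percent_up(up, down):
    """Percentage of time spent up, assuming %up = up / (up + down)."""
    up_s, down_s = to_seconds(up), to_seconds(down)
    return 100.0 * up_s / (up_s + down_s)


# Values copied from the "up" and "down" rows of the output above.
print(f"{percent_up('48 days, 15:37:35', '0 days, 00:40:00'):.3f}")  # prints 99.943
```

So the 40 minutes of recorded downtime since May 5 account for the whole gap to 100%.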

Andreas Ntaflos (daff) wrote :

Charles, thanks for the reply. I do not use dist-upgrade for regular package upgrades. However there should not be any difference between dist-upgrade (full-upgrade) and safe-upgrade in this case. The kernel package gets updated either way.

So you don't seem to see any problems even after a few consecutive reboots. I guess that's good to know (good for you!) but since nothing has changed, package-wise, in the last month-and-a-half I am not sure what to make of it.

Jeremy: Fix Released? Is this an error or is there really a solution to this problem? Where can I learn more about it?

CharlesA (charlesa) wrote :

Ah I see. I usually just use apt-get upgrade for regular package upgrades and dist-upgrade to get new kernels. I've not really used aptitude much, if at all. Thanks for the info on that.

Only thing that I know is that it worked when I did a dist-upgrade after a clean install and without installing any additional packages. Failed when I did the dist-upgrade after I installed the other packages. Could be a fluke, I don't know, since there are a couple other people who were affected the same way I was.

If you are able to get a machine installed with 10.04, maybe try doing an apt-get dist-upgrade after doing a clean install and see what happens. That's the only thing I can think of that might be worth a shot.

Andreas Ntaflos (daff) wrote :

Charles, I'll be trying your idea as soon as I get one of the machines back (next week probably). Maybe I can find some pattern :)

But I really don't think this is in any way "fixed". All we know is that there are at least five people affected by it and no solution has been posted or linked to other than "use a different kernel" and (paraphrasing) "it seems to be working now, but I have no idea why". One is a workaround and the other is basically ... nothing. Having this bug marked as "Fix Released" is either a mistake or some arbitrariness that I can not comprehend.

I've spoken to a colleague who also runs HP servers and he experienced apparently the exact same issue. The difference is that he decided that 10.04 is simply not working for them and that they'll check back in six months. Didn't even start investigating anything. So please, Jeremy or anybody, tell us what we can do to help debug this issue while we still have the possibility to do so.

Sorry if I sound desperate. I somewhat am.

jgreenso (james-green-mjog) wrote :

Andreas, have you tried a -generic kernel build? That's what's working on our two 10.04 servers, the -server kernel causes the problem for us.

Changed in linux (Ubuntu):
status: Fix Released → Incomplete
jgreenso (james-green-mjog) wrote :

I've changed the status given there's no evidence that the problem has been identified, let alone a fix released.

Andreas Ntaflos (daff) wrote :

jgreenso, I will try a -generic kernel image hopefully next week or the week after that, when I get my hands on one of the machines on which we experienced the problems.

CharlesA (charlesa) wrote :

I just stumbled across a thread that mentioned Ubuntu hanging if anything other than /proc, / and swap is in fstab: http://ubuntuforums.org/showthread.php?t=1480113

I don't know if that was the case for me, since I added my RAID array to fstab before I upgraded to the new kernel the first time, but afterwards I just upgraded to the new kernel before adding anything to fstab. Could be a coincidence, but maybe worth checking out.

jgreenso (james-green-mjog) wrote :

We have USB-attached hard drives so that would add another line to fstab. Does anyone know if the -server build has behaviour changes when it comes to fstab mount points?

CharlesA (charlesa) wrote :

I did a clean install using the x64 version of Lucid, kernel 2.6.32-31. The only thing I did before upgrading to the new kernel (2.6.32-32) was build the drivers for my RAID card (Highpoint Technologies RocketRAID 2640x1) and add an entry to fstab to mount the array on boot.

It's now sitting at the same thing it was in the original bug report: fsck /dev/sda1 clean...

I booted into the previous kernel and commented out the fstab entry for my RAID array and rebooted into the new kernel.

I was met with a logon prompt, as expected.

I hooked up an external USB hard drive to the machine, added an entry to fstab and rebooted. It came up fine.

Disconnected the USB drive and rebooted, it hung again. I tried rebooting into the previous kernel with the drive disconnected and it hung at fsck /dev/sda1 clean.. until I hit "s" then it continued to boot and came up as normal.

Booted into the new kernel and hit "s" even though it was just sitting at a blinking cursor, and arrived at a logon prompt.

Did the same after uncommenting the RAID volume, same thing happened. After hitting "s", I got "skipping /array at user's request"; that device is set to be mounted on /array in fstab.

My guess is that it's not showing the press "s" to skip, etc. prompt if it cannot find a device listed in fstab.

It sounds like something is preventing the OS from displaying that prompt when there is an entry in fstab but the disk doesn't exist.

There was a thread about a similar thing happening on Ubuntu desktop, except they were prompted to hit "s" or "m"
http://ubuntuforums.org/showthread.php?t=1520413

Also, I just wanted to note: I went thru many kernel upgrades on Karmic without having to comment out the fstab entry in order for the machine to boot. All I needed to do was build the drivers after I booted into the new kernel and then reboot and all was well.

I hope this helps; it sounds like the -server kernel handles mounting fstab differently somehow. I don't know.
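
Since the hang in this comment comes down to an fstab entry whose device is not present at boot, such entries could be listed before rebooting with a Python sketch along these lines. It is an illustration under stated assumptions, not a tool from this thread: the function names are invented, the /dev/sdb1 device and the filesystem types in the sample are hypothetical (only the root UUID and the /array mount point come from the report), and it only understands UUID=, LABEL= and /dev/ device specs.

```python
import os


def fstab_entries(text):
    """Yield (device_spec, mount_point) for each non-comment fstab line."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) >= 2:
            yield fields[0], fields[1]


def device_path(spec):
    """Map a device spec to a path that should exist, or None if not checkable."""
    if spec.startswith("UUID="):
        return "/dev/disk/by-uuid/" + spec[len("UUID="):].strip('"')
    if spec.startswith("LABEL="):
        return "/dev/disk/by-label/" + spec[len("LABEL="):].strip('"')
    if spec.startswith("/dev/"):
        return spec
    return None  # proc, tmpfs, network shares and similar are skipped


def missing_devices(text, exists=os.path.exists):
    """Return (device_spec, mount_point) pairs whose device cannot be found."""
    missing = []
    for spec, mount_point in fstab_entries(text):
        path = device_path(spec)
        if path is not None and not exists(path):
            missing.append((spec, mount_point))
    return missing


# Hypothetical fstab: the /array line stands in for the RAID volume.
SAMPLE = """\
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
UUID=078d51fd-f451-4c9d-94af-98d67453f0d2 / ext4 errors=remount-ro 0 1
/dev/sdb1 /array ext3 defaults 0 2
"""
```

On a real machine the check would be missing_devices(open("/etc/fstab").read()). Any pair it returns is a mount the boot may wait on, which matches the /array case here when the RAID driver has not been rebuilt for the new kernel yet.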

CharlesA (charlesa) wrote :

Tried the same test with Lucid Desktop i386, and the same thing happened, except that it prompted to hit "s" or "m."

See attachment.

I don't really want to have to comment out any entries in fstab before I upgrade the kernel, especially if I don't have physical access to the machine to hit "s" if I forget.

CharlesA (charlesa) wrote :

Upgraded to 2.6.32-24-server after commenting out the RAID array in fstab, rebooted into the new kernel and built the drivers, then mounted the array without any problems.

Is it a bug that there is no error message displayed when the OS cannot find a device listed in fstab?

Andreas Ntaflos (daff) wrote :

Charles, that's definitely a bug. I believe there have been efforts in Plymouth to fix it but you need the splash screen for those error messages to be displayed. A default server install, or one where the kernel is instructed to boot with as much text output as possible (GRUB_CMDLINE_LINUX_DEFAULT="text" in /etc/default/grub), apparently has no way to display such messages.

However, that seems to be a completely different issue altogether. In my experiences posted above there were no special or additional entries in /etc/fstab whatsoever and even pressing "s" or "m" when the server hung did not do anything. It was just a default install of the Ubuntu server edition.

On a related note: I've now finally gotten back two servers of the kind that had the problems described earlier in this thread. I will test them extensively this week. It also seems that running a -generic kernel really helps. One machine that had to go into production also runs a -server kernel but it is 2.6.32-23-server (not -22-server) and on the (very few) reboots I did on it no hangs would occur. So maybe this has fixed itself but I'll know more after this week.

CharlesA (charlesa) wrote :

Thanks Andreas. I've filed a bug report about the "no error message being displayed" part.

Keep us updated. :-)

I'm just glad I figured out why my server would appear to hang, and I hope that the problem has been fixed for you.

Wes Janzen (wes-janzen) wrote :

I'll just add that this is also a problem for me on an HP DL380G6 on 10.04 LTS server. Typically I eventually get in after several reboots, although I'm having trouble on this last go round.

Tom Marcoen (tom-azmei) wrote :

I experienced the same problem. I have the following server configuration:

HP DL380 G3
2 x Intel Xeon 2.8GHz
6GB RAM
6 x 36.4GB SCSI in a RAID5 array (no spares)

I did a standard installation (I checked OpenSSH server as the software to install) and did an apt-get update and apt-get upgrade. I rebooted the server and it hung giving almost the exact same output.

I came from the server room to google the problem and found this thread. I just went to the server room again to test pressing "m" or "s" as mentioned here, but when I arrived in the server room, the screen was purple, the text "Ubuntu 10.04" was showing and there were 4 white dots turning red and white again. After pressing Ctrl+Alt+Del several times, the server rebooted.

Now I got a login prompt. I logged in and immediately entered "sudo init 6" and after rebooting, I got the same error again. I'll see if I can test a generic kernel image tomorrow but I'm not sure I'll be able to find the time for it.

jgreenso (james-green-mjog) wrote :

Which version of the -server package do you have? And have you yet performed a dist-upgrade to get the latest -server packages?

Andreas Ntaflos (daff) wrote :

Chiming in again I am cautiously optimistic that with the latest -server packages (-23 and recently -24) the problem went away. I have installed and upgraded two identical DL380 G6 servers and performed many consecutive reboots since and have yet to experience any hangs. I just rebooted one of the machines three times in a row, no problem. My hopes are high.

Tom Marcoen (tom-azmei) wrote :

First I had the 2.6.32-21-generic-pae and now, after a dist-upgrade to 2.6.32-24-generic-pae the problem seems to be solved. I'll continue rebooting the server to see if it keeps coming up or crashes again.

Thomas Pedersen (tpede08) wrote :

Experiencing the exact same problem here, with a clean install + apt-get dist-upgrade. I seem to have problems with both the 2.6.32-21-server and 2.6.32-24-server. However, using the 2.6.32-24-generic seems to work fine, as others described. The server is also an HP Proliant DL380 G6. I also have an HP Proliant DL360 G5 around; however, it seems to boot fine without any problems. The server will be in production in less than a week, so I hope someone has found the solution to this problem. As long as the server is not in production, I can run tests on it, in case you have any suggestions.

.. btw sometimes I am able to boot from those kernels, but mostly I cannot.

CharlesA (charlesa) wrote :

Thanks for the info Thomas. I wonder what the differences between the -server and the -generic kernels are that could be causing the machines to hang.

Andreas Ntaflos (daff) wrote :

This bug is not a duplicate of #571444. At least not in the characteristic that I and others who reported here have experienced it. My /etc/fstab was always in the pristine state that is the result of a fresh installation so there were no filesystems for which automounting could have failed.

Additionally, as reported in #571444, pressing 'S' or 'M' when the system appeared to hang continued and finished the boot sequence correctly. Not so in the case of this bug, where nothing but Ctrl-Alt-Delete would have any effect on the hung system.

So this is definitely not a duplicate of #571444. The good news is that the problem was apparently isolated to the linux-image-2.6.32-22-server package, i.e. the server kernel image in version 2.6.32-22. We are now at 2.6.32-24 and I have several Ubuntu Server 10.04.1 machines in production, all working fine, even with regularly performed reboots. The same machines would have the problem described at length by us in this bug report.

Thomas, do you still experience any hangs during boot with a current -server image package?

CharlesA (charlesa) wrote :

Thanks for the update Andreas. Glad to hear that the 2.6.32-24 server kernel isn't causing any more problems for you.

Thomas Pedersen (tpede08) wrote :

I agree with Andreas, the fstab setup we are using here is the default, and only contains local disks. Furthermore I see no such error message as "mountall: Filesystem could not be mounted: ...".

We still have a server or two offline, due to lack of networking, so I will try to see if I can sneak in the server image package on one of them, and see if it works -- but I cannot do anything before monday.

Fionn (fbe) wrote :

Do you happen to have /var on a separate partition?

If yes, try moving the /var entry in fstab up to the top and let us know if it helps.
I suspect this has something to do with ureadahead in initialisation mode and /var not being available for ureadahead data storage or so.
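
For anyone who wants to try this suggestion, the reordering could be done on a copy of fstab with a Python sketch like the one below. It only illustrates moving the /var line above the other entries; the function name and the sample devices are made up, and whether the reordering helps at all is the suspicion stated in this comment, not an established fix.

```python
def move_mount_first(fstab_text, mount_point="/var"):
    """Return fstab_text with the entry for mount_point moved above all other entries."""

    def is_entry(line):
        stripped = line.strip()
        return bool(stripped) and not stripped.startswith("#")

    def is_target(line):
        fields = line.split()
        return is_entry(line) and len(fields) >= 2 and fields[1] == mount_point

    lines = fstab_text.splitlines()
    target = [line for line in lines if is_target(line)]
    if not target:
        return fstab_text  # nothing to move
    rest = [line for line in lines if not is_target(line)]
    # Keep any leading comments in place and insert before the first real entry.
    first_entry = next((i for i, line in enumerate(rest) if is_entry(line)), len(rest))
    reordered = rest[:first_entry] + target + rest[first_entry:]
    return "\n".join(reordered) + ("\n" if fstab_text.endswith("\n") else "")


# Hypothetical fstab with /var on its own partition.
SAMPLE = """\
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/sda1 / ext4 errors=remount-ro 0 1
/dev/sda5 /var ext4 defaults 0 2
"""
```

The function returns new text rather than editing /etc/fstab in place, so the result can be reviewed (and the original backed up) before anything is written.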

Fionn (fbe) wrote :

Oh, well, for the record:
We encountered this problem with linux-image-2.6.32-24-server.
It seems to have been fixed by moving /var up to the top in fstab.

Fionn (fbe) wrote :

See also: bug #638228

Andreas Ntaflos (daff) wrote :

Fionn, glad you solved your problem, but it is not this particular problem.

I want to remind everyone that any kind of mounting problems have nothing to do with the issue this bug report describes, namely that at least one specific linux-image package (2.6.32-22-server) caused the boot process to hang in an early stage. Problems with mountall, missing console messages and separate partitions in /etc/fstab do not factor in at all here.

This is evident from the facts that a) /etc/fstab contains no additional entries that could fail to mount and b) pressing 's' or 'm' does not result in the boot process continuing regularly.

So please, someone who has the power to do so, unmark this bug as a duplicate of #571444 or any other mountall/plymouth related bug.

Fionn (fbe) wrote :

Andreas, please note that I did not report a mount problem but (suspectedly) some sort of a race condition.
Pressing 's' or 'm' did NOT help in our case either! - Presumably because ureadahead hung itself up when trying to write data before storage was available.

Andreas Ntaflos (daff) wrote :

Ah, I see. I must have misunderstood your post, sorry. It is interesting and disturbing that there seem to be so many conditions that could result in a server not booting correctly.

Fionn (fbe) wrote :

We had the issue with two more servers today, one of them did not have a separate /var partition.
The real reason for the immediate death of the boot process was /etc/default/locale which contained the line

LC_ALL

without further assignment. This led to the following boot message:

[ 1.779805] kjournald starting. Commit interval 5 seconds
[ 1.779828] EXT3-fs: mounted filesystem with ordered data mode.
Begin: Running /scripts/local-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
Done.
/etc/default/locale: 9: LC_ALL: not found
init: mountall main process (311) terminated with status 127

With the default quiet / splash on you end up with a black screen and nothing happens. Pressing keys does not help. So this is probably a totally different problem than the one originally reported. (And I have no idea where to file this properly)
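
A bare word like the LC_ALL line above can be caught before a reboot by checking that every non-comment line in /etc/default/locale is a NAME=value assignment. The Python sketch below does that under the assumption (suggested by the "LC_ALL: not found" message) that the file is read by a shell, so any other line would be run as a command. The function name and the sample content are made up for illustration.

```python
import re

# A shell-style assignment, optionally prefixed with "export".
ASSIGNMENT = re.compile(r"^(export\s+)?[A-Za-z_][A-Za-z0-9_]*=.*$")


def bad_locale_lines(text):
    """Return (line_number, line) for every line that is not an assignment."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not ASSIGNMENT.match(stripped):
            problems.append((number, stripped))
    return problems


# Hypothetical file content reproducing the bare LC_ALL line.
SAMPLE = 'LANG="en_US.UTF-8"\nLC_ALL\n'
print(bad_locale_lines(SAMPLE))  # prints [(2, 'LC_ALL')]
```

On a real machine the input would be the contents of /etc/default/locale. An empty result means every line is at least syntactically an assignment; it does not check that the locale names themselves are valid.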
