failed kernel update followed by boot failure

Bug #107249 reported by Dwayne Nelson on 2007-04-17
Affects: lilo (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I installed Feisty from the 4/14/07 alternate CD (to support software RAID) and was able to boot thanks to the fix for bug #103177.

The system worked fine for a day or so, then the update manager offered kernel 2.6.20-15, which failed to install (I am not sure why). The failed install apparently broke my LILO boot setup, and now I am unable to boot. When I try to boot, I see the following message:

  "EBDA is big; Kernel setup stack overlaps LILO second stage"

System is configured as follows:

4 SATA drives:
  1x74GB containing Windows XP;
  3x250GB configured as 2 RAIDs:
    md0: RAID1: 10GB:
      /
    md1: RAID5: 480GB: LVM:
      /home
      /swp
      /tmp

My guess is that the updater didn't understand how to update LILO on the RAID1 volume. The "recovery" install apparently does not know how to do this either: when I try to fix it, I am not offered my md0 partition. Instead, I get only home, swp, and tmp (from md1), the component partitions of the RAIDs, and the XP drive.

Because I use LILO rather than GRUB (I think this was necessary to boot from RAID), I do not have any boot-menu options. I do not know of any other way to boot and I would like to avoid wiping my md0 volume because it (hopefully still) has important information in /etc/openvpn, /etc/apache2, and /var.

Dwayne Nelson (edn2) on 2007-04-18
description: updated
Dwayne Nelson (edn2) wrote :

I expect to install the release version of Feisty tomorrow evening - I've decided that having a running system is far more important than preserving the configuration data stored on /etc/openvpn, /etc/apache2, and /var (addressing the repair/restore install issue). However, I still fear that the update manager will break my system again the next time it tries to install a new kernel.

Michael Vogt (mvo) wrote :

Thanks for your bug report.

Can you please attach the files in /var/log/dist-upgrade? This will help us diagnose the problem.

Thanks,
 Michael

Changed in update-manager:
status: Unconfirmed → Confirmed
status: Confirmed → Needs Info
Dwayne Nelson (edn2) wrote :

Right now I don't have access to md0 which contains that directory. If I could install RAID support after booting from the live CD, then I might be able to access these files.

Michael Vogt (mvo) wrote :

Installing RAID support on the live CD should work, no? I'm not an expert on RAID, but the live CD does allow installing packages, changing sources.list, etc.

Dwayne Nelson (edn2) wrote :

The /var/log/dist-upgrade directory is empty.

BTW, I mounted md0 with the following steps:
  installed mdadm (not positive this was necessary)
  created directory /mnt/md0
  typed "sudo mount -t auto /dev/md0 /mnt/md0"
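
The mount steps above can be sketched as a small shell function. This is only a sketch under assumptions: the `mdadm --assemble --scan` step is not in the original comment (the array may already have been assembled once mdadm was installed), and the `run` helper and `DRY_RUN` switch are hypothetical names added so that the commands are printed rather than executed by default.

```shell
#!/bin/sh
# Sketch: mount the RAID1 root (md0) from a live CD session.
# DRY_RUN defaults to 1: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

mount_md0() {
    mnt="${1:-/mnt/md0}"
    run sudo apt-get install -y mdadm      # may be unnecessary, as noted above
    run sudo mdadm --assemble --scan       # assumption: assemble arrays from their superblocks
    run sudo mkdir -p "$mnt"
    run sudo mount -t auto /dev/md0 "$mnt"
}
```

Calling `mount_md0` prints the four commands for review; with `DRY_RUN=0` set beforehand they are actually executed.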

Dwayne Nelson (edn2) wrote :

I have re-installed using the release version.

I did save all of the contents from my /var/log directory so any files that would have been there are still available.

Also, I recovered my Firefox session where I was searching for an answer after the upgrade failure and before the reboot. The string I had been searching on was:

  "Fatal: First sector of /dev/sdc1 doesn't have a valid boot signature"

I note that sdc1 (not sure which drive it was referring to) shouldn't have had a boot signature as I was booting from md0.

Dwayne Nelson (edn2) wrote :

The latest kernel update worked fine so I am assuming that the problem no longer exists. This report should be closed.

Dwayne Nelson (edn2) wrote :

With the last kernel update, I think I've encountered the problem again. My guess is that this machine will not be able to reboot so I hope to keep it running for long enough to provide any log files that might be related. To start with, I'm attaching a screenshot which shows the error message.

Dwayne Nelson (edn2) wrote :

I think this has something to do with my RAID, for whatever reason. "cat /proc/mdstat" produces the following output, suggesting md0 is not fully functional.

  Personalities : [raid1] [raid6] [raid5] [raid4]
  md1 : active raid5 sdc2[0] sdb2[2] sdd2[1]
        468856832 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

  md0 : active raid1 sdd1[1] sdb1[2]
        9767424 blocks [3/2] [_UU]

  unused devices: <none>

The following URLs seemed to support this theory:

  http://www.issociate.de/board/post/28744/lilo_boot_problems.html
  http://wpkg.org/index.php/Software_RAID_in_Linux
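
A degraded array shows up in /proc/mdstat as an underscore in the status map (`[_UU]` in the output above, next to `[3/2]` for two of three members active). A minimal sketch for spotting this before accepting a kernel update follows; the function name `degraded_arrays` is made up for illustration.

```shell
#!/bin/sh
# Print the name of every md array whose status map shows a missing member.
# Reads /proc/mdstat-formatted text from the file given as $1
# (default: /proc/mdstat).
degraded_arrays() {
    awk '
        /^ *md[0-9]+ :/ { name = $1 }              # remember the array being described
        /blocks/ && match($0, /\[[U_]+\]/) {       # status map such as [_UU]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the output quoted above, this prints only the array with the `[_UU]` map; an empty result means every listed array has all members present.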

Dwayne Nelson (edn2) wrote :

So I added sdc1 back to the md0 array. This was enough to get me past the errors in the upgrade and everything checks out ok. I guess I'll really know when I next try to reboot.

It is not clear to me why sdc1 was dropped from the array ... I will look for a log to see if I can find an explanation. sdc2 remained in the second array so I know that the drive itself is functional.
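
The comment does not say which command was used to put sdc1 back. One common way, shown here only as a sketch, is `mdadm --manage ... --add`; the `run` helper, the `DRY_RUN` switch, and the `readd_member` name are hypothetical and exist only so the commands are printed instead of executed by default.

```shell
#!/bin/sh
# Sketch: return a dropped partition to a RAID array and show the rebuild state.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

readd_member() {
    array="$1"      # e.g. /dev/md0
    member="$2"     # e.g. /dev/sdc1
    run sudo mdadm --manage "$array" --add "$member"
    run cat /proc/mdstat    # the array should show a recovery in progress, then [UUU]
}
```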

Here is the output from apt-get install:

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libgoffice-0-common gdesklets-data gnome-cups-manager
Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
1 not fully installed or removed.
Need to get 0B of archives.
After unpacking 0B of additional disk space will be used.
Setting up linux-image-2.6.20-16-generic (2.6.20-16.29) ...
Running depmod.
update-initramfs: Generating /boot/initrd.img-2.6.20-16-generic
Warning: '/proc/partitions' does not match '/dev' directory structure.
    Name change: '/dev/dm-0' -> '/dev/horizon/swp'
Added Linux *
Added LinuxOLD
Added Windows
The Master boot record of /dev/sdc has been updated.
Warning: /dev/sdd is not on the first disk
The Master boot record of /dev/sdd has been updated.
Warning: /dev/sdb is not on the first disk
The Master boot record of /dev/sdb has been updated.
Not updating initrd symbolic links since we are being updated/reinstalled
(2.6.20-16.28 was configured last, according to dpkg)
Not updating image symbolic links since we are being updated/reinstalled
(2.6.20-16.28 was configured last, according to dpkg)
You already have a LILO configuration in /etc/lilo.conf
Running boot loader as requested
Testing lilo.conf ...
Testing successful.
Installing the partition boot sector...
Running /sbin/lilo ...
Installation successful.

Dwayne Nelson (edn2) wrote :

The latest kernel update has failed. This time the message that I receive is "Fatal: First sector of /dev/sda1 doesn't have a valid boot signature". This does not appear to have anything to do with the RAID - mdstat shows all volumes are up:

  Personalities : [raid1] [raid6] [raid5] [raid4]
  md1 : active raid5 sda2[0] sdd2[2] sdb2[1]
        468856832 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

  md0 : active raid1 sda1[0] sdd1[2] sdb1[1]
        9767424 blocks [3/3] [UUU]

  unused devices: <none>

I am not sure why the drive devices keep showing up in a different order - note that last time sdc, sdb, and sdd were listed - this time sda, sdd, and sdb are listed. I hope I don't have to look forward to a different problem every time the kernel is updated ...

Dwayne Nelson (edn2) wrote :

Over the weekend some process forced me to reboot the machine, which is a problem because the machine "doesn't have a valid boot signature" as of the last kernel update. This is the third time a failed semi-automatic kernel update has left the computer unusable (I typically accept kernel updates because I hope one of them will fix one of the many USB/scanner problems). Perhaps update-manager should confirm that the update can be completed before actually writing to the drive? I am now in the same situation I was in on April 17th (can't boot). Please let me know if there are any other log files I should try to collect before I re-install with Gutsy.

Dwayne Nelson (edn2) wrote :

Problem solved (again)

I used the liveCD from Gutsy to boot, and re-assembled the array using info from this page:

  http://ubuntuforums.org/showthread.php?t=408461

mounted the volume and then ran lilo following some information I found here:

  http://www.syrlug.org/contrib/boot-loaders.html#LILO

and finally rebooted into Feisty - where the failed kernel update repeated itself. Fortunately, I could still work on the problem because the system hadn't been rebooted.
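
The exact commands used to run lilo from the live CD are not given above. A typical approach, sketched here under that caveat, is to chroot into the mounted root so that lilo reads the installed /etc/lilo.conf; the bind mounts, the `run` helper, the `DRY_RUN` switch, and the `rerun_lilo` name are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: re-run lilo inside the installed system from a live CD session.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rerun_lilo() {
    root="${1:-/mnt/md0}"                       # where the md0 root was mounted
    run sudo mount --bind /dev "$root/dev"      # lilo needs the device nodes
    run sudo mount -t proc proc "$root/proc"
    run sudo chroot "$root" /sbin/lilo -v       # -v: verbose output
    run sudo umount "$root/proc" "$root/dev"
}
```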

This time, I solved the problem by removing the section of my lilo.conf file corresponding to a Windows boot. I am not sure why the Windows section was there (perhaps as a courtesy?) because I didn't add it (when I want to boot Windows, I just select the Windows boot device from the BIOS). In fact, I had never touched the lilo.conf file before today.

In any case, I think the lilo configuration was no longer valid because the order of the drives (as reported by BIOS?) was changed between kernel updates - possibly relating to an issue documented in bug #8497.
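
The "valid boot signature" message suggests that lilo rejects an `other=` entry whose target does not end its first sector with the bytes 0x55 0xAA, which would happen if the device name now points at a different drive. A hedged sketch for checking every `other=` target before running lilo; the function names are invented for illustration, and reading real block devices requires root.

```shell
#!/bin/sh
# Succeeds if bytes 510-511 of the given device or image file are 0x55 0xAA.
has_boot_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Report the signature state of every "other=" target in a lilo.conf-style file.
check_other_entries() {
    conf="${1:-/etc/lilo.conf}"
    sed -n 's/^[[:space:]]*other[[:space:]]*=[[:space:]]*//p' "$conf" |
    while read -r dev; do
        if has_boot_signature "$dev"; then
            echo "ok: $dev"
        else
            echo "NO BOOT SIGNATURE: $dev"
        fi
    done
}
```

An entry reported without a signature is a candidate for correcting or removing, as was done with the Windows section here.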

Anyway, good news today. So until next time - hope NOT to see you so soon.

Launchpad Janitor (janitor) wrote :

[Expired for lilo (Ubuntu) because there has been no activity for 60 days.]
