LKCD Not Executing kexec Properly

Bug #710733 reported by Joseph Salisbury
This bug affects 6 people
Affects          Status         Importance   Assigned to             Milestone
linux (Ubuntu)   Fix Released   High         Canonical Kernel Team
Oneiric          Invalid        High         Canonical Kernel Team

Bug Description

I'm attempting to use linux-crashdump to debug an issue. I've been following the documentation at:

https://wiki.ubuntu.com/Kernel/CrashdumpRecipe

The exact steps I've done are:
Installed linux-crashdump:
sudo apt-get install linux-crashdump
Rebooted system to enable crashdump.

My test to force a crash:
echo 1 | sudo tee /proc/sys/kernel/panic_on_oops
echo c | sudo tee /proc/sysrq-trigger
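
Before forcing the crash, it can help to confirm that a crash kernel is actually reserved and loaded, since otherwise the panic has nothing to kexec into. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded; the function names are hypothetical helpers, not part of linux-crashdump:

```shell
# Succeed if the given kernel command line carries a crashkernel= option.
has_crashkernel_option() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if the kernel reports a loaded crash (panic) kernel.
# Assumption: the flag lives in /sys/kernel/kexec_crash_loaded; the path
# can be overridden with $1, which is mainly useful for testing.
crash_kernel_loaded() {
    flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Usage, before writing "c" to /proc/sysrq-trigger:
#   has_crashkernel_option "$(cat /proc/cmdline)" || echo "no crashkernel= on the command line"
#   crash_kernel_loaded || echo "no crash kernel loaded"
```

If either check fails, the forced panic is unlikely to produce a dump, whatever the state of the userspace tools.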

However, no files are ever generated in /var/crash. In fact, the /var/crash directory didn't exist until I created it. After executing echo c | sudo tee /proc/sysrq-trigger, the system locks up. There is a stack trace on the console, but I have not been able to get the system to write this trace to a file. I captured some screenshots, which are attached. I also attached an "alt+sysrq t" trace.

I am able to execute kexec manually, and the system will reboot:
/sbin/kexec --command-line="BOOT_IMAGE=/boot/vmlinuz-2.6.37-11-generic root=UUID=16a635bc-7110-4c13-97bf1a3bb5931a96 ro vt.handoff=7 quiet splash irqpoll maxcpus=1 nousb" --initrd=/boot/initrd.img-2.6.37-11-generic /boot/vmlinuz-2.6.37-11-generic

I've tried this on Lucid, Maverick and Natty using both KVM VMs and physical machines (AMD-based and Intel-based). All these tests fail with the same results (the system locks up and no crashdump data is generated).

For a consistent test case, I am using a netbook, but I can also reproduce this on a server if that is preferred. The netbook's CPU is: Single CPU: Intel(R) Atom(TM) CPU N455 @ 1.66GHz

I have also tried booting with nosmp, with no change.

I will attach the output from ubuntu-bug to this report.

Joseph Salisbury (jsalisbury) wrote :
Joseph Salisbury (jsalisbury) wrote :
Brian Murray (brian-murray) wrote :

I haven't tried to recreate this yet myself but looking at the bug it occurred to me that apport is disabled for all those releases. Could you make sure apport is on by editing /etc/default/apport? Thanks in advance.

Changed in linux-meta (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

@Brian

Thanks for the suggestion. I enabled apport on my 10.04 server and forced the crash. The server seemed to reboot extremely fast. However, no file was written to /var/crash. If I reboot and try to perform the crash again, I get the same results.

I tried forcing a crash again after the very fast reboot (without rebooting after the first crash attempt). I get a stack trace on the console, which I will attach. The file name is LucidCrashdumpConsole.jpg

Here is the crashdump specific info from dmesg:
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-2.6.32-27-server root=/dev/mapper/dl3802--1-root ro crashkernel=384M-2G:64M,2G-:128M quiet
[ 0.000000] Reserving 128MB of memory at 32MB for crashkernel (System RAM: 4863MB)
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-2.6.32-27-server root=/dev/mapper/dl3802--1-root ro crashkernel=384M-2G:64M,2G-:128M quiet
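
The ranged crashkernel= value above can be read as "use the size of the first range that contains the system RAM", which is why this 4863MB machine reserves 128MB. A rough sketch of that selection, assuming only M and G suffixes and no @offset part; to_mb and crashkernel_reservation are hypothetical names, and the kernel's own parser handles more cases:

```shell
# Convert a size such as 384M or 2G to megabytes (only M and G are handled).
to_mb() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  return 1 ;;
    esac
}

# Print the reservation chosen by a ranged crashkernel= value ($2) for a
# machine with $1 MB of RAM. Prints nothing and fails if no range matches.
crashkernel_reservation() {
    ram_mb=$1
    spec=$2
    old_ifs=$IFS
    IFS=,
    set -- $spec            # split "range:size,range:size" on commas
    IFS=$old_ifs
    for entry in "$@"; do
        range=${entry%%:*}  # e.g. 384M-2G or 2G-
        size=${entry#*:}    # e.g. 64M
        start=$(to_mb "${range%%-*}") || return 1
        end=${range#*-}     # empty for an open-ended range
        if [ "$ram_mb" -ge "$start" ]; then
            if [ -z "$end" ] || [ "$ram_mb" -lt "$(to_mb "$end")" ]; then
                echo "$size"
                return 0
            fi
        fi
    done
    return 1
}

crashkernel_reservation 4863 "384M-2G:64M,2G-:128M"   # prints 128M
```

With the same value, a machine with 1024MB of RAM would fall in the first range and get only 64M.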

I also enabled apport on my netbook, which is running Natty, and got slightly better results. I force the crash, and the system reboots, but takes a long time. It appears this is because it's generating a crash file. It eventually hangs and I have to power cycle it. Looking in /var/crash, I see a directory named with the timestamp, and in that directory a file named dump-incomplete.

I will test further and see what else I can find.

Joseph Salisbury (jsalisbury) wrote :
Joseph Salisbury (jsalisbury) wrote :

I continued testing with my netbook running Natty. I doubled maxsize from maxsize=209715200 to maxsize=419430400. This allowed the dump file to be generated! Strange, because the dump file is ~190 MB, so the maxsize of 200 MB should have been enough. I'll continue to play with it to understand the sizes. Not sure, but maybe a second bug should be filed, since the system hangs if maxsize is not set high enough?

I have not tested crashdump on Lucid yet, but I'll also try playing with maxsize there. However, I don't feel hopeful since crashdump caused a stack trace there instead of hanging.

Joseph Salisbury (jsalisbury) wrote :

BTW, I should have been clearer. I doubled the size of maxsize in the /etc/default/apport file.

Brian Murray (brian-murray) wrote :

Actually, maxsize in /etc/default/apport is not used and has been removed from that file. I've been unable to generate a crash report at all using Natty.

Changed in linux-meta (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Joseph Salisbury (jsalisbury) wrote :

Interesting. So maybe there is an issue that causes intermittent failures on Natty. I was able to get it to work a couple of times after increasing maxsize. I'll try setting maxsize back to what it was and see if I can get a crash report. I'm still unable to get a crash report on Lucid server.

One other note: I am unable to get crash reports on any release, including Natty, when using a KVM VM. I can only get successful crash reports on physical machines.

Brian Murray (brian-murray) wrote :

For the life of me I haven't been able to get a crash dump at all. After I echo c to sysrq-trigger the system locks up but never reboots. Additionally, after I manually reboot it apport tells me nothing and there is no crash file in /var/crash.

Changed in linux-meta (Ubuntu):
status: Incomplete → Confirmed
Brian Murray (brian-murray) wrote :

Bug 599601 is likely a duplicate of this.

Joseph Salisbury (jsalisbury) wrote :

I performed some more testing today on a Natty desktop. I'm able to generate a crash dump. However, I've been getting intermittent failures. As you mention, changing maxsize doesn't seem to help. It was just a coincidence that crash dump worked for the first time after I increased maxsize. In all the failures, the system hangs at the following point during dump file creation:

"Copying data : [N%]" <- The percentage at which the hang happens varies.

I had to perform some steps in addition to what's listed on the CrashdumpRecipe wiki. To get crash dump working (although intermittently), I performed the following:

1. Installed linux-crashdump and kdump-tools.
 - Should it be necessary to install kdump-tools? Without kdump-tools, I see the following in /var/crash/vmcore.log:

"/root/usr/bin/makedumpfile: error while loading shared libraries: libdw.so.1: cannot open shared object file: No such file or directory"

 - I noticed makedumpfile lives in /usr/bin/ and not /root/usr/bin.
 - I tried creating a symlink in /root/usr/bin to point to the real makedumpfile in /usr/bin, but I still got the same error.
 - I performed an ldd on makedumpfile in /usr/bin, and all the libraries were found.
 - Again, I tried these things before I installed kdump-tools. Once kdump-tools is installed, the lib load error goes away.

2. I manually created the /var/crash directory.

3. Edited /etc/default/apport; Changed enabled from 0 to 1.

4. Edited /etc/default/kdump-tools:
 - Changed USE_KDUMP from 0 to 1.
 - Uncommented: #KDUMP_SYSCTL="kernel.panic_on_oops=1"
 - Without kdump-tools installed, this file doesn't exist.

5. Edited /etc/default/kexec. Changed LOAD_KEXEC from false to true, but this didn't seem to make a difference.

6. Removed 'quiet splash' from the boot parameters (so I could see where it was hanging).

To trigger a panic, I perform:
echo c | sudo tee /proc/sysrq-trigger
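
The settings from steps 2 to 4 above can be checked in one pass. A minimal sketch, assuming the file locations and variable names quoted in this comment; setting_is and check_crashdump_config are hypothetical helpers:

```shell
# Succeed if FILE ($1) has an uncommented line KEY=VALUE ($2, $3),
# optionally quoted and optionally followed by a trailing comment.
setting_is() {
    file=$1
    key=$2
    value=$3
    [ -r "$file" ] &&
        grep -Eq "^[[:space:]]*${key}=[\"']?${value}[\"']?[[:space:]]*(#.*)?\$" "$file"
}

# Print one line for each prerequisite from the list above that is not met.
check_crashdump_config() {
    setting_is /etc/default/apport enabled 1 ||
        echo "apport: enabled is not set to 1"
    setting_is /etc/default/kdump-tools USE_KDUMP 1 ||
        echo "kdump-tools: USE_KDUMP is not set to 1"
    [ -d /var/crash ] || echo "/var/crash does not exist"
}

# Usage: check_crashdump_config
```

No output means the three prerequisites are in place; it says nothing about whether the dump itself will complete.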

Peter Petrakis (peter-petrakis) wrote :

Concerning:

https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/710733/+attachment/1826111/+files/LucidCrashdumpConsole.jpg

Helps if I actually look at the code:

http://lxr.linux.no/linux+v2.6.37.2/drivers/tty/sysrq.c#L128

static void sysrq_handle_crash(int key)
{
    char *killer = NULL;

    panic_on_oops = 1; /* force panic */
    wmb();
    *killer = 1;
}

There's your NULL pointer dereference :). I had assumed that it
simply calls panic. From this point the panic notifier chain
should be activated and the kexec'd kernel goes to work.
If we're not getting out of the panic handler, that's
a separate issue.

http://lxr.linux.no/linux+v2.6.37.2/kernel/panic.c#L59

NORET_TYPE void panic(const char * fmt, ...)
{
 static char buf[1024];
 va_list args;
 long i, i_next = 0;
 int state = 0;

 /*
  * It's possible to come here directly from a panic-assertion and
  * not have preempt disabled. Some functions called from here want
  * preempt to be disabled. No point enabling it later though...
  */
 preempt_disable();

 console_verbose();
 bust_spinlocks(1);
 va_start(args, fmt);
 vsnprintf(buf, sizeof(buf), fmt, args);
 va_end(args);
 printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
#ifdef CONFIG_DEBUG_BUGVERBOSE
 dump_stack();
#endif

 /*
  * If we have crashed and we have a crash kernel loaded let it handle
  * everything else.
  * Do we want to call this before we try to display a message?
  */
 crash_kexec(NULL);

 kmsg_dump(KMSG_DUMP_PANIC);

We can screw up in multiple ways now: the kexec kernel
could have been loaded incorrectly, or we could have a
good kexec kernel, but the attempts to prepare the HW
to boot the new kernel fail silently or hang.

There's a debug mode for kexec that might be worth enabling,
the switch is '-d'. However, since it seems you can
actually boot the kernel, and get to the initramfs
environment, it seems to me that there's something
wrong with the tools, supporting scripts, or you're
simply out of space.

BTW, what sort of HW is this?

Changed in linux-meta (Ubuntu):
importance: Undecided → High
Daniel Richard G. (skunk) wrote :

I'm trying to get crash dumps working, and there's a number of showstoppers going on (in current Natty).

First, the default crashkernel=... memory allocation for systems with up to 2GB RAM isn't even enough for the Ubuntu kernel: bug #785394

Second, the linux-crashdump package depends on "makedumpfile" instead of "makedumpfile-static". The former dynamically links against a whole bunch of stuff, and fails horribly in the initramfs environment ("error while loading shared libraries: libdw.so.1"). I'm guessing the reason for this is that makedumpfile-static only came into being in Natty.

Third, the makedumpfile-static package has an initramfs hook (/usr/share/initramfs-tools/hooks/makedumpfile) that copies the static makedumpfile binary into the initrd, but this is useless, because kexec-tools' /usr/share/initramfs-tools/scripts/init-bottom/0_kdump unconditionally uses the dynamic /usr/bin/makedumpfile on the mounted root partition! Bug #785425

When I address these issues, I am able to get a proper crash dump:

# ls -l /var/crash/
total 91660
-rw-r--r-- 1 root root 93760303 2011-05-19 17:47 linux-image-2.6.38-8-generic.0.crash
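
To tell which makedumpfile variant is installed at a given path, the description printed by file(1) can be inspected. A small sketch, assuming file(1) is installed and reports linkage in its usual wording; the helper names are hypothetical:

```shell
# Classify a binary from the one-line description printed by file(1).
linkage_from_description() {
    case "$1" in
        *"statically linked"*|*"static-pie linked"*) echo static ;;
        *"dynamically linked"*)                      echo dynamic ;;
        *)                                           echo unknown ;;
    esac
}

# Report the linkage of the binary at $1, following symlinks.
binary_linkage() {
    linkage_from_description "$(file -L "$1" 2>/dev/null)"
}

# Usage:
#   binary_linkage /usr/bin/makedumpfile
#   binary_linkage /bin/makedumpfile-static
```

A "dynamic" result for the binary that ends up being run from the initramfs would be consistent with the libdw.so.1 error quoted above.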

Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Sudhakar Jha (sudhakar-jha000) wrote :

@ Daniel Richard G.

I performed the following steps to resolve the crashdump hang issue.

[Issue 1] default crashkernel=...
Changed the crashkernel to use 128M instead of 64M in /etc/grub.d/10_linux.
dmesg displays 128M reservation as expected.

[Issue 2] makedumpfile throws "error while loading shared libraries: libdw.so.1"
cp /usr/bin/makedumpfile /root/makedumpfile_org
cp /bin/makedumpfile-static /usr/bin/makedumpfile

Even now the system is unable to generate a core file; it hangs.
Please let me know if I missed any step.

Daniel Richard G. (skunk) wrote :

@Sudhakar,

I believe you'll need to regenerate the initrd with e.g. "update-initramfs -k all -u", so that the new makedumpfile binary is copied into it.
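
After regenerating the initrd, it may be worth confirming that makedumpfile really ended up inside it. A sketch, assuming lsinitramfs from initramfs-tools is available on the release in question; the helper names are hypothetical:

```shell
# Succeed if a listing of initrd contents (one path per line on stdin)
# includes a file named exactly makedumpfile.
listing_has_makedumpfile() {
    grep -Eq '(^|/)makedumpfile$'
}

# Succeed if the initrd image at $1 contains makedumpfile.
# Assumption: lsinitramfs is installed and can read this image.
initrd_has_makedumpfile() {
    lsinitramfs "$1" | listing_has_makedumpfile
}

# Usage:
#   initrd_has_makedumpfile "/boot/initrd.img-$(uname -r)" || echo "makedumpfile missing from initrd"
```

This only shows that a binary is present in the image, not whether it is the static or the dynamic build.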

Louis Bouchard (louis) wrote :

This will be the subject of an upcoming bug when I'm back from vacation, but I have been able to highlight a clear problem: when /boot is on a separate partition, which is the default when LVM is used at install time, crashdump will fail, since /boot is empty.

A workaround is possible by doing the following:

# umount /boot
# mount /dev/sda1 /mnt
# cp -pr /mnt/* /boot
# mount /boot

If you are not hit by the dynamically linked makedumpfile issue, the crash dump should then work. I confirmed the issue on Lucid and Natty. I will post the bug number here as soon as the bug is created.
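
Whether a system is exposed to this can be guessed by checking if /boot is its own mount point. A minimal sketch that reads a mount table in /proc/mounts format; is_mount_point is a hypothetical helper:

```shell
# Succeed if the mount table on stdin (in /proc/mounts format) lists the
# directory $1 as a mount point. The mount point is the second field.
is_mount_point() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }'
}

# Usage:
#   is_mount_point /boot < /proc/mounts && echo "/boot is a separate file system"
```

If /boot turns out to be separate, the copy workaround above is what this comment suggests trying.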

Leann Ogasawara (leannogasawara) wrote :

Hi Joe,

This has been nominated against Oneiric but it does not appear to have been tested with Oneiric yet. Care to test and confirm this is still an issue? Thanks.

Changed in linux (Ubuntu Oneiric):
status: Confirmed → Incomplete
Daniel Richard G. (skunk) wrote :

Yep, still doesn't work in Oneiric. The crash kernel can't boot due to insufficient memory, kexec-tools still pulls in a dynamically-linked makedumpfile(8)... has *anyone* worked on this at all?

Changed in linux (Ubuntu Oneiric):
status: Incomplete → Confirmed
Brad Figg (brad-figg)
tags: added: rls-mgr-o-tracking
Kate Stewart (kate.stewart) wrote :

per 20110916 release meeting: not oneiric release critical, removing tracking tag.

tags: removed: rls-mgr-o-tracking
tags: added: rls-mgr-o-tracking
Leann Ogasawara (leannogasawara) wrote :

Bah, assigning to linux-crashdump forces it to linux-meta which is not what I wanted. Moving back to linux for now.

affects: linux (Ubuntu Oneiric) → linux-meta (Ubuntu Oneiric)
affects: linux-meta (Ubuntu Oneiric) → linux (Ubuntu Oneiric)
Brandon Heller (brandon-heller) wrote :

On oneiric I have the same problem, even after doing the same steps as Sudhakar. On a VM w/VMware fusion I can occasionally get a crash dump (one in three or four times), but on a native hardware install I haven't gotten a crash dump yet.

Louis Bouchard (louis) wrote :

@brandon

LP bug 785425 and LP bug 828731 have fixed most of the issues with kdump. Changes are scheduled to be retrofitted from Precise to the older versions.

The only thing I have found while testing all this is when the system has less than 2 GB of RAM. In that case the crashkernel= clause limits the reserved memory to 64M, which seems too small. One workaround is to increase it to 128M, which fixes the issue.

I'll see if I can revive this bug so this last thing gets addressed.

Daniel Richard G. (skunk) wrote :

Technically, the 64MB-too-small bug is #785394, but whatever it takes to get this done!

Brandon Heller (brandon-heller) wrote :

@louis

I'm still unable to trigger a crashdump, and my crash screen looks identical to the original bug reporter's.

I believe there's enough mem for the crashdump:

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.0.0-15-generic root=/dev/mapper/nfcm7-root ro crashkernel=384M-2G:128M,2G-:128M

I already had done:
sudo apt-get install linux-crashdump
cp /usr/bin/makedumpfile /root/makedumpfile_org
cp /bin/makedumpfile-static /usr/bin/makedumpfile
update-initramfs -k all -u
sudo update-grub
mkdir /var/crash
[rebooting]

I also installed the backported fix to kexec-tools from bug 828731.

I'm not sure what else to do from bug 785425.

Louis, do you have any suggestion of what to try next? I've attached a screenshot. Thanks.

Brandon Heller (brandon-heller) wrote :

I ran through the steps on an identical machine and realized that update-initramfs was leaving untouched the kernel I was using (3.0.0-15-generic). Went back to the original machine, switched to a different kernel (3.0.0-12-server), and can reliably generate a crash file there. Other kernels refuse to reboot automatically, though.

Do you know what's required to get a crash dump with a different kernel? In /boot, I see all the same files there for both options. Thanks.
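
One way to spot a kernel whose initrd was not regenerated is to compare the initrd's modification time against the replaced makedumpfile binary. A sketch, assuming the initrd.img-<version> naming seen under /boot and a test(1) that supports -nt (bash and dash do); initrd_is_fresh is a hypothetical helper:

```shell
# Succeed if the initrd for kernel version $1 exists in directory $2
# (default /boot) and is newer than the reference file $3
# (default /usr/bin/makedumpfile).
initrd_is_fresh() {
    version=$1
    boot_dir=${2:-/boot}
    reference=${3:-/usr/bin/makedumpfile}
    initrd="$boot_dir/initrd.img-$version"
    [ -f "$initrd" ] && [ "$initrd" -nt "$reference" ]
}

# Usage:
#   initrd_is_fresh "$(uname -r)" || echo "initrd for the running kernel looks stale"
```

A stale result for the running kernel would match the symptom described here, where one kernel dumps reliably and another does not.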

Michael Thayer (michael-thayer) wrote :

I started getting proper kernel crash dumps last week. Congratulations and thanks to whoever made that work!

dino99 (9d9) wrote :

Closing as per the latest posts.

Changed in linux (Ubuntu Oneiric):
status: Confirmed → Invalid
Changed in linux (Ubuntu):
status: Confirmed → Fix Released