linux-crashdump fails to record crash; reports memory not reserved

Bug #321970 reported by Leigh L. Klotz, Jr.
This bug affects 6 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Nominated for Maverick by Thomas Dreibholz

Bug Description

I'm attempting to isolate a consistent crash on an Intrepid system (upgraded from 8.04.1 LTS) using linux-crashdump, but I am not getting any dumps; instead, I get an error message about memory not being reserved, and /var/crash is empty after a crash. dmesg and /proc/iomem results are included.

Below are the steps I've taken and the results:

I installed linux-crashdump with directions from http://www.linux-archive.org/ubuntu-development/113332-kernel-crash-dumps.html

After installing linux-crashdump (apparently renamed from linux-crashdump-generic) I found I had to run update-grub; otherwise /boot/grub/menu.lst had no changes.

On reboot, I get a message something like "memory was not reserved, please pass in crashkernel=X@Y".
I can't reproduce the exact message here because it doesn't appear in dmesg or log files.

Here is the kernel line that update-grub put in:

kernel /boot/vmlinuz-2.6.27-9-generic root=UUID=67ca572c-8b92-4c8f-9508-7af28eb5e653 ro quiet splash crashkernel=384M-2G:64M@16M,2G-:128M@16M

(A web search for the crashkernel string shows that it's pretty widespread.)

Attached is the dmesg output (I have since added pci-nommconf to the kernel line so you may see that in dmesg).

Here is the result of cat /proc/iomem. I believe this shows no memory is reserved.

00000000-0009f7ff : System RAM
0009f800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000cd1ff : Video ROM
000cd200-000cffff : pnp 00:0d
000e0000-000effff : pnp 00:0d
000f0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-dfedffff : System RAM
  00100000-00383359 : Kernel code
  0038335a-004a567f : Kernel data
  00515000-005c0a1f : Kernel bss
dfee0000-dfee2fff : ACPI Non-volatile Storage
dfee3000-dfeeffff : ACPI Tables
dfef0000-dfefffff : reserved
e0000000-efffffff : PCI Bus 0000:01
  e0000000-efffffff : 0000:01:00.0
f0000000-f3ffffff : reserved
f4000000-f7ffffff : PCI Bus 0000:01
  f4000000-f5ffffff : 0000:01:00.0
  f6000000-f6ffffff : 0000:01:00.0
  f7000000-f701ffff : 0000:01:00.0
f8000000-f8ffffff : PCI Bus 0000:03
f9000000-faffffff : PCI Bus 0000:04
  fa000000-fa000fff : 0000:04:00.0
    fa000000-fa000fff : r8169
fb000000-fb0fffff : PCI Bus 0000:05
  fb000000-fb000fff : 0000:05:02.0
  fb001000-fb001fff : 0000:05:02.0
  fb002000-fb002fff : 0000:05:02.1
  fb003000-fb003fff : 0000:05:02.1
fb100000-fb103fff : 0000:00:1b.0
  fb100000-fb103fff : ICH HD audio
fb104000-fb1043ff : 0000:00:1d.7
  fb104000-fb1043ff : ehci_hcd
fb105000-fb1053ff : 0000:00:1a.7
  fb105000-fb1053ff : ehci_hcd
fb106000-fb1060ff : 0000:00:1f.3
fb200000-fb2fffff : PCI Bus 0000:04
  fb200000-fb21ffff : 0000:04:00.0
fec00000-ffffffff : reserved
  fed00000-fed003ff : HPET 0
  fee00000-fee00fff : Local APIC
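
A reservation made for a crash kernel normally appears in /proc/iomem as an entry labelled "Crash kernel", and the listing above has none. The following is a minimal sketch of that check, not part of linux-crashdump; the function name `crash_kernel_regions` is made up for illustration, and it takes the file contents as a string so it can be tried against a saved copy of the listing.

```python
import re

# One /proc/iomem line looks like "  01000000-04ffffff : Crash kernel".
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def crash_kernel_regions(iomem_text):
    """Return (start, end) address pairs for every 'Crash kernel' entry.

    Hypothetical helper: iomem_text is the full text of /proc/iomem.
    An empty list means no crash kernel memory is listed as reserved.
    """
    regions = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3) == "Crash kernel":
            regions.append((int(match.group(1), 16), int(match.group(2), 16)))
    return regions
```

To use it on a live system, pass it the result of `open("/proc/iomem").read()`, run as root so that the real addresses are shown.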

Nathaniel W. Turner (nturner) wrote :

I'm seeing this too (also on intrepid; clean amd64 install in my case, though).

If I modify my grub config and change "crashkernel=384M-2G:64M@16M,2G-:128M@16M" to "crashkernel=64M@16M", kexec no longer errors out with "please reserve memory ...".

I'm not familiar with that more complex crashkernel=... syntax, but it apparently does not work with Ubuntu's intrepid kernels. (Maybe it works for some people?)

However, now kexec errors out with "Command line overflow". I think Ubuntu kernels *do* support longer kernel command lines, and after reading https://bugzilla.novell.com/show_bug.cgi?id=257968, I suspect this problem is with Ubuntu's kexec assuming on its own that the command line limit is 256 chars. I realize this is a separate bug, but I suspect anyone who gets past the original bug will hit this one.

Nathaniel W. Turner (nturner) wrote :

Fwiw, if I rebuild the kexec-tools package using the latest source in jaunty (20090000-2.0.0ubuntu3), kexec will get past the "Command line overflow" problem and load the crash kernel, but the "crashkernel=384M-2G:64M@16M,2G-:128M@16M" cmdline syntax still fails.

It almost looks like the intent was to use "64M@16M" for machines with less than 2G of RAM, and "128M@16M" for machines with more RAM. Is the kernel supposed to understand this syntax?

Andy Whitcroft (apw) wrote :

This is not a bug in the linux-meta package, moving to the linux package.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
jpoirier (jpoirier) wrote :

As additional information: if you do as Nathaniel says above and use the latest kexec-tools package with the "128M@16M" syntax, you can get past the "Command line overflow" problem and load the crash kernel. However, when the crash kernel is invoked you may get a kernel oops in machine_kexec (arch/x86/kernel/machine_kexec_32.c) due to a NULL pointer dereference.
To avoid that, you need the patch described in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=13265

Jeremy Foshee (jeremyfoshee) wrote :

Hi Leigh,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/lucid.

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 321970

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Leigh L. Klotz, Jr. (klotz-graflex) wrote :

I've not had any more kernel oops to test with. (The cause of the kernel oops I saw was a hardware problem.)
I'm unlikely to figure out how to get upstream kernels to test this with.

Nathaniel W. Turner (nturner) wrote :

I don't have time right now, but if someone wants to see if this bug still exists on Lucid, I think these are the steps:

1. apt-get install linux-crashdump
2. reboot
3. echo c > /proc/sysrq-trigger # triggers a kernel panic
4. Wait a bit for the system to collect the crash dump

The expected result is that the system will collect some kernel debugging information in /var/crash/. If /var/crash is empty after running this test, this bug still exists.

Please include your partition layout (output of 'df') when reporting results.
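
Before and after running the steps above, two things can be checked without crashing the machine: whether a crash kernel is loaded at all, and what is in /var/crash. This is a rough sketch under the assumption that the kernel exposes /sys/kernel/kexec_crash_loaded (it reads "1" when a crash kernel is loaded); the helper names `crash_kernel_loaded` and `crash_reports` are hypothetical and not part of any Ubuntu package.

```python
import os


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True when the kernel reports a loaded crash (kdump) kernel.

    A missing or unreadable file is treated as "not loaded".
    """
    try:
        with open(flag_path) as flag:
            return flag.read().strip() == "1"
    except OSError:
        return False


def crash_reports(crash_dir="/var/crash"):
    """Return the sorted entry names in crash_dir, or [] if it is missing."""
    try:
        return sorted(os.listdir(crash_dir))
    except OSError:
        return []
```

If `crash_kernel_loaded()` is False after the reboot in step 2, triggering the panic in step 3 is unlikely to produce a dump, so it may be worth stopping there and looking at the boot messages first.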

Changed in linux (Ubuntu):
status: Incomplete → New
Nathaniel W. Turner (nturner) wrote :

It may be obvious to some, but I should probably be more explicit in pointing out that the test process I describe above triggers a dirty shutdown, and can cause data loss if run on a system that contains valuable data. Please run it on a test system where you don't mind possibly losing data. (Of course, this caveat applies to running Lucid at all right now, regardless of what you're testing.)

Anders Kaseorg (andersk) wrote :

Doesn’t work for me on Lucid amd64. I have 2 GiB of memory, and I have crashkernel=384M-2G:64M,2G-:128M in /proc/cmdline, but the kernel detects that I have ever-so-slightly less than 2 GiB of memory, and therefore gives me a 64 MiB crashkernel:

[ 0.000000] Reserving 64MB of memory at 32MB for crashkernel (System RAM: 2046MB)

When I trigger a panic, it successfully starts the crashkernel, but then the crashkernel runs out of memory and panics itself.
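
The selection described above can be reproduced with a small sketch. It assumes the range form documented for the kernel, crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset], where each range is start-[end] with the start inclusive and the end exclusive. The function names are made up for illustration, and this is not the kernel's parser. A string with an @offset after each range, like the one in the original report, falls outside that documented form; the sketch does not try to handle it and raises ValueError.

```python
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert '64M' or '2G' to bytes; an empty string means 'not given'."""
    if text == "":
        return None
    suffix = text[-1].upper() if text[-1].isalpha() else ""
    digits = text[:-1] if suffix else text
    return int(digits) * _UNITS[suffix]


def select_crashkernel(value, system_ram):
    """Return (size, offset) in bytes for the given RAM size, or None.

    value is the text after 'crashkernel=' and system_ram is in bytes.
    offset is None when no @offset was given.
    """
    ranges, _, offset_text = value.partition("@")
    offset = parse_size(offset_text)
    if ":" not in ranges:
        # Simple form, e.g. "64M@16M": one size regardless of RAM.
        return parse_size(ranges), offset
    for item in ranges.split(","):
        span, _, size_text = item.partition(":")
        start_text, _, end_text = span.partition("-")
        start = parse_size(start_text)
        end = parse_size(end_text)  # None means the range is open-ended
        if system_ram >= start and (end is None or system_ram < end):
            return parse_size(size_text), offset
    return None
```

With the value from this comment, `select_crashkernel("384M-2G:64M,2G-:128M", 2046 << 20)` picks the 64 MiB entry, because 2046 MiB is just under the 2G boundary; a machine reporting a full 2048 MiB would get the 128 MiB entry instead.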

Rhomboid (rhomboid) wrote :

I have 10 machines on dual core-i7 (amd64) with 16G RAM running 9.10. They're randomly rebooting after months of stability with absolutely no messages in the logs so I've assumed I've got a kernel panic and installed linux-crashdump. I get no output in /var/crash, other than once I had a log file that said it failed attempting to save a crash. The log file disappeared so I don't have the exact message. I also don't have console access so I'm not sure if I'm getting an error about reserved memory that's not in a log file. The machines are in a remote facility.

The wiki page (https://wiki.ubuntu.com/KernelTeam/CrashdumpRecipe) says 'In Karmic all that is needed is to install the "linux-crashdump" package. After a reboot the system should be able to catch crash dumps automatically and provide them to apport.'. This does not appear to be true. The machines also have no direct internet access so apport does not appear to work correctly, even with http_proxy set.

I don't get any crash dumps when simulating/forcing a crash (echo c > /proc/sysrq-trigger) either.

However, I have a test VM with only 2GB of RAM and I do see dumps when I force a crash. I'm wondering if the large memory (16GB) may have something to do with this. The only other thing I have set differently between the VM and the remote machines is kernel.panic=5 in sysctl. I was thinking that may be forcing a reboot before the dump can complete? I'm really not wanting to eliminate that setting though because I *need* those machines to come back if they panic due to them being offsite.

Nathaniel W. Turner (nturner) wrote : Re: [Bug 321970] Re: linux-crashdump fails to record crash; reports memory not reserved

It was a while ago, and I didn't have time to finish looking at this,
but IIRC last I checked, the code in question seemed to make some
assumptions about the way the system was partitioned --- I think it
assumed everything was all in one big / partition. Is your VM
partitioned that way, but your other machine partitioned with a separate
/var partition, by any chance?

On 06/21/2010 08:58 PM, Rhomboid wrote:
> [...]

Rhomboid (rhomboid) wrote :

Both the VM and native machines are one XFS filesystem (/).

Thomas Dreibholz (dreibh) wrote :

I have tested the latest kernel 2.6.35-20-generic under Maverick Beta (on 2-core i686 virtual machine) with linux-crashdump installed. After "echo c > /proc/sysrq-trigger", the system halts with a kernel panic:
[...] BUG: unable to handle kernel NULL pointer dereference at (null)
[...] IP: [<c03ce5a7>] sysrq_handle_crash+0x17/0x20
[...] *pde = 00000000
[...] Oops: 0002 [#1] SMP
...

No crashdump is written at /var/crash.

Changed in linux (Ubuntu):
status: New → Confirmed
Rhomboid (rhomboid) wrote :

Update: Upgrading to 10.04 seems to have fixed crash dumps, in that I now get a core file and a .crash file gets made. Still working on the actual crash but now I have something to work with.

Joseph Salisbury (jsalisbury) wrote :

@Rhomboid, did you have to change any configuration files to get crashdump to work in 10.04, other than following the steps outlined on the page below?

https://wiki.ubuntu.com/KernelTeam/CrashdumpRecipe

Rhomboid (rhomboid) wrote :

It "just worked" after the upgrade. I didn't do any cleanup before the upgrade to restore the system to a pristine state between the initial installation of crashdump and the upgrade to 10.04, but I don't think I really did anything but install the package.

Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix