Quantal : kexec kernel not triggered when kernel panics

Bug #1022561 reported by Louis Bouchard
This bug affects 2 people
Affects          Status         Importance   Assigned to    Milestone
linux (Ubuntu)   Fix Released   Medium       Stefan Bader
Quantal          Fix Released   Medium       Stefan Bader

Bug Description

When kernel crash dump is set up and tested with SysRq-C, kexec does not boot the second (crash) kernel that is needed to copy the vmcore info.

Steps to reproduce the problem in a Quantal VM:

1) set up the console port
2) install linux-crashdump
3) set crashkernel=128M to avoid the current limitation
4) fix the vmcoreinfo bug (remove references to vmcoreinfo as per LP: #988512)
5) recreate initramfs / grub
6) reboot
7) test crash

This triggers the following panic, but kexec does not follow up by booting the preloaded crash kernel required to save the vmcore info:

root@QuantalSA-crash:~# echo c > /proc/sysrq-trigger
[ 23.740777] SysRq : Trigger a crash
[ 23.741555] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 23.743183] IP: [<ffffffff813e2c56>] sysrq_handle_crash+0x16/0x20
[ 23.744175] PGD 3bd7e067 PUD 3b933067 PMD 0
[ 23.744175] Oops: 0002 [#1] SMP
[ 23.744175] CPU 0
[ 23.744175] Modules linked in:
[ 23.744175] kvm ext2 snd_hda_intel microcode snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore virtio_balloon snd_page_alloc psmouse serio_raw i2c_piix4 mac_hid lp parport floppy
[ 23.744175]
[ 23.744175] Pid: 1218, comm: bash Not tainted 3.5.0-3-generic #3-Ubuntu Bochs Bochs
[ 23.744175] RIP: 0010:[<ffffffff813e2c56>] [<ffffffff813e2c56>] sysrq_handle_crash+0x16/0x20
[ 23.744175] RSP: 0018:ffff88003a979e28 EFLAGS: 00010096
[ 23.744175] RAX: 000000000000000f RBX: ffffffff81c7a020 RCX: 0000000000001aaf
[ 23.744175] RDX: 0000000000001aaf RSI: 0000000000000086 RDI: 0000000000000063
[ 23.744175] RBP: ffff88003a979e28 R08: 000000000000019e R09: 000000000000019d
[ 23.744175] R10: 0000000000038e4c R11: 0000000000000000 R12: 0000000000000063
[ 23.744175] R13: 0000000000000286 R14: 0000000000000004 R15: 0000000000000000
[ 23.744175] FS: 00007fed36a0f700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 23.744175] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 23.744175] CR2: 0000000000000000 CR3: 000000003b966000 CR4: 00000000000006f0
[ 23.744175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 23.744175] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 23.744175] Process bash (pid: 1218, threadinfo ffff88003a978000, task ffff88003d7adc00)
[ 23.744175] Stack:
[ 23.744175] ffff88003a979e68 ffffffff813e3301 ffff88003d7adc00 0000000000000002
[ 23.744175] ffff88003a345f00 ffffffff813e33e0 ffff88003c80d380 0000000000000002
[ 23.744175] ffff88003a979e98 ffffffff813e342a ffff88003aad1000 0000000001ea9408
[ 23.744175] Call Trace:
[ 23.744175] [<ffffffff813e3301>] __handle_sysrq+0xb1/0x190
[ 23.744175] [<ffffffff813e33e0>] ? __handle_sysrq+0x190/0x190
[ 23.744175] [<ffffffff813e342a>] write_sysrq_trigger+0x4a/0x50
[ 23.744175] [<ffffffff811e2112>] proc_reg_write+0x82/0xc0
[ 23.744175] [<ffffffff8118214c>] vfs_write+0xac/0x180
[ 23.744175] [<ffffffff8118247a>] sys_write+0x4a/0x90
[ 23.744175] [<ffffffff8167cd29>] system_call_fastpath+0x16/0x1b
[ 23.744175] Code: ff ff 45 01 f4 45 39 65 2c 75 cd 4c 89 ef e8 92 f7 ff ff eb c3 55 48 89 e5 66 66 66 66 90 c7 05 f1 2a a2 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 55 48 89 e5 53 48 83 ec 08 66 66
[ 23.744175] RIP [<ffffffff813e2c56>] sysrq_handle_crash+0x16/0x20
[ 23.744175] RSP <ffff88003a979e28>
[ 23.744175] CR2: 0000000000000000

The same steps done in a Precise VM do trigger kexec.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu quantal (development branch)
Release: 12.10
Codename: quantal
ubuntu@QuantalSA-crash:~$ uname -a
Linux QuantalSA-crash 3.5.0-3-generic #3-Ubuntu SMP Mon Jul 2 16:49:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
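
Before the test crash in step 7 it can help to confirm that a crash kernel is actually reserved and loaded, since a missing kexec could otherwise just mean that the setup steps did not take effect. A minimal Python sketch, assuming the usual /proc/cmdline, /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size files are present; the helper names are hypothetical and not part of linux-crashdump:

```python
import re
from pathlib import Path


def crashkernel_arg(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)crashkernel=(\S+)", cmdline)
    return match.group(1) if match else None


def read_sysfs(path):
    """Return the stripped contents of a sysfs/procfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def kdump_status():
    """Collect the facts worth checking before triggering a test crash."""
    cmdline = read_sysfs("/proc/cmdline") or ""
    return {
        "crashkernel": crashkernel_arg(cmdline),
        "loaded": read_sysfs("/sys/kernel/kexec_crash_loaded") == "1",
        "reserved_bytes": read_sysfs("/sys/kernel/kexec_crash_size"),
    }


if __name__ == "__main__":
    for key, value in kdump_status().items():
        print(f"{key}: {value}")
```

If "loaded" comes back False, a panic would not be expected to kexec at all, independent of the problem described here.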

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1022561

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: quantal
Louis Bouchard (louis)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Hi Louis,

Would it be possible for you to see if this bug also exists in the latest mainline kernel[0]? That will tell us if it is already fixed.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc6-quantal/

tags: added: kernel-da-key
tags: added: regression-release
Changed in linux (Ubuntu):
importance: Undecided → Medium
affects: linux (Ubuntu) → linux-meta (Ubuntu)
Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Louis Bouchard (louis) wrote :

Hello Joe,

I just tested the most recent mainline kernel; it still has the same issue:

$ uname -a
Linux QuantalSA-crash 3.5.0-030500rc6-generic #201207072135 SMP Sun Jul 8 01:35:57 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Joseph Salisbury (jsalisbury) wrote :

We can perform a kernel bisect to identify the exact commit that introduced this regression. First we need to identify the kernel version that introduced this bug.

Can you test the following kernels and report back which ones have the bug:

v3.4 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-quantal/
v3.5-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc1-quantal/

Changed in linux (Ubuntu):
status: Confirmed → Triaged
tags: added: performing-bisect
Stefan Bader (smb) wrote :

While this probably does not help much in finding the issue, I want to add that I was able to get a panic once when kexec'ing into the dump kernel (I had to set nomodeset to avoid higher resolution textmode):

PANIC: early exception 0d rip 10:ffffffff8103ee66 error 77b cr2 0

The RIP points to the retq (actually a near ret, opcode c3), and the 0d is a #GP (general protection) exception. I am not yet sure what the error code 77b means exactly, but it sounds like the return address on the stack may point someplace wrong.

Stefan Bader (smb) wrote :

Hm, what I actually wanted to say was "points to the retq..." of a call to native_irq_enable.

Stefan Bader (smb) wrote :

In all cases where RET causes a #GP with an error code != 0, the error code is a segment selector. Going by the long list of explanations, though, it could refer to either the return code segment or the stack segment.
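
For reference, selector-format error codes on x86 have a fixed layout: bit 0 is EXT, bit 1 is IDT, bit 2 is TI, and bits 3-15 hold the index. A minimal Python sketch that splits the 77b from the panic above into those fields; the function name is hypothetical and this only decodes the bits, it does not explain the fault:

```python
def decode_selector_error_code(code):
    """Split an x86 selector-format exception error code into its fields."""
    return {
        "ext": bool(code & 0x1),        # bit 0: event external to the program
        "idt": bool(code & 0x2),        # bit 1: index refers to an IDT gate
        "ti": bool(code & 0x4),         # bit 2: LDT (1) or GDT (0), only meaningful if idt is 0
        "index": (code >> 3) & 0x1FFF,  # bits 3-15: selector or vector index
    }


fields = decode_selector_error_code(0x77B)
print(fields)
print(hex(fields["index"]))
```

For 77b this yields index 0xef with both EXT and IDT set, which would mean the code names an IDT vector rather than a GDT/LDT selector. Whether that fits the panic seen here is not established in this report.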

Louis Bouchard (louis) wrote :

@stefan

I just tried it with nomodeset added and didn't get kexec to kick in.

@joseph

I will test those two kernels and let you know the outcome. I would gladly do the bisection myself in order to learn, but I'm afraid that building kernels on my laptop will take way too long.

Louis Bouchard (louis) wrote :

@joe

v3.4 final: OK
v3.5-rc1: NOK

Stefan Bader (smb) wrote :

@Louis, yes I can confirm the same on my machine. Will start doing a bisect between the two.

Stefan Bader (smb) wrote :

Ok, bisection ended up with:

commit 722bc6b16771ed80871e1fd81c86d3627dda2ac8
Author: WANG Cong <email address hidden>
Date: Mon Mar 5 15:05:13 2012 -0800

    x86/mm: Fix the size calculation of mapping tables

    For machines that enable PSE, the first 2/4M memory region still uses
    4K pages, so needs more PTEs in this case, but
    find_early_table_space() doesn't count this.

And I was able to make the kdump kernel run by reverting the size calculations to what they were before. Actually, I am quite confident that this change is wrong for the 64bit case. The calling function has some comments about the first 2/4M regions, but that is all enclosed in #ifdef CONFIG_X86_32.

With some debug output I get this on 64bit (without the revert, 77e00000 would be added to the extra space!):
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x77e87fff]
[ 0.000000] [mem 0x00000000-0x77dfffff] page 2M
[ 0.000000] [mem 0x77e00000-0x77e87fff] page 4k
[ 0.000000] mr->end(77e00000)-mr->start(0)=77e00000
[ 0.000000] extra is 88000
[ 0.000000] kernel direct mapping tables up to 0x77e87fff @ [mem 0x1fffc000-0x1fffffff]

So I think a substantial amount of space is wasted, and in the kdump case this again gets us into trouble fitting the initrd+unpacked+kernel. Theoretically the 32bit case should be ok, but I must admit I have not tested that on bare metal yet.
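
The arithmetic behind the debug output above can be sketched roughly as follows. This is a simplified Python model, not the kernel code: it assumes 8-byte table entries, 4K pages, 2M large pages, no 1G pages, and each table level rounded up to whole pages; the function and parameter names are hypothetical.

```python
PAGE_SIZE = 4 << 10   # 4K base page
PMD_SIZE = 2 << 20    # 2M large page
PUD_SIZE = 1 << 30    # 1G per PUD entry
ENTRY_SIZE = 8        # assumed size of one 64bit table entry


def roundup(value, align):
    """Round value up to the next multiple of align."""
    return -(-value // align) * align


def early_table_bytes(end, extra_4k_bytes=0):
    """Estimate early page-table space for a direct mapping of [0, end).

    extra_4k_bytes models memory that is additionally counted as needing
    4K PTEs, e.g. the whole 2M-mapped range in the case described above.
    """
    puds = -(-end // PUD_SIZE)
    pmds = -(-end // PMD_SIZE)
    extra = end % PMD_SIZE + extra_4k_bytes   # tail not coverable by 2M pages
    ptes = -(-extra // PAGE_SIZE)
    return sum(roundup(n * ENTRY_SIZE, PAGE_SIZE) for n in (puds, pmds, ptes))


end = 0x77E87FFF + 1
print(hex(end % PMD_SIZE))                    # the "extra is 88000" value
print(early_table_bytes(end))                 # only the 4k tail counted
print(early_table_bytes(end, 0x77E00000))     # 2M-mapped range counted as well
```

Under these assumptions the first estimate is 16384 bytes, matching the 16K region [mem 0x1fffc000-0x1fffffff] in the output, while counting the extra 77e00000 pushes it to 3944448 bytes, close to 4M.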

Stefan Bader (smb) wrote :

So let's see what upstream thinks about this...

tags: added: patch
tags: removed: performing-bisect
Louis Bouchard (louis) wrote :

@smb @joe

Any news on this bug? Are we hoping to get it fixed for the 12.10 launch?

Stefan Bader (smb) wrote :

The upstream discussion got stuck on the question of why there should be a difference between 32bit and 64bit, which nobody could answer. And the patch that just makes things right, for now, got ignored. So that's the news. Plumbers is coming up soon, and I hope that hpa will be there so I can ask him directly. We will see...

Stefan Bader (smb) wrote :

I hope the background was explained well enough in comment #11. As an experiment I just tried a 64bit KVM VM with 2G of memory. Without the patch in comment #12 the initial page tables are about 4M in size, and I was unable to get the crash-kexec working with either crashkernel= 28M or crashkernel=256M. I wonder whether this is because the page tables use up some memory range that is special.
Anyway, with the patch the initial page tables are only 16K, and crashkernel=128 is now sufficient to make things work.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Quantal):
status: Triaged → Fix Committed
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.5.0-12.12

---------------
linux (3.5.0-12.12) quantal-proposed; urgency=low

  [ Luis Henriques ]

  * [Config] Fix typo on control.stub.in

  [ Ricardo Salveti de Araujo ]

  * [Config] installing omapdrm specific headers for external drivers
    - LP: #1038846

  [ Seth Forshee ]

  * SAUCE: apple-gmux: Fix port address calculation in gmux_pio_write32()

  [ Stefan Bader ]

  * SAUCE: (no-up) x86/mm: Fix 64bit size of mapping tables
    - LP: #1022561

  [ Tim Gardner ]

  * SAUCE: firmware: Remove sb16 files duplicated in linux-firmware

  [ Upstream Kernel Changes ]

  * net: Allow driver to limit number of GSO segments per skb
    - LP: #1037456
    - CVE-2012-3412
  * sfc: Fix maximum number of TSO segments and minimum TX queue size
    - LP: #1037456
    - CVE-2012-3412
  * tcp: Apply device TSO segment limit earlier
    - LP: #1037456
    - CVE-2012-3412
  * cfg80211: add channel flag to prohibit OFDM operation
  * brcmsmac: use channel flags to restrict OFDM
  * gmux: Add generic write32 function
  * apple_gmux: Add support for newer hardware
  * apple_gmux: Fix ACPI video unregister
  * apple-gmux: Fix kconfig dependencies
  * vga_switcheroo: Don't require handler init callback
  * vga_switcheroo: Remove assumptions about registration/unregistration
    ordering
  * apple-gmux: Add display mux support
  * mei: add mei_quirk_probe function
    - LP: #1041164
  * mutex: Place lock in contended state after fastpath_lock failure
    - LP: #1041114
 -- Leann Ogasawara <email address hidden> Fri, 24 Aug 2012 07:13:00 -0700

Changed in linux (Ubuntu Quantal):
status: Fix Committed → Fix Released
Louis Bouchard (louis) wrote :

FYI,

I just confirmed that the 3.5.0-14-generic #19 kernel fixes this issue.

Thanks for your help

Adam Conrad (adconrad) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
