kernel crash: Unable to handle kernel paging request for data

Bug #1301496 reported by Scott Moser
This bug affects 2 people
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Expired  High        Unassigned

Bug Description

We've seen this happen twice now on ppc64el guests that are probably under load. I don't have a lot of the details on what was going on when they failed, but I have the stack traces.

[101168.836780] Unable to handle kernel paging request for data at address 0x00010001
[101168.836886] Faulting instruction address: 0xc000000000954b60
[101168.836934] Oops: Kernel access of bad area, sig: 11 [#1]
[101168.836971] SMP NR_CPUS=2048 NUMA pSeries
[101168.837020] Modules linked in: veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables dm_crypt
[101168.837234] CPU: 1 PID: 19760 Comm: kworker/u4:0 Not tainted 3.13.0-8-generic #28-Ubuntu
[101168.837294] Workqueue: netns .cleanup_net
[101168.837332] task: c0000003f99d43e0 ti: c0000001cce44000 task.ti: c0000001cce44000
[101168.837386] NIP: c000000000954b60 LR: c000000000954b68 CTR: c000000000954b00
[101168.837439] REGS: c0000001cce47760 TRAP: 0300 Not tainted (3.13.0-8-generic)
[101168.837493] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002024 XER: 00000000
[101168.837620] CFAR: 000000001063ea4c DAR: 0000000000010001 DSISR: 40000000 SOFTE: 1
GPR00: c000000000954b68 c0000001cce479e0 c0000000010b0dd0 0000000000010001
GPR04: f0000000099918f0 c0000002be072380 c000000000954b68 c0000003fe023508
GPR08: 0000000000010000 c000000209fc0000 000000000000000e 0000000000000001
GPR12: 0000000044002028 c00000000fe80300 c0000000000c3f00 c0000002be1e8bc0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000001 c000000000f630fc
GPR24: 0000000000000001 fffffffffffffef7 0000000000000000 c000000000f58638
GPR28: 0000000000000001 c0000003fbdc0000 0000000000002000 0000000000000000
[101168.838355] NIP [c000000000954b60] .tcp_net_metrics_exit+0x60/0x110
[101168.838402] LR [c000000000954b68] .tcp_net_metrics_exit+0x68/0x110
[101168.838448] Call Trace:
[101168.838469] [c0000001cce479e0] [c000000000954b68] .tcp_net_metrics_exit+0x68/0x110 (unreliable)
[101168.838542] [c0000001cce47a70] [c0000000008cc49c] .ops_exit_list.isra.2+0x6c/0xd0
[101168.838605] [c0000001cce47b00] [c0000000008ccef0] .cleanup_net+0x150/0x250
[101168.838662] [c0000001cce47bc0] [c0000000000b9e28] .process_one_work+0x1a8/0x4d0
[101168.838726] [c0000001cce47c60] [c0000000000baaf0] .worker_thread+0x180/0x4a0
[101168.838783] [c0000001cce47d30] [c0000000000c4010] .kthread+0x110/0x130
[101168.838841] [c0000001cce47e30] [c00000000000a160] .ret_from_kernel_thread+0x5c/0x7c
[101168.838903] Instruction dump:
[101168.838940] 7d295030 2f890000 e93d0288 419e0058 3bc00000 3b800001 60000000 60420000
[101168.839031] 7bc81f24 7c69402a 2fa30000 419e0024 <ebe30000> 4b8b809d 60000000 2fbf0000
[101168.839127] ---[ end trace fb028b2b5c006a6a ]---
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14-0ubuntu1
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-3.13.0-19-generic root=UUID=19eaa2f9-0f24-49b9-ba48-24879242481c ro console=hvc0 earlyprintk
ProcVersionSignature: User Name 3.13.0-19.40-generic 3.13.6
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-19-generic N/A
 linux-backports-modules-3.13.0-19-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.13.0-19-generic ppc64le
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy netdev plugdev sudo video
WifiSyslog:

_MarkForUpload: True

Scott Moser (smoser) wrote : AudioDevicesInUse.txt

apport information

tags: added: apport-collected trusty uec-images
description: updated
Scott Moser (smoser) wrote : BootDmesg.txt

apport information

Scott Moser (smoser) wrote : CurrentDmesg.txt

apport information

Scott Moser (smoser) wrote : IwConfig.txt

apport information

Scott Moser (smoser) wrote : ProcCpuinfo.txt

apport information

Scott Moser (smoser) wrote : ProcInterrupts.txt

apport information

Scott Moser (smoser) wrote : ProcModules.txt

apport information

Scott Moser (smoser) wrote : UdevDb.txt

apport information

Scott Moser (smoser) wrote : UdevLog.txt

apport information

Jorge Castro (jorge) wrote :

I was doing a juju deploy when this happened. I got a segfault when running "juju status", and then the terminal/ssh connection froze almost immediately after.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1301496

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key ppc64el
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Can you see if this issue also happens on the 3.13.0-21 kernel? It can be downloaded from:

https://launchpad.net/ubuntu/trusty/+source/linux/3.13.0-21.43

The ppc64el image can be directly downloaded from:
https://launchpad.net/ubuntu/+source/linux/3.13.0-21.43/+build/5866502

Matt Bruzek (mbruzek) wrote :

We just experienced the same problem on wolfe-01 today. We were deploying charms with juju and noticed that juju status did not return the right output.

The kernel that was running is:

Linux wolfe-01 3.13.0-21-generic #43-Ubuntu SMP Mon Mar 31 22:54:04 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux

I got the syslog and the dmesg file off the server and will attach them to this report.

Anton Blanchard (anton-samba) wrote :

Lots going on here. First, looking at the syslog file from Matt, I notice a lot of:

Apr 3 20:57:45 wolfe-01 kernel: [ 4062.074422] jujud[1929]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000

Looks like we smashed our stack. This seems to be a separate issue because we continue on even after these failures.

At some point dpkg-query starts SEGVing:

Apr 3 20:57:54 wolfe-01 kernel: [ 4071.263070] dpkg-query[20115]: unhandled signal 11 at 0000000000010028 nip 0000000010003380 lr 0000000010002968 code 30001
Apr 3 20:57:57 wolfe-01 kernel: [ 4074.437029] dpkg-query[20208]: unhandled signal 11 at 0000000000010028 nip 0000000010003380 lr 0000000010002968 code 30001
Apr 3 20:58:00 wolfe-01 kernel: [ 4077.612284] dpkg-query[20291]: unhandled signal 11 at 0000000000010028 nip 0000000010003380 lr 0000000010002968 code 30001

Joseph Salisbury (jsalisbury) wrote :

Was there a prior Trusty kernel version that did not exhibit this bug?

tags: added: kernel-key
Steve Langasek (vorlon) wrote :

fwiw, I've investigated the dpkg segfaults, and seen the following:

$ gdb dpkg
GNU gdb (Ubuntu 7.7-0ubuntu3) 7.7
[...]
Reading symbols from dpkg...Reading symbols from /usr/lib/debug//usr/bin/dpkg...done.
done.
(gdb) run -l
Starting program: /usr/bin/dpkg -l

Program received signal SIGSEGV, Segmentation fault.
filesdbinit () at ../../src/filesdb.c:571
571 ../../src/filesdb.c: No such file or directory.
(gdb) print bins
$1 = {0x0 <repeats 9441 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 8191 times>, 0x10000,
  0x0 <repeats 8191 times>, 0x10000, 0x0 <repeats 6942 times>}
(gdb)

On a healthy system, this looks like:

(gdb) break filesdbinit
Breakpoint 2 at 0x10003338: file ../../src/filesdb.c, line 565.
(gdb) print bins
$12 = {0x0 <repeats 131072 times>}
(gdb)

Note that bins is an array of pointers.

(gdb) print sizeof(bins[0])
$6 = 8
(gdb)

So once every 8192 elements, there's a wrong bit in the array; 8192*8 is 64k of memory.
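
The arithmetic can be checked mechanically. This is only a sketch with hypothetical helper names; the inputs are the run lengths from the gdb output above (a leading run of 9441 zeros, 15 bad entries each separated by 8191 zeros, and a trailing run of 6942 zeros) and the 8-byte element size:

```c
#include <assert.h>
#include <stddef.h>

/* Byte distance between two bad entries: element stride times element size. */
static size_t corruption_stride_bytes(size_t elem_stride, size_t elem_size)
{
        return elem_stride * elem_size;
}

/*
 * Total number of elements described by a gdb "repeats" dump:
 * a leading run of zeros, `hits` bad entries with `gap` zeros between
 * each consecutive pair, and a trailing run of zeros.
 */
static size_t dump_total_elems(size_t lead, size_t hits, size_t gap, size_t tail)
{
        assert(hits >= 1);
        return lead + hits + (hits - 1) * gap + tail;
}
```

With those inputs, corruption_stride_bytes(8192, 8) is 65536 and dump_total_elems(9441, 15, 8191, 6942) is 131072, which matches the element count gdb reports on the healthy system.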

This could be a bug in any of the kernel, qemu, or the underlying host. Note that after a reboot of wolfe, the VMs are reported to be stable again for the past 72 hours (!). So it's possible this points to a bug with the host OS/kernel.

There is a second P7 system, postal, which has been exhibiting the same kinds of problems as wolfe. Adam can speak to this in more detail, and facilitate any necessary diagnostics on postal.

Adam Conrad (adconrad) wrote :

For what it's worth, the stability offered by a reboot was short-lived, and wolfe's gone back to hating its users.

Andy Whitcroft (apw) wrote :

As this seems to be reproducible, we might want to spin up one of the affected machines with a 4K kernel and see if that avoids the issue. Of course, as we have seen with other bugs, assumptions in the client s/w may be to blame.

Andy Whitcroft (apw) wrote :

I got hold of one of these machines in this "everything is exploding" state and used the test program below to dump out the static variables and obtain the alignment of the corruption. (This program never writes to the data, which rules out a bug in dpkg as the cause.) Note that the corruption is at the start of the page and, although mostly elided here, repeats on each page thereafter:

===
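#include <stdio.h>

/* Zero-initialised static buffer; the program never writes to it. */
static char b[65536 * 16];

int main(void)
{
        size_t p;

        /* Print the base address of the buffer. */
        printf("%08lx\n", (unsigned long)b);
        /* Any non-zero byte must have been put there by something else. */
        for (p = 0; p < sizeof(b); p++) {
                if (b[p]) {
                        printf("%d != 0 @ %zu [%08lx]\n", b[p], p, (unsigned long)&b[p]);
                }
        }
        return 0;
}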
===
10011068
68 != 0 @ 61336 [10020000]
20 != 0 @ 61340 [10020004]
2 != 0 @ 61342 [10020006]
1 != 0 @ 61344 [10020008]
75 != 0 @ 61348 [1002000c]
3 != 0 @ 61349 [1002000d]
2 != 0 @ 61352 [10020010]
8 != 0 @ 61353 [10020011]
[...]
68 != 0 @ 126872 [10030000]
20 != 0 @ 126876 [10030004]
2 != 0 @ 126878 [10030006]
1 != 0 @ 126880 [10030008]
75 != 0 @ 126884 [1003000c]
3 != 0 @ 126885 [1003000d]
2 != 0 @ 126888 [10030010]
8 != 0 @ 126889 [10030011]
[...]
===
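
To make the alignment explicit, the reported addresses can be reduced to an offset within a page. This is a sketch only; the helper names are made up, and the 64 KiB page size is an assumption taken from the 64k spacing seen earlier in this bug:

```c
#include <assert.h>
#include <stdint.h>

/* Offset of addr within its page; page_size must be a power of two. */
static uint64_t page_offset(uint64_t addr, uint64_t page_size)
{
        assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
        return addr & (page_size - 1);
}

/* Index into the test buffer for a given address. */
static uint64_t buffer_index(uint64_t base, uint64_t addr)
{
        return addr - base;
}
```

For the output above, page_offset(0x10020000, 65536) and page_offset(0x10030000, 65536) are both 0, and buffer_index(0x10011068, 0x10020000) gives 61336, the index printed for the first corrupted byte.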

I also dumped the corruption in full in a more readable form. I would note that this seems to contain 'lo' and 'eth0', as if it were networking related:

===
000000 0044 0000 0014 0002 0001 0000 034b 0000
         D \0 \0 \0 024 \0 002 \0 001 \0 \0 \0 K 003 \0 \0
000010 0802 fe80 0001 0000 0008 0001 007f 0100
       002 \b 200 376 001 \0 \0 \0 \b \0 001 \0 177 \0 \0 001
000020 0008 0002 007f 0100 0007 0003 6f6c 0000
        \b \0 002 \0 177 \0 \0 001 \a \0 003 \0 l o \0 \0
000030 0014 0006 ffff ffff ffff ffff bb72 0054
       024 \0 006 \0 377 377 377 377 377 377 377 377 r 273 T \0
000040 bb72 0054 0050 0000 0014 0002 0001 0000
         r 273 T \0 P \0 \0 \0 024 \0 002 \0 001 \0 \0 \0
000050 034b 0000 1802 0080 000c 0000 0008 0001
         K 003 \0 \0 002 030 200 \0 \f \0 \0 \0 \b \0 001 \0
000060 000a 8203 0008 0002 000a 8203 0008 0004
        \n \0 003 202 \b \0 002 \0 \n \0 003 202 \b \0 004 \0
000070 000a ff03 0009 0003 7465 3068 0000 0000
        \n \0 003 377 \t \0 003 \0 e t h 0 \0 \0 \0 \0
000080 0014 0006 ffff ffff ffff ffff bcc8 0054
       024 \0 006 \0 377 377 377 377 377 377 377 377 310 274 T \0
000090 bcc8 0054 0000 0000 0000 0000 0000 0000
       310 274 T \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000a0 0000 0000 0000 0000 0000 0000 0000 0000
        \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
010000 0044 0000 0014 0002 0001 0000 034b 0000
===

I should note at this point that this differs from the corruption as seen by @vorlon which showed a single bit change in each page.

Andy Whitcroft (apw) wrote :

000000 0044 0000
          LEN TYPE
000000 0014 0002 0001 0000 034b 0000
                          LEN TYPE
000010 0802 fe80 0001 0000 0008 0001 007f 0100
                                          LEN TYPE 127.0.0.1
000020 0008 0002 007f 0100 0007 0003 6f6c 0000
          LEN TYPE 127.0.0.1 LEN TYPE lo
000030 0014 0006 ffff ffff ffff ffff bb72 0054
          LEN TYPE
000040 bb72 0054

000040 0050 0000
                          LEN TYPE
000040 0014 0002 0001 0000
                                          LEN TYPE
000050 034b 0000 1802 0080 000c 0000 0008 0001
                                                          LEN TYPE
000060 000a 8203 0008 0002 000a 8203 0008 0004
          10.0.3.130 LEN TYPE 10.0.3.130 LEN TYPE
000070 000a ff03 0009 0003 7465 3068 0000 0000
          10.0.3.255 LEN TYPE eth0

000080 0014 0006 ffff ffff ffff ffff bcc8 0054
          LEN TYPE
000090 bcc8 0054

This looks a little bit like the sort of contents we might expect to see dumped from an RTM_GETADDR request against PF_UNSPEC, which calls out to all of the inet{,6}_fill_ifaddr() handlers, though nested:

[IFA_LOCAL(fe80...), IFA_ADDRESS(127.0.0.1), IFA_LOCAL(127.0.0.1), IFA_LABEL(lo),IFA_CACHEINFO(...)]
[IFA_LOCAL(00008000...??), IFA_ADDRESS(10.0.3.130), IFA_LOCAL(10.0.3.130), IFA_BROADCAST(10.0.3.255), IFA_LABEL(eth0)]

Where:

  IFA_LOCAL (16 bytes, ipv6 or 4 bytes, ipv4)
  IFA_ADDRESS (4 bytes, ipv4)
  IFA_BROADCAST (4 bytes, ipv4)
  IFA_CACHEINFO (16 bytes)
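
One way to test the LEN/TYPE reading is to walk the bytes mechanically. The sketch below is not taken from the kernel or from any tool used in this bug; it assumes the usual Linux netlink layout (a 16-byte message header starting with a 32-bit length and 16-bit type, an 8-byte address header, then 4-byte-aligned attributes each starting with a 16-bit length and 16-bit type, all little-endian here). The helper names and the first_msg array are hypothetical; the array is the first 0x44 bytes of the dump, written in memory order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes from the usual Linux netlink layout (not stated in this bug). */
enum { HDR_LEN = 16, ADDR_HDR_LEN = 8, ATTR_HDR_LEN = 4 };

struct attr {
        uint16_t len;        /* attribute length, including its 4-byte header */
        uint16_t type;       /* attribute type number */
        const uint8_t *data; /* payload, len - 4 bytes */
};

/* First 68 (0x44) bytes of the dump above, in memory order. */
static const uint8_t first_msg[68] = {
        0x44, 0x00, 0x00, 0x00, 0x14, 0x00, 0x02, 0x00,
        0x01, 0x00, 0x00, 0x00, 0x4b, 0x03, 0x00, 0x00,
        0x02, 0x08, 0x80, 0xfe, 0x01, 0x00, 0x00, 0x00,
        0x08, 0x00, 0x01, 0x00, 0x7f, 0x00, 0x00, 0x01,
        0x08, 0x00, 0x02, 0x00, 0x7f, 0x00, 0x00, 0x01,
        0x07, 0x00, 0x03, 0x00, 0x6c, 0x6f, 0x00, 0x00,
        0x14, 0x00, 0x06, 0x00, 0xff, 0xff, 0xff, 0xff,
        0xff, 0xff, 0xff, 0xff, 0x72, 0xbb, 0x54, 0x00,
        0x72, 0xbb, 0x54, 0x00,
};

/* Little-endian readers, independent of host byte order. */
static uint16_t rd16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Walk one message. Returns the number of attributes stored in out[],
 * or -1 if the lengths do not fit the buffer.
 */
static int parse_addr_msg(const uint8_t *buf, size_t buflen,
                          uint32_t *msg_len, uint16_t *msg_type,
                          struct attr *out, int max)
{
        size_t off = HDR_LEN + ADDR_HDR_LEN;
        int n = 0;

        assert(buf != NULL && out != NULL);
        if (buflen < off)
                return -1;
        *msg_len = rd32(buf);
        *msg_type = rd16(buf + 4);
        if (*msg_len < off || *msg_len > buflen)
                return -1;
        while (off + ATTR_HDR_LEN <= *msg_len && n < max) {
                uint16_t alen = rd16(buf + off);

                if (alen < ATTR_HDR_LEN || off + alen > *msg_len)
                        return -1;
                out[n].len = alen;
                out[n].type = rd16(buf + off + 2);
                out[n].data = buf + off + ATTR_HDR_LEN;
                n++;
                off += ((size_t)alen + 3) & ~(size_t)3; /* pad to 4 bytes */
        }
        return n;
}
```

Under those assumptions the first 0x44 bytes parse cleanly as one message of type 0x14 carrying four attributes of types 1, 2, 3 and 6, with "lo" as the payload of the type 3 attribute. That is consistent with the address-dump reading, but it is a cross-check and not proof of where the data came from.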

Scott Moser (smoser) wrote :

Regarding testing with a 4k page kernel: this just happened on wolfe-02, which is running (I believe) a 4k page kernel.
$ grep CONFIG_PPC.*.*PAGES /boot/config-3.13.0-8-generic
CONFIG_PPC_4K_PAGES=y
# CONFIG_PPC_64K_PAGES is not set
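
The grep above reads the on-disk config file. As a cross-check, the page size can also be asked of the running system. A minimal sketch, assuming a POSIX system where sysconf(_SC_PAGESIZE) is available; the helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Page size of the running system in bytes, or -1 on error. */
static long runtime_page_size(void)
{
        return sysconf(_SC_PAGESIZE);
}

/* Number of whole pages needed to hold `bytes` bytes. */
static size_t pages_spanned(size_t bytes, long page_size)
{
        assert(page_size > 0);
        return (bytes + (size_t)page_size - 1) / (size_t)page_size;
}
```

A 4k page kernel should report 4096 from runtime_page_size(), and a 64k page kernel 65536. A 1 MiB buffer like the one in the test program is 256 pages' worth at 4 KiB but only 16 at 64 KiB.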

wolfe-02 login: [241848.101690] Unable to handle kernel paging request for data at address 0x2001400000044
[241848.112613] Faulting instruction address: 0xc000000000954b60
[241848.112704] Oops: Kernel access of bad area, sig: 11 [#1]
[241848.112777] SMP NR_CPUS=2048 NUMA pSeries
[241848.112871] Modules linked in: btrfs xor raid6_pq libcrc32c veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables dm_crypt
[241848.113353] CPU: 1 PID: 10355 Comm: kworker/u4:0 Not tainted 3.13.0-8-generic #28-Ubuntu
[241848.113465] Workqueue: netns .cleanup_net
[241848.113540] task: c0000002192621f0 ti: c000000253eec000 task.ti: c000000253eec000
[241848.113655] NIP: c000000000954b60 LR: c000000000954b68 CTR: c000000000954b00
[241848.113768] REGS: c000000253eef760 TRAP: 0300 Not tainted (3.13.0-8-generic)
[241848.113871] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44002024 XER: 00000000
[241848.114117] CFAR: c0000000001d2820 DAR: 0002001400000044 DSISR: 40000000 SOFTE: 1
GPR00: c000000000954b68 c000000253eef9e0 c0000000010b0dd0 0002001400000044
GPR04: f00000000b7698d8 c00000034674d900 c000000000954b68 c0000003fe023508
GPR08: 0000000000010000 c0000002536a0000 000000000000000e 0000000000000001
GPR12: 0000000044002028 c00000000fe80300 c0000000000c3f00 c000000363207c40
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000001 c000000000f630fc
GPR24: 0000000000000001 fffffffffffffef7 0000000000000000 c000000000f58638
GPR28: 0000000000000001 c0000003506f3900 0000000000002000 0000000000000000
[241848.115585] NIP [c000000000954b60] .tcp_net_metrics_exit+0x60/0x110
[241848.115675] LR [c000000000954b68] .tcp_net_metrics_exit+0x68/0x110
[241848.115762] Call Trace:
[241848.115805] [c000000253eef9e0] [c000000000954b68] .tcp_net_metrics_exit+0x68/0x110 (unreliable)
[241848.115945] [c000000253eefa70] [c0000000008cc49c] .ops_exit_list.isra.2+0x6c/0xd0
[241848.116078] [c000000253eefb00] [c0000000008ccef0] .cleanup_net+0x150/0x250
[241848.116198] [c000000253eefbc0] [c0000000000b9e28] .process_one_work+0x1a8/0x4d0
[241848.116320] [c000000253eefc60] [c0000000000baaf0] .worker_thread+0x180/0x4a0
[241848.116429] [c000000253eefd30] [c0000000000c4010] .kthread+0x110/0x130
[241848.116538] [c000000253eefe30] [c00000000000a160] .ret_from_kernel_thread+0x5c/0x7c
[241848.116659] Instruction dump:
[241848.116731] 7d295030 2f890000 e93d0288 419e0058 3bc00000 3b800001 60000000 60420000
[241848.116920] 7bc81f24 7c69402a 2fa30000 419e0024 <ebe30000> 4b8b809d 60000000 2fbf0000
[241848.117115] ---[ end trace 531dcfc8ed4b2948 ]---
[241848.124367]
[241848.129853] Unable to handle kernel paging request for data at address 0xffffffffffffffd8
[241848.129968] Faulting instruction address: 0xc0000000000c49c0
[241848.130056] Oops: Kernel access of bad area, sig: 11 [#2]
[241848.130128...


tags: removed: kernel-key
Joseph Salisbury (jsalisbury) wrote : Test with newer development kernel (3.13.0-24.46)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

With the recent release of this Ubuntu release, we would like to confirm whether this bug is still present. Please test again with the newer kernel and indicate in the bug whether or not this issue still exists.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get dist-upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

 Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.13.0-24.46
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired