kernel crash kvm guest on power8

Bug #1350889 reported by Scott Moser
This bug affects 3 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  High        Unassigned

Bug Description

The subject is extremely vague, so I'll give all the information I have here.
We've got a number of kvm guests running on a power8 host:

$ head -n 5 /proc/cpuinfo
processor : 0
cpu : POWER8E (raw), altivec supported
clock : 4116.000000MHz
revision : 2.0 (pvr 004b 0200)

$ rpm -qf `which qemu-system-ppc64`
qemu-system-ppc-1.6.0-2.pkvm2_1.9.2.ppc64

$ uname -r
3.10.23-900.pkvm2_1.1.ppc64

$ ps axw | grep [s]tilson-01
 83440 pts/1 S+ 0:00 /bin/bash /home/shared/bin/guest-start stilson-01
 83459 pts/1 Sl+ 4882:24 qemu-system-ppc64 -enable-kvm -M pseries -cpu host -smp cores=2,threads=1 -m 8G -net nic,macaddr=52:54:00:00:43:81 -net tap,script=no,downscript=no,ifname=tap4381 -device spapr-vscsi -drive file=/var/lib/libvirt/images/shared/stilson-01/disk1.img -drive file=/var/lib/libvirt/images/shared/stilson-01/eph0.img -nographic -vga none

The ppc64el Ubuntu 14.04 guest was found dead, with the following on its console:

[708665.375169] Unable to handle kernel paging request for data at address 0x00bf0000
[708665.375321] Faulting instruction address: 0xc0000000004b790c
[708665.375388] Oops: Kernel access of bad area, sig: 11 [#1]
[708665.375440] SMP NR_CPUS=2048 NUMA pSeries
[708665.375513] Modules linked in: 8021q garp mrp ebtable_broute ebtable_filter sunrpc nfnetlink_queue nfnetlink_log nfnetlink btrfs raid6_pq xor ufs msdos xfs libcrc32c veth xt_conntrack ipt_REJECT ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables dm_crypt
[708665.376496] CPU: 0 PID: 270 Comm: cgmanager Not tainted 3.13.0-32-generic #57-Ubuntu
[708665.376644] task: c0000001fa55be10 ti: c0000001f8d68000 task.ti: c0000001f8d68000
[708665.376786] NIP: c0000000004b790c LR: c0000000004b78d8 CTR: c0000000004b7860
[708665.376931] REGS: c0000001f8d6b680 TRAP: 0300 Not tainted (3.13.0-32-generic)
[708665.377084] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 42008484 XER: 20000000
[708665.377411] CFAR: c00000000000a86c DAR: 0000000000bf0000 DSISR: 40000000 SOFTE: 0
GPR00: c0000000004b4710 c0000001f8d6b900 c00000000164edb8 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000000003 0000000000bf0000 0000000000bf0000 c0000000000172a0
GPR12: 0000000022002482 c00000000fe40000 fffffffffffffe80 fffffffffffffe90
GPR16: fffffffffffffea0 fffffffffffffeb0 fffffffffffffec0 0000000000000000
GPR20: 0000010003df8420 c0000000b1bfbe00 0000000000000400 0000000000000000
GPR24: 0000000000000001 c0000000015f2ed8 0000000000000000 c0000000b1bfbdc0
GPR28: c0000000016e4c80 c0000000016e4fc4 c0000000078e6600 c0000001f8d6b9d0
[708665.379401] NIP [c0000000004b790c] .tg_prfill_cpu_rwstat+0xac/0x180
[708665.379517] LR [c0000000004b78d8] .tg_prfill_cpu_rwstat+0x78/0x180
[708665.379643] Call Trace:
[708665.379694] [c0000001f8d6b900] [c000000130138e18] 0xc000000130138e18 (unreliable)
[708665.379854] [c0000001f8d6ba20] [c0000000004b4710] .blkcg_print_blkgs+0xf0/0x1b0
[708665.380026] [c0000001f8d6bae0] [c0000000004b7730] .tg_print_cpu_rwstat+0x50/0x80
[708665.380204] [c0000001f8d6bb70] [c0000000001381ec] .cgroup_seqfile_show+0x9c/0xc0
[708665.380384] [c0000001f8d6bc00] [c000000000287e98] .seq_read+0x158/0x570
[708665.380526] [c0000001f8d6bcf0] [c0000000002560d4] .vfs_read+0xc4/0x1f0
[708665.380679] [c0000001f8d6bd90] [c000000000256ef4] .SyS_read+0x64/0xe0
[708665.380811] [c0000001f8d6be30] [c00000000000a158] syscall_exit+0x0/0x98
[708665.380956] Instruction dump:
[708665.381040] 813d0000 3be100d0 7c6307b4 7f891800 409d00b0 3d420009 392a1ca8 786a1f24
[708665.381268] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
[708665.381505] ---[ end trace 512b8ac8f55926fd ]---
[708665.387710]
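
For anyone reading the oops above, the bracketed word in the instruction dump (<7cead02a>) is the instruction at the faulting NIP. The following is a minimal sketch, not part of the original report, that splits that word into Power ISA X-form fields and recomputes the effective address from the GPR10 and GPR26 values printed in the register dump. The helper names (decode_xform, describe) are hypothetical, and the mnemonic table only covers the one opcode needed here.

```python
# Decode one 32-bit Power ISA X-form instruction word into its bit fields.
# X-form layout (bit 0 is the most significant bit):
#   opcode[0:5] RT[6:10] RA[11:15] RB[16:20] XO[21:30] Rc[31]

def decode_xform(word):
    """Return the X-form fields of a 32-bit instruction word."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rt": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "rb": (word >> 11) & 0x1F,
        "xo": (word >> 1) & 0x3FF,
        "rc": word & 0x1,
    }

# Only the entry needed for this oops: primary opcode 31 with
# extended opcode 21 is ldx (load doubleword indexed).
MNEMONICS = {(31, 21): "ldx"}

def describe(word):
    """Format the word as 'mnemonic rT,rA,rB' (or 'unknown ...')."""
    f = decode_xform(word)
    name = MNEMONICS.get((f["opcode"], f["xo"]), "unknown")
    return "%s r%d,r%d,r%d" % (name, f["rt"], f["ra"], f["rb"])

if __name__ == "__main__":
    print(describe(0x7CEAD02A))  # ldx r7,r10,r26
    # Register values copied from the oops register dump.
    gpr10 = 0x0000000000BF0000
    gpr26 = 0x0000000000000000
    # ldx loads from (RA) + (RB); this sum matches the reported DAR.
    print(hex(gpr10 + gpr26))  # 0xbf0000
```

The decoded load address equals the DAR (0x00bf0000), which suggests the base value in GPR10 was not a valid kernel pointer at the time of the load.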

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-32-generic 3.13.0-32.57
ProcVersionSignature: Ubuntu 3.13.0-32.57-generic 3.13.11.4
Uname: Linux 3.13.0-32-generic ppc64le
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Thu Jul 31 14:32:40 2014
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-3.13.0-32-generic root=UUID=44b83b05-4fa2-4d3f-b3af-4c8a4d097f90 ro console=hvc0 earlyprintk
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-32-generic N/A
 linux-backports-modules-3.13.0-32-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1350889

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Scott Moser (smoser) wrote :

Not a lot more info, but it happened again:
[54012.344169] Unable to handle kernel paging request for data at address 0x00bf0000
[54012.344428] Faulting instruction address: 0xc0000000004b790c
[54012.344566] Oops: Kernel access of bad area, sig: 11 [#1]
[54012.344684] SMP NR_CPUS=2048 NUMA pSeries
[54012.344800] Modules linked in: ebtable_broute ebtable_filter sunrpc veth xt_conntrack ipt_REJECT ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables dm_crypt
[54012.345601] CPU: 0 PID: 279 Comm: cgmanager Not tainted 3.13.0-32-generic #57-Ubuntu
[54012.345760] task: c0000001f8d266f0 ti: c0000000030b4000 task.ti: c0000000030b4000
[54012.345917] NIP: c0000000004b790c LR: c0000000004b78d8 CTR: c0000000004b7860
[54012.346056] REGS: c0000000030b7680 TRAP: 0300 Not tainted (3.13.0-32-generic)
[54012.346223] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 42008484 XER: 20000000
[54012.346559] CFAR: c00000000000a86c DAR: 0000000000bf0000 DSISR: 40000000 SOFTE: 0
GPR00: c0000000004b4710 c0000000030b7900 c00000000164edb8 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000000003 0000000000bf0000 0000000000bf0000 c0000000000172a0
GPR12: 0000000022002482 c00000000fe40000 fffffffffffffe80 fffffffffffffe90
GPR16: fffffffffffffea0 fffffffffffffeb0 fffffffffffffec0 0000000000000000
GPR20: 0000010032aa8ba0 c0000000039d3040 0000000000000400 0000000000000000
GPR24: 0000000000000001 c0000000015f2ed8 0000000000000000 c0000000039d3000
GPR28: c0000000016e4c80 c0000000016e4fc4 c0000001f95cb600 c0000000030b79d0
[54012.348612] NIP [c0000000004b790c] .tg_prfill_cpu_rwstat+0xac/0x180
[54012.348731] LR [c0000000004b78d8] .tg_prfill_cpu_rwstat+0x78/0x180
[54012.348855] Call Trace:
[54012.348906] [c0000000030b7900] [c0000000570ac720] 0xc0000000570ac720 (unreliable)
[54012.349086] [c0000000030b7a20] [c0000000004b4710] .blkcg_print_blkgs+0xf0/0x1b0
[54012.349245] [c0000000030b7ae0] [c0000000004b7730] .tg_print_cpu_rwstat+0x50/0x80
[54012.349424] [c0000000030b7b70] [c0000000001381ec] .cgroup_seqfile_show+0x9c/0xc0
[54012.349617] [c0000000030b7c00] [c000000000287e98] .seq_read+0x158/0x570
[54012.349787] [c0000000030b7cf0] [c0000000002560d4] .vfs_read+0xc4/0x1f0
[54012.349956] [c0000000030b7d90] [c000000000256ef4] .SyS_read+0x64/0xe0
[54012.350119] [c0000000030b7e30] [c00000000000a158] syscall_exit+0x0/0x98
[54012.350271] Instruction dump:
[54012.350348] 813d0000 3be100d0 7c6307b4 7f891800 409d00b0 3d420009 392a1ca8 786a1f24
[54012.350611] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
[54012.350888] ---[ end trace 204d43fdb85141e4 ]---
[54012.357051]
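
To check whether two oopses are the same crash, it can help to compare a few key fields instead of eyeballing the full dumps. Below is a small sketch, not from the original comments, that pulls the process name, NIP, DAR, and faulting symbol out of oops text with regular expressions. The function name oops_signature is hypothetical, and the sample strings are a few lines copied from the two traces in this report.

```python
import re

# One pattern per field of interest in a powerpc oops.
FIELDS = {
    "comm": re.compile(r"Comm: (\S+)"),
    "nip": re.compile(r"NIP: ([0-9a-f]{16})"),
    "dar": re.compile(r"DAR: ([0-9a-f]{16})"),
    "symbol": re.compile(r"NIP \[[0-9a-f]{16}\] (\S+)"),
}

def oops_signature(text):
    """Return a dict of the first match for each field, or None if absent."""
    sig = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(text)
        sig[name] = match.group(1) if match else None
    return sig

# Lines copied from the first and second oops in this report.
FIRST = """\
[708665.376496] CPU: 0 PID: 270 Comm: cgmanager Not tainted 3.13.0-32-generic #57-Ubuntu
[708665.376786] NIP: c0000000004b790c LR: c0000000004b78d8 CTR: c0000000004b7860
[708665.377411] CFAR: c00000000000a86c DAR: 0000000000bf0000 DSISR: 40000000 SOFTE: 0
[708665.379401] NIP [c0000000004b790c] .tg_prfill_cpu_rwstat+0xac/0x180
"""

SECOND = """\
[54012.345601] CPU: 0 PID: 279 Comm: cgmanager Not tainted 3.13.0-32-generic #57-Ubuntu
[54012.345917] NIP: c0000000004b790c LR: c0000000004b78d8 CTR: c0000000004b7860
[54012.346559] CFAR: c00000000000a86c DAR: 0000000000bf0000 DSISR: 40000000 SOFTE: 0
[54012.348612] NIP [c0000000004b790c] .tg_prfill_cpu_rwstat+0xac/0x180
"""

if __name__ == "__main__":
    a = oops_signature(FIRST)
    b = oops_signature(SECOND)
    print(a)
    print("same signature:", a == b)
```

For the two traces here, all four fields match: same process (cgmanager), same NIP, same DAR, and the same offset into .tg_prfill_cpu_rwstat.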

Jorge Castro (jorge) wrote :

(Trying to add workload information)

This bug does not seem to manifest on the machine unless it is under heavy CPU and IO load. We are using Juju with the LXC provider to bootstrap an environment, then deploy a charm bundle, run tests, then tear it down.

We are testing 170 charms, so we wrote a script to do this. It gets about 50% of the way through before the machine hard locks. We are not using LXC snapshots with btrfs, if that helps.
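
The call traces in this report show cgmanager reading a cgroup file that is served by tg_print_cpu_rwstat. The sketch below is not a confirmed reproducer; it only performs the same kind of read from userspace. It rests on two assumptions: that the blkio controller is mounted at /sys/fs/cgroup/blkio, and that the files backed by that function in a 3.13 kernel are blkio.throttle.io_service_bytes and blkio.throttle.io_serviced (my reading of block/blk-throttle.c, not something stated in this report). The function name read_throttle_stats is hypothetical.

```python
import os

# Assumed to be the cgroup files served by tg_print_cpu_rwstat in 3.13.
STAT_FILES = ("blkio.throttle.io_service_bytes", "blkio.throttle.io_serviced")

def read_throttle_stats(root):
    """Read every throttle stat file under root; return {path: contents}."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in STAT_FILES:
            if name not in filenames:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    results[path] = f.read()
            except OSError:
                # The cgroup may have been removed between the walk and
                # the open, which is expected while containers are torn down.
                continue
    return results

if __name__ == "__main__":
    # Assumed mount point; adjust for the system under test.
    root = "/sys/fs/cgroup/blkio"
    stats = read_throttle_stats(root)
    print("read %d throttle stat files under %s" % (len(stats), root))
```

Running this in a loop while containers are created and destroyed would approximate the read side of the workload described above, though whether that is enough to trigger the oops is untested.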

Jorge Castro (jorge) wrote :

Upgraded the kernel to the binaries apw provided: http://people.canonical.com/~apw/lp1350889-trusty/

Kicked off the testing script; it will take a few hours to run.
