IBM POWER8 unhandled signal 11 / SEGV

Bug #1508767 reported by Haw Loeung
This bug affects 3 people
Affects                        Status     Importance  Assigned to  Milestone
Ubuntu Cloud Archive           Invalid    Undecided   Unassigned
apparmor (Ubuntu)              Invalid    Undecided   Unassigned
linux (Ubuntu)                 Confirmed  Medium      Unassigned
linux-meta-lts-vivid (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Hi,

We have a few IBM POWER8 servers which we're currently using as OpenStack nova compute nodes. It seems we're regularly running into issues where processes are segfaulting:

| hloeung@gligar:~$ zgrep -E '(SEGV)|(unhandled signal 11)' /var/log/syslog.5.gz
| Oct 16 23:31:38 gligar kernel: [88351.465559] neutron-openvsw[29733]: unhandled signal 11 at 88f9010000000000 nip 00000000100ba0d8 lr 00000000101ad860 code 30001
| Oct 16 23:31:38 gligar kernel: [88351.566909] init: neutron-plugin-openvswitch-agent main process (29733) killed by SEGV signal
| Oct 16 23:31:38 gligar kernel: [88351.746611] apport[29500]: unhandled signal 11 at 8850e467250040a8 nip 0000000010201f80 lr 0000000010202984 code 30001
| Oct 16 23:31:39 gligar kernel: [88352.245829] neutron-rootwra[29749]: unhandled signal 11 at 0809c4b610000000 nip 000000001014ae4c lr 000000001014b544 code 30001
| Oct 16 23:31:50 gligar kernel: [88364.040340] neutron-rootwra[30060]: unhandled signal 11 at 08a305c12b000000 nip 00000000100b74d0 lr 00000000100b73e4 code 30001
| Oct 16 23:31:51 gligar kernel: [88364.174218] neutron-rootwra[30065]: unhandled signal 11 at 088eb28e2f004078 nip 00000000100b5974 lr 00000000100aa794 code 30001
| Oct 16 23:31:52 gligar kernel: [88365.195380] neutron-rootwra[30098]: unhandled signal 11 at 88c939e322000008 nip 00000000100c8b28 lr 0000000010060384 code 30001
| Oct 16 23:31:52 gligar kernel: [88365.362374] neutron-rootwra[30106]: unhandled signal 11 at 882c58ad2800f04f nip 00003fffaef81220 lr 00003fffaef811a0 code 30001
| Oct 16 23:32:27 gligar kernel: [88400.966976] neutron-rootwra[30341]: unhandled signal 11 at 88d1fbe922001008 nip 00000000100c8b28 lr 0000000010060384 code 30001
| Oct 16 23:32:47 gligar kernel: [88420.953053] neutron-rootwra[30412]: unhandled signal 11 at 11b6629054008000 nip 00003fff9a864ac4 lr 00003fff9a84c42c code 30001
| Oct 16 23:34:49 gligar kernel: [88542.778503] neutron-rootwra[30977]: unhandled signal 11 at 88540f00000010a8 nip 00000000100aa768 lr 00000000100b74e8 code 30001
| Oct 16 23:35:23 gligar kernel: [88576.700721] neutron-openvsw[29739]: unhandled signal 11 at 08bfcbf7210000a8 nip 00000000100ab390 lr 00000000100b7c38 code 30001
| Oct 16 23:35:23 gligar kernel: [88576.804961] init: neutron-plugin-openvswitch-agent main process (29739) killed by SEGV signal
| Oct 16 23:36:01 gligar kernel: [88614.995497] nova-compute[31662]: unhandled signal 11 at 8846c1c81f004008 nip 000000001014c2f0 lr 0000000010151080 code 30001
| Oct 16 23:36:02 gligar kernel: [88615.110735] nova-compute[4331]: unhandled signal 11 at 88befae9220010a8 nip 00000000100b5c8c lr 000000001014c734 code 30001
| Oct 16 23:36:02 gligar kernel: [88615.219436] init: nova-compute main process (4331) killed by SEGV signal
| Oct 17 03:59:56 gligar kernel: [104449.890256] landscape-packa[63283]: unhandled signal 11 at 02f0000000000008 nip 00000000101abeac lr 00000000100a8738 code 30001
| Oct 17 04:05:00 gligar kernel: [104753.718195] sudo[63915]: unhandled signal 11 at 08e06105d1dcfff8 nip 00003fffb15cf7e4 lr 00003fffb15cfa00 code 30001

| hloeung@floette:~$ zgrep -E '(SEGV)|(unhandled signal 11)' /var/log/syslog.7.gz
| Oct 14 16:55:30 floette kernel: [149326.697938] rsync[9915]: unhandled signal 11 at 00003ffff7cb0000 nip 00003fffa242d054 lr 00003fffa2426560 code 30001
| Oct 14 21:05:57 floette kernel: [164353.333697] apparmor_parser[102284]: unhandled signal 11 at 08680f0000000000 nip 000000001004bbf8 lr 0000000010028de4 code 30001
| Oct 14 22:21:24 floette kernel: [168880.481778] neutron-rootwra[153488]: unhandled signal 11 at 8860fbe21f0000a8 nip 00000000100aa768 lr 00000000100b74e8 code 30001
| Oct 14 22:21:26 floette kernel: [168882.078608] neutron-openvsw[4546]: unhandled signal 11 at 8822cbf03d000008 nip 00000000100aa764 lr 00000000100e6900 code 30001
| Oct 14 22:21:37 floette kernel: [168893.597834] init: neutron-plugin-openvswitch-agent main process (4546) killed by SEGV signal
| Oct 14 22:21:39 floette kernel: [168894.949777] nova-rootwrap[153708]: unhandled signal 11 at 88d495c93c0000a8 nip 00000000100a57d4 lr 00000000100ab42c code 30001
| Oct 14 22:21:43 floette kernel: [168898.973700] neutron-rootwra[153847]: unhandled signal 11 at 08c90df318000020 nip 00000000101ac260 lr 00000000101ad92c code 30001
| Oct 14 22:21:44 floette kernel: [168900.785421] neutron-rootwra[153850]: unhandled signal 11 at 88d87b783f0000a8 nip 00000000101abf40 lr 00000000100d9cac code 30001
| Oct 14 22:21:46 floette kernel: [168902.724121] neutron-openvsw[153852]: unhandled signal 11 at 882b78783f0000a8 nip 00000000100b5c8c lr 000000001014c734 code 30001

| hloeung@patrat:~$ zgrep -E '(SEGV)|(unhandled signal 11)' /var/log/syslog.7.gz
| Oct 15 00:48:13 patrat kernel: [553143.677075] rsync[89656]: unhandled signal 11 at 00003fffe6a50000 nip 00003fff77e0d054 lr 00003fff77e06560 code 30001

| Oct 16 02:42:03 wailmer kernel: [862104.157449] nova-compute[11431]: unhandled signal 11 at 081169bc370000a8 nip 00000000100ac164 lr 00000000100b7d6c code 30001
| Oct 16 02:42:03 wailmer kernel: [862104.264242] init: nova-compute main process (11431) killed by SEGV signal
| Oct 16 06:38:22 wailmer kernel: [876282.603855] qemu-img[78662]: unhandled signal 11 at 11b625104e000000 nip 00003fffb6224bb4 lr 00003fffb620c42c code 30001
| Oct 16 06:38:23 wailmer kernel: [876283.336045] qemu-system-ppc[78609]: unhandled signal 11 at ffffffc10000009a nip 00003fffae1a7124 lr 0000000010314874 code 30001
| Oct 16 06:39:40 wailmer kernel: [876360.399550] neutron-rootwra[79380]: unhandled signal 11 at 0800c20428000000 nip 00000000100a6c14 lr 00000000100a6d4c code 30001
| Oct 16 06:39:47 wailmer kernel: [876367.577184] neutron-rootwra[79676]: unhandled signal 11 at 0878a100000040a8 nip 00000000100aa768 lr 000000001004ed6c code 30001
| Oct 16 06:39:49 wailmer kernel: [876369.478066] neutron-openvsw[12655]: unhandled signal 11 at 088e47f11f000008 nip 00000000100db46c lr 00000000100db424 code 30001
| Oct 16 06:39:58 wailmer kernel: [876378.286827] init: neutron-plugin-openvswitch-agent main process (12655) killed by SEGV signal
| Oct 16 06:39:59 wailmer kernel: [876379.211801] sudo[79703]: unhandled signal 11 at 886baddd38005000 nip 886baddd38005000 lr 00003fff7da870a8 code 30001
| Oct 16 06:40:00 wailmer kernel: [876380.344562] libvirtd[109725]: unhandled signal 11 at 88806be02f000000 nip 00003fff78a70684 lr 00003fff78ab7a5c code 30001
| Oct 16 06:40:06 wailmer kernel: [876386.781123] init: libvirt-bin main process (109725) killed by SEGV signal
| Oct 16 06:40:06 wailmer kernel: [876386.818672] sudo[79919]: unhandled signal 11 at 11bda1eb70000000 nip 00003fff82094ac4 lr 00003fff8207c42c code 30001
| Oct 16 06:40:06 wailmer kernel: [876386.921414] neutron-openvsw[79689]: unhandled signal 11 at 88f8010000005000 nip 00000000100ba0d8 lr 00000000100c97c8 code 30001
| Oct 16 06:40:06 wailmer kernel: [876387.024431] init: neutron-plugin-openvswitch-agent main process (79689) killed by SEGV signal

These servers are all running Trusty with the hwe-v kernel (3.19.0-31-generic #36~14.04.1-Ubuntu).
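
For a quick view of which processes are hit most often, the zgrep output above can be tallied per process name. This is a minimal sketch, not part of the original report: count_segv_by_process is a hypothetical helper name, and it assumes the kernel message format shown above ("name[pid]: unhandled signal 11 at ...").

```shell
# Hypothetical helper (not from the original report): tally
# "unhandled signal 11" kernel messages per process name.
# Reads syslog lines on stdin; prints "<count> <name>", highest count first.
count_segv_by_process() {
    awk '/unhandled signal 11/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "unhandled") {
                name = $(i - 1)                 # e.g. "nova-compute[4331]:"
                sub(/\[[0-9]+\]:$/, "", name)   # strip the "[pid]:" suffix
                count[name]++
                break
            }
        }
    }
    END { for (name in count) print count[name], name }' | sort -rn
}
```

It could be fed the same way as the greps above, for example: zgrep -h 'unhandled signal 11' /var/log/syslog* | count_segv_by_process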

ProblemType: Crash
DistroRelease: Ubuntu 14.04
Package: nova-compute 1:2015.1.1-0ubuntu1~cloud2 [origin: Canonical]
ProcVersionSignature: Ubuntu 3.19.0-30.34~14.04.1-generic 3.19.8-ckt6
Uname: Linux 3.19.0-30-generic ppc64le
ApportVersion: 2.14.1-0ubuntu3.16
Architecture: ppc64el
CrashDB:
 {
                "impl": "launchpad",
                "project": "cloud-archive",
                "bug_pattern_url": "http://people.canonical.com/~ubuntu-archive/bugpatterns/bugpatterns.xml",
             }
Date: Fri Oct 16 23:30:00 2015
ExecutablePath: /usr/bin/nova-compute
InterpreterPath: /usr/bin/python2.7
PackageArchitecture: all
ProcCmdline: /usr/bin/python /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
ProcLoadAvg: 1.98 1.32 1.28 3/1516 7754
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -1
ProcVersion: Linux version 3.19.0-30-generic (buildd@fisher04) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #34~14.04.1-Ubuntu SMP Fri Oct 2 22:21:52 UTC 2015
Signal: 6
SourcePackage: nova
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: libvirtd
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_smt: SMT is off
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Oct 22 03:34 seq
 crw-rw---- 1 root audio 116, 33 Oct 22 03:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.18
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 14.04
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Package: linux-meta-lts-vivid
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_GB
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=fcd256a9-8aa6-4805-95ae-f8c635967753 ro console=ttyS1
ProcLoadAvg: 3.77 2.83 2.55 3/1574 89091
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -1
ProcVersion: Linux version 3.19.0-31-generic (buildd@fisher04) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #36~14.04.1-Ubuntu SMP Thu Oct 8 10:25:49 UTC 2015
ProcVersionSignature: Ubuntu 3.19.0-31.36~14.04.1-generic 3.19.8-ckt7
RelatedPackageVersions:
 linux-restricted-modules-3.19.0-31-generic N/A
 linux-backports-modules-3.19.0-31-generic N/A
 linux-firmware 1.127.16
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.19.0-31-generic ppc64le
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm
_MarkForUpload: True
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 2.016 GHz (cpu 80)
 max: 3.691 GHz (cpu 32)
 avg: 3.527 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No firmware implementation of function
cpu_smt: SMT is off

Haw Loeung (hloeung) wrote :
William Grant (wgrant) wrote :

These machines are in scalingstack, so they run a great many instances, most of which live for less than 15 minutes each. The configuration is currently 3.19-on-3.19, and it shows rare memory corruption in guests, plus frequent segfaults and occasional kernel hangs on the host.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Haw Loeung (hloeung)
information type: Private → Public
Haw Loeung (hloeung) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
Haw Loeung (hloeung) wrote : CRDA.txt

apport information

Haw Loeung (hloeung) wrote : CurrentDmesg.txt

apport information

Haw Loeung (hloeung) wrote : DeviceTree.tar.gz

apport information

Haw Loeung (hloeung) wrote : IwConfig.txt

apport information

Haw Loeung (hloeung) wrote : Lspci.txt

apport information

Haw Loeung (hloeung) wrote : ProcCpuinfo.txt

apport information

Haw Loeung (hloeung) wrote : ProcInterrupts.txt

apport information

Haw Loeung (hloeung) wrote : ProcLocks.txt

apport information

Haw Loeung (hloeung) wrote : ProcMisc.txt

apport information

Haw Loeung (hloeung) wrote : ProcModules.txt

apport information

Haw Loeung (hloeung) wrote : ProcPpc64.tar.gz

apport information

Haw Loeung (hloeung) wrote : UdevDb.txt

apport information

Haw Loeung (hloeung) wrote : UdevLog.txt

apport information

Haw Loeung (hloeung) wrote : WifiSyslog.txt

apport information

Haw Loeung (hloeung) wrote : nvram.gz

apport information

Andy Whitcroft (apw)
Changed in apparmor (Ubuntu):
status: New → Invalid
Chris J Arges (arges) wrote :

1) Can you test an lts-wily kernel which is in our CKT PPA:
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+packages

2) How many times a day does this actually occur? Does this only occur on nova nodes? Are they fairly loaded in terms of memory when this occurs?

3) Another potential test would be to disable KSM to see if that's the culprit. As root:
 echo 0 > /sys/kernel/mm/ksm/run

4) Can you get the machine to generate userspace core dumps when programs segv?
 ulimit -c unlimited

I can also generate a kernel which BUG()s on _exception with code 30001, which may give us more insight. But the above information might help.
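
Before and after the KSM test in point 3, it may help to confirm what state KSM is actually in. The sketch below is an assumption-laden illustration, not something from this report: ksm_status and the KSM_DIR override are hypothetical names, and it only reads the standard run and pages_shared files under /sys/kernel/mm/ksm. Note that writing 0 to run stops ksmd but leaves already-merged pages merged; writing 2 stops it and also unmerges them.

```shell
# Hypothetical helper (not from the original report): report whether KSM
# is active. KSM_DIR is an assumption added so the function can be pointed
# at a scratch directory; by default it reads the real sysfs files.
ksm_status() {
    dir=${KSM_DIR:-/sys/kernel/mm/ksm}
    run=$(cat "$dir/run") || return 1
    shared=$(cat "$dir/pages_shared") || return 1
    case $run in
        0) state="stopped (merged pages kept)" ;;
        1) state="running" ;;
        2) state="stopped (pages unmerged)" ;;
        *) state="unknown" ;;
    esac
    echo "ksm: $state, pages_shared=$shared"
}
```

If pages_shared is still non-zero after writing 0, previously merged pages are still in place, which may matter when judging whether KSM has really been ruled out.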

Changed in linux (Ubuntu):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → Medium
Haw Loeung (hloeung) wrote :

7 day log rotate:

| hloeung@floette:~$ zgrep -h SEGV /var/log/syslog*
| Oct 28 14:46:34 floette kernel: [1351174.845829] init: rsyslog main process (2652) killed by SEGV signal

| hloeung@bagon:~$ zgrep -h SEGV /var/log/syslog*
| Nov 2 22:17:03 bagon kernel: [2401829.665556] init: neutron-plugin-openvswitch-agent main process (124418) killed by SEGV signal
| Nov 3 00:20:46 bagon kernel: [2409252.286349] init: nova-compute main process (125673) killed by SEGV signal
| Oct 26 11:51:25 bagon kernel: [1759496.565022] init: nova-compute main process (94922) killed by SEGV signal
| Oct 26 11:56:08 bagon kernel: [1759778.693294] init: neutron-plugin-openvswitch-agent main process (18574) killed by SEGV signal
| Oct 26 11:56:23 bagon kernel: [1759794.417232] init: neutron-plugin-openvswitch-agent main process (95171) killed by SEGV signal

| hloeung@gligar:~$ zgrep -h SEGV /var/log/syslog*
| Oct 30 18:32:56 gligar kernel: [745109.275184] init: neutron-plugin-openvswitch-agent main process (4705) killed by SEGV signal
| Oct 30 18:32:56 gligar kernel: [745109.776233] init: neutron-plugin-openvswitch-agent main process (88517) killed by SEGV signal
| Oct 30 18:32:57 gligar kernel: [745110.335622] init: neutron-plugin-openvswitch-agent main process (88527) killed by SEGV signal

| hloeung@patrat:~$ zgrep -h SEGV /var/log/syslog*
| Oct 27 08:18:29 patrat kernel: [508926.329315] init: neutron-plugin-openvswitch-agent main process (51113) killed by SEGV signal

I've disabled KSM as suggested. I'll try to get wgrant or cjwatson to trigger a full rebuild and get some load on these compute nodes.

Colin Watson (cjwatson) wrote :

A full rebuild is in progress at the moment, and nova-compute segfaulted yesterday, so it doesn't look as though disabling KSM was enough.

Chris J Arges (arges) wrote :

OK, at the moment it sounds like getting this to reproduce in a non-production environment would be helpful. That way we can run experiments such as trying a newer kernel, capturing core dumps properly, and running an instrumented kernel that will dump on _exception with code 30001. I'm trying to reproduce it on my end so I can instrument.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-meta-lts-vivid (Ubuntu):
status: New → Confirmed
Haw Loeung (hloeung) wrote :

Disabling KSM doesn't seem to have helped. Ryan (http://launchpad.net/~fo0bar) has been working on getting hwe-w installed on these compute nodes to see if a more recent kernel will help.

Chris J Arges (arges) wrote :

This is another bug that produces segvs on power8 (could be related, not sure yet):
https://bugzilla.redhat.com/show_bug.cgi?id=1180633

I was able to reproduce this trivially on a wily/4.2 ppc64el machine, and got the following output:
gcc -g test.c -O2 -o test -fgnu-tm -lpthread
ulimit -c unlimited
while ./test ; do :; done

Segmentation fault (core dumped)

[13861.517681] test[78415]: unhandled signal 11 at 0000000000000284 nip 000000001000ccc8 lr 000000001000d1c8 code 30001

(gdb) core ./core
[New LWP 78415]
[New LWP 78412]
[New LWP 78411]
[New LWP 78416]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `./test '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000001000ccc8 in GTM::gtm_thread::trycommit() ()
[Current thread is 1 (Thread 0x3fff90cef040 (LWP 78415))]
(gdb) bt
#0 0x000000001000ccc8 in GTM::gtm_thread::trycommit() ()
#1 0x00003fff90cee600 in ?? ()
#2 0x0000000000000000 in ?? ()
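
The "while ./test" loop above stops silently at the first failure. When comparing kernels it can be useful to know how many runs it took, so here is a minimal sketch of a variant that reports that. run_until_fail is a hypothetical helper, not something used in this report. A shell reports 128 plus the signal number when a command is killed by a signal, so a SIGSEGV (signal 11) shows up as exit status 139.

```shell
# Hypothetical helper (not from the original report): rerun a command until
# it fails or max_runs is reached, then report how many runs it took.
# Usage: run_until_fail <max_runs> <command> [args...]
run_until_fail() {
    max_runs=$1
    shift
    n=0
    while [ "$n" -lt "$max_runs" ]; do
        n=$((n + 1))
        "$@" || {
            status=$?
            echo "run $n failed with exit status $status"
            return "$status"
        }
    done
    echo "no failure in $max_runs runs"
}
```

For the reproducer above this would be invoked as: run_until_fail 1000 ./test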


That said, a generic crash like this could have many different causes. Since we hit it with many packages, a shared library or the compiler could be causing the issue.

For the reported bug, can I please have core files produced from these crashes along with the exact binary version of the program that crashes? This way I can get some analysis similar to the above and start debugging.

Thanks,
--chris

Chris J Arges (arges) wrote :

Another potentially related issue:
https://bugs.launchpad.net/ubuntu/+source/python-greenlet/+bug/1446974

I just tried the test case from comment #26 of that bug with -O0, and I could get a segfault.

Chris J Arges (arges)
tags: added: kernel-key
Chris J Arges (arges) wrote :

So just to reiterate, I'd like to see a core file. Can you run the following (as root) before you run your workload?
Change the directory to wherever you want to keep the core files.

echo "/home/ubuntu/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
ulimit -c unlimited

Then run the workload. If you observe SEGVs, find all core.* files, tar them up and attach them here. The exact package versions would also be helpful so I can get a backtrace.

If you can get this working with apport that's fine too, I just want to see some core files and backtraces.
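
In the core_pattern above, %e expands to the executable (comm) name, %p to the PID, %h to the hostname and %t to the dump time in seconds since the epoch. As a rough sketch for sorting through the resulting files, the helper below splits such a file name back into those fields. describe_core is a hypothetical name, not something from this report, and it assumes the hostname contains no dots (the executable name may).

```shell
# Hypothetical helper (not from the original report): split a core file name
# produced by the pattern core.%e.%p.%h.%t into its fields.
# Assumes the hostname contains no dots; the executable name may.
describe_core() {
    base=${1##*/}                        # drop the directory part
    rest=${base#core.}                   # drop the "core." prefix
    stamp=${rest##*.}; rest=${rest%.*}   # %t: dump time, seconds since epoch
    host=${rest##*.};  rest=${rest%.*}   # %h: hostname
    pid=${rest##*.};   exe=${rest%.*}    # %p: PID, %e: whatever remains
    echo "exe=$exe pid=$pid host=$host time=$stamp"
}
```

Running it over the collected files, for example with: for f in /home/ubuntu/core.*; do describe_core "$f"; done, gives one line per crash that can be matched against the PIDs in the "unhandled signal 11" kernel messages.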

Thanks,
--chris

tags: added: kernel-da-key
removed: kernel-key
James Page (james-page) wrote :

Marking the UCA bug task as Invalid. Is this still an ongoing issue? The bug has not been updated in 10 months, so I'm assuming either something got fixed or everyone has moved on to something else.

Changed in cloud-archive:
status: New → Invalid
Apport retracing service (apport) wrote : Crash report cannot be processed

Thank you for your report!

However, processing it in order to get sufficient information for the
developers failed, since the report is ill-formed. Perhaps the report data got
modified?

  need more than 1 value to unpack

If you encounter the crash again, please file a new report.

Thank you for your understanding, and sorry for the inconvenience!

tags: removed: need-ppc64el-retrace
Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody