Comment 0 for bug 1832915

bugproxy (bugproxy) wrote :

== Comment: #0 - SRIKANTH AITHAL <email address hidden> - 2019-02-20 23:42:23 ==
---Problem Description---
While running KVM guests, we observe numad crashing on the host.

Contact Information = <email address hidden>

---uname output---
Linux ltcgen6 4.15.0-1016-ibm-gt #18-Ubuntu SMP Thu Feb 7 16:58:31 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = witherspoon

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1. Check the status of numad; if it is stopped, start it.
2. Start a KVM guest.
3. Run some memory tests inside the guest.
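
The host-side steps above can be sketched as a small script. This is only an illustration under assumptions not stated in the report: the guest is taken to be managed by libvirt, and the domain name `guest1` is hypothetical. By default the script only prints the commands (dry run); set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the host-side reproduction steps.
# Assumptions: libvirt manages the guest; "guest1" is a hypothetical domain name.
GUEST="${GUEST:-guest1}"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce_host_side() {
    # Step 1: start numad unless it is already active.
    # In dry-run mode the status check is skipped and the start command is printed.
    if [ "${DRY_RUN:-1}" = "1" ] || ! systemctl is-active --quiet numad; then
        run systemctl start numad
    fi
    # Step 2: start the KVM guest.
    run virsh start "$GUEST"
    # Step 3 happens inside the guest, e.g.: stressapptest -s 200
}

reproduce_host_side
```

Step 3 is left as a comment because it runs inside the guest rather than on the host.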

After a few minutes, numad crashes on the host. I had enabled debug logging for numad, and numad.log shows the following messages just before the crash:

8870669: PID 88781: (qemu-system-ppc), Threads 6, MBs_size 15871, MBs_used 11262, CPUs_used 400, Magnitude 4504800, Nodes: 0,8
Thu Feb 21 00:12:10 2019: PICK NODES FOR: PID: 88781, CPUs 470, MBs 18671
Thu Feb 21 00:12:10 2019: PROCESS_MBs[0]: 9201
Thu Feb 21 00:12:10 2019: Node[0]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[1]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[2]: mem: 1878026 cpu: 4666
Thu Feb 21 00:12:10 2019: Node[3]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[4]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[5]: mem: 2194058 cpu: 4728
Thu Feb 21 00:12:10 2019: Totmag[0]: 94112134
Thu Feb 21 00:12:10 2019: Totmag[1]: 109211855
Thu Feb 21 00:12:10 2019: Totmag[2]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[3]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[4]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[5]: 2990058
Thu Feb 21 00:12:10 2019: best_node_ix: 1
Thu Feb 21 00:12:10 2019: Node: 8 Dist: 10 Magnitude: 10373506224
Thu Feb 21 00:12:10 2019: Node: 0 Dist: 40 Magnitude: 8762869316
Thu Feb 21 00:12:10 2019: Node: 253 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 254 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 252 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 255 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: MBs: 18671, CPUs: 470
Thu Feb 21 00:12:10 2019: Assigning resources from node 5
Thu Feb 21 00:12:10 2019: Node[0]: mem: 2007348 cpu: 1908
Thu Feb 21 00:12:10 2019: MBs: 0, CPUs: 0
Thu Feb 21 00:12:10 2019: Assigning resources from node 2
Thu Feb 21 00:12:10 2019: Process 88781 already 100 percent localized to target nodes.

In syslog we see signal 11 (SIGSEGV):
[88726.086144] numad[88879]: unhandled signal 11 at 000000e38fe72688 nip 0000782ce4dcac20 lr 0000782ce4dcf85c code 1
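
To compare several crashes, the fields of such a kernel message can be pulled apart with a small helper. This is a minimal sketch; the function name `parse_unhandled_signal` is made up for illustration, and the pattern only covers the message layout shown above.

```shell
#!/bin/sh
# Split a kernel "unhandled signal" line into labelled fields.
# parse_unhandled_signal is a hypothetical helper, not part of numad or the kernel.
parse_unhandled_signal() {
    # $1: one syslog line in the format shown in this report
    printf '%s\n' "$1" | sed -n \
        's/.*\] \([^[]*\)\[\([0-9]*\)\]: unhandled signal \([0-9]*\) at \([0-9a-f]*\) nip \([0-9a-f]*\) lr \([0-9a-f]*\) code \([0-9]*\).*/comm=\1 pid=\2 sig=\3 addr=\4 nip=\5 lr=\6 code=\7/p'
}

line='[88726.086144] numad[88879]: unhandled signal 11 at 000000e38fe72688 nip 0000782ce4dcac20 lr 0000782ce4dcf85c code 1'
parse_unhandled_signal "$line"
```

Here "addr" is the faulting address, "nip" the instruction pointer, and "lr" the link register. For SIGSEGV, code 1 corresponds to SEGV_MAPERR (address not mapped).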

Stack trace output:
 no

Oops output:
 no

System Dump Info:
  The system was configured to capture a dump, however a dump was not produced.

*Additional Instructions for <email address hidden>:
-Attach sysctl -a output to the bug.

== Comment: #2 - SRIKANTH AITHAL <email address hidden> - 2019-02-20 23:44:38 ==

== Comment: #3 - SRIKANTH AITHAL <email address hidden> - 2019-02-20 23:48:20 ==
I was using stressapptest to run the memory workload inside the guest:
`stressapptest -s 200`

== Comment: #5 - Brian J. King <email address hidden> - 2019-03-08 09:17:29 ==
Any update on this?

== Comment: #6 - Leonardo Bras Soares Passos <email address hidden> - 2019-03-08 11:59:16 ==
Yes, I have been working on this for a while.

Following a suggestion from @lagarcia, I tested the bug on the same machine booted with the default kernel (4.15.0-45-generic), with the VM also booted on the same generic kernel.
The bug also occurs with 4.15.0-45-generic, so it may not be caused by the changes included in kernel 4.15.0-1016.18-fix1-ibm-gt.

A few things I noticed that may help in solving this bug:
- I had a very hard time reproducing the bug with the numad instance started at boot. If I restart it, or stop and start it, the bug reproduces much more easily.
- I debugged numad with gdb and found that it segfaults in _int_malloc(), from glibc.

Attached is an occurrence of the bug, captured while numad was running under gdb.
(systemctl start numad ; gdb /usr/bin/numad $NUMAD_PID)
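
For an unattended capture, the same attach could be done in batch mode so that gdb resumes numad, waits for the crash, and prints a backtrace. This is a sketch only, not the exact session used for the attachment; the helper name `numad_gdb_cmd` is hypothetical, and it only builds the command line for a given PID rather than running it.

```shell
#!/bin/sh
# Build a non-interactive gdb command line for an already running numad.
# numad_gdb_cmd is a hypothetical helper; it prints the command instead of running it.
numad_gdb_cmd() {
    # $1: PID of the running numad process (must be numeric)
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: numad_gdb_cmd PID" >&2
            return 1
            ;;
    esac
    # -batch: exit after the -ex commands finish
    # -ex continue: resume the attached process until it stops (e.g. on SIGSEGV)
    # -ex bt: print the backtrace at the stop
    printf 'gdb -batch -ex continue -ex bt /usr/bin/numad %s\n' "$1"
}

# Example with the PID seen in the syslog line of this report:
numad_gdb_cmd 88879
```

On a live system the PID would come from something like `pidof numad`, and the printed command would be run as root with its output redirected to a file.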

== Comment: #7 - Leonardo Bras Soares Passos <email address hidden> - 2019-03-08 12:00:00 ==

== Comment: #8 - Leonardo Bras Soares Passos <email address hidden> - 2019-03-11 17:04:25 ==
I reverted the whole system to vanilla Ubuntu Bionic and booted the 4.15.0-45-generic kernel.
Linux ltcgen6 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:27:02 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Then I booted the guest, also on 4.15.0-45-generic.
Linux ubuntu 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:27:02 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

I tried to reproduce the error and was able to.
This probably means the bug was not introduced by the qemu/kernel changes and is present in the current Ubuntu repository.

The next step should be deeper debugging of numad to identify why it segfaults.