crash (on ppc64) when restarting numad while huge guest is active

Bug #1836913 reported by Christian Ehrhardt on 2019-07-17
Affects                           Importance  Assigned to
The Ubuntu-power-systems project  Low         bugproxy
numad (Ubuntu)                    Low         bugproxy
numad (Ubuntu) Bionic             Undecided   Unassigned

Bug Description

While verifying bug 1832915 I found by accident that this crash (at least on our Power 9 box) seems to happen often.

Case:
- huge kvm guest running
- restart numad
=> Numad crashes.

Steps to recreate:
1. deploy a P9 Bionic (or later) system
2. install uvtool
   $ apt install uvtool-libvirt
3. log out & in to get permissions right
4. sync images
   $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=ppc64el label=daily release=eoan
5. install and manually start numad
   $ apt install numad
   $ systemctl start numad
6. spawn a guest
   $ uvt-kvm create --memory $((1024*64)) --cpu 64 --password ubuntu eoan arch=ppc64el release=eoan label=daily
7. restart numad
   $ systemctl restart numad

The crash seems related to some re-init of a static structure:

--- stack trace ---
#0 tcache_get (tc_idx=<optimized out>) at malloc.c:2950
        e = 0x9a5ddc1950
        e = <optimized out>
        __PRETTY_FUNCTION__ = "tcache_get"
#1 __GI___libc_malloc (bytes=16) at malloc.c:3058
        ar_ptr = <optimized out>
        victim = <optimized out>
        hook = <optimized out>
        tbytes = <optimized out>
        tc_idx = <optimized out>
        __PRETTY_FUNCTION__ = "__libc_malloc"
#2 0x0000009a300279a0 in ?? ()
No symbol table info available.
#3 0x0000009a3002cad8 in ?? ()
No symbol table info available.
#4 0x0000009a30023794 in ?? ()
No symbol table info available.
#5 0x00007a6150998278 in generic_start_main (main=0x9a30022a00, argc=<optimized out>, argv=0x7fffe93a7828, auxvec=0x7fffe93a7880, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
        self = 0x7a6150dc38d0
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {8465053667230565969, 134558384812288, 8465057470262718529, 0 <repeats 13 times>, 134558387008032, 0, 134558387008040, 662230455376, 0, 2449962883098869759, 0 <repeats 42 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7fffe93a7700, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = -382044416}}}
        not_first_call = <optimized out>
#6 0x00007a6150998484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
No locals.
#7 0x0000000000000000 in ?? ()
No symbol table info available.
--- source code stack trace ---
#0 tcache_get (tc_idx=<optimized out>) at malloc.c:2950
  [Error: malloc.c was not found in source tree]
#1 __GI___libc_malloc (bytes=16) at malloc.c:3058
  [Error: malloc.c was not found in source tree]
#2 0x0000009a300279a0 in ?? ()
#3 0x0000009a3002cad8 in ?? ()
#4 0x0000009a30023794 in ?? ()
#5 0x00007a6150998278 in generic_start_main (main=0x9a30022a00, argc=<optimized out>, argv=0x7fffe93a7828, auxvec=0x7fffe93a7880, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
  [Error: libc-start.c was not found in source tree]
#6 0x00007a6150998484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
  [Error: libc-start.c was not found in source tree]
#7 0x0000000000000000 in ?? ()

At first I thought this was related to my debug rebuilds, but it happens as-is with the version in the Ubuntu Archive.

It crashes in an alloc call for just 16 bytes, so this is clearly not OOM.
Needs to be analyzed more deeply ...

It seems that once this triggers, the system is in an overall bad state.
I have seen:
a) hung task warnings in dmesg
b) defunct processes, even unrelated ones, like:
   1 6 5529 4785 20 0 0 0 - Z+ pts/6 0:00 \_ [mandb] <defunct>
   1 6 5530 4785 20 0 0 0 - Z+ pts/6 0:00 \_ [mandb] <defunct>

Since this was found while doing an SRU verification, I ensured this happens with the pre-SRU version as well.

--- stack trace ---
#0 tcache_get (tc_idx=<optimized out>) at malloc.c:2950
        e = 0x532ebf51970
        e = <optimized out>
        __PRETTY_FUNCTION__ = "tcache_get"
#1 __GI___libc_malloc (bytes=16) at malloc.c:3058
        ar_ptr = <optimized out>
        victim = <optimized out>
        hook = <optimized out>
        tbytes = <optimized out>
        tc_idx = <optimized out>
        __PRETTY_FUNCTION__ = "__libc_malloc"
#2 0x000005328f5d80dc in ?? ()
No symbol table info available.
#3 0x000005328f5dd148 in ?? ()
No symbol table info available.
#4 0x000005328f5d34dc in ?? ()
No symbol table info available.
#5 0x0000704637e88278 in generic_start_main (main=0x5328f5d2780, argc=<optimized out>, argv=0x7ffff3468348, auxvec=0x7ffff34683a0, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
        self = 0x7046382b38d0
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-1091307981331522811, 123446890164480, -1091302031655606667, 0 <repeats 13 times>, 123446892360224, 0, 123446892360232, 5714711795632, 0, 2449962883098869759, 0 <repeats 42 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7ffff3468220, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = -213482976}}}
        not_first_call = <optimized out>
#6 0x0000704637e88484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
No locals.
#7 0x0000000000000000 in ?? ()
No symbol table info available.
--- source code stack trace ---
#0 tcache_get (tc_idx=<optimized out>) at malloc.c:2950
  [Error: malloc.c was not found in source tree]
#1 __GI___libc_malloc (bytes=16) at malloc.c:3058
  [Error: malloc.c was not found in source tree]
#2 0x000005328f5d80dc in ?? ()
#3 0x000005328f5dd148 in ?? ()
#4 0x000005328f5d34dc in ?? ()
#5 0x0000704637e88278 in generic_start_main (main=0x5328f5d2780, argc=<optimized out>, argv=0x7ffff3468348, auxvec=0x7ffff34683a0, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
  [Error: libc-start.c was not found in source tree]
#6 0x0000704637e88484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
  [Error: libc-start.c was not found in source tree]
#7 0x0000000000000000 in ?? ()

summary: - crash (on ppc64) hen restarting numad while huge guest is active
+ crash (on ppc64) when restarting numad while huge guest is active
description: updated

I was trying to recreate this on x86 with a 128G guest and 64 CPUs.
I see numad action:

Thu Jul 18 10:51:22 2019: Advising pid 13197 (qemu-system-x86) move from nodes (0-1) to nodes (1)
Thu Jul 18 10:51:23 2019: PID 13197 moved to node(s) 1 in 0.19 seconds

Running stressapptest [1] in host and guest for a while triggered more of those migrations, without crashes (as expected).

Restarting numad did not break it on this system.
A shutdown seems to trigger a re-evaluation and things then go on as usual:
Thu Jul 18 11:00:54 2019: Shutting down numad
Thu Jul 18 11:00:54 2019: Registering numad version 20150602 PID 15629
Thu Jul 18 11:01:01 2019: Advising pid 15500 (stressapptest) move from nodes (0-1) to nodes (0-1)
Thu Jul 18 11:01:01 2019: PID 15500 moved to node(s) 0-1 in 0.0 seconds
Thu Jul 18 11:01:06 2019: Advising pid 13197 (qemu-system-x86) move from nodes (0-1) to nodes (0-1)
Thu Jul 18 11:01:06 2019: PID 13197 moved to node(s) 0-1 in 0.0 seconds

So the assumption for now is that this is either ppc64el specific or even specific to our particular P9 (dradis).

Lowering importance as it does not seem to be a general issue.
I'll ping Frank to ask whether he wants to reverse-mirror this to IBM.

[1]: https://github.com/stressapptest/stressapptest/releases

Changed in numad (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
description: updated
Manoj Iyer (manjo) on 2019-07-18
Changed in numad (Ubuntu):
assignee: nobody → bugproxy (bugproxy)
bugproxy (bugproxy) on 2019-07-18
tags: added: architecture-ppc64le bugnameltc-179340 severity-low targetmilestone-inin---
Frank Heimes (fheimes) on 2019-07-19
Changed in ubuntu-power-systems:
status: New → Confirmed
assignee: nobody → bugproxy (bugproxy)
importance: Undecided → Medium
Frank Heimes (fheimes) on 2019-08-05
tags: added: universe

Now tested on a P8 machine as well; there (at least on ours) it didn't occur with other loads.
The VM-based load I used before was blocked by bug 1839065.

Frank Heimes (fheimes) on 2019-08-12
tags: added: reverse-proxy-bugzilla
Andrew Cloke (andrew-cloke) wrote :

Marking as "incomplete" while waiting for LP#1839065 to be resolved (with a f/w update).

Changed in ubuntu-power-systems:
status: Confirmed → Incomplete
Andrew Cloke (andrew-cloke) wrote :

Now that we have the firmware updates, changing back to "triaged".

Changed in ubuntu-power-systems:
status: Incomplete → Triaged

Rechecked, and things are still broken on platforms with non-linear NUMA zones.
But that isn't a bug in Ubuntu, it's a bug upstream.
Worth mentioning again: upstream on this project seems dead.

Since this specific issue only affects Power machines with their NUMA layout, it would IMHO be IBM's HW-enablement team that drives this effort more than any other involved entity.

We will keep numad as-is for now, as there are many other platforms on which it works (even some ppc64el systems), but the bug is "on IBM".

@Frank can you make sure it is clear that the action on this is on the IBM PPC people?

Once resolved, feel free to update this bug, which then opens the opportunity to reconsider SRUs for this and bug 1832915.

Changed in numad (Ubuntu Bionic):
status: New → Incomplete
Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Andrew Cloke (andrew-cloke) wrote :

Reclassifying as "low" to match "numad" classification.

Changed in ubuntu-power-systems:
importance: Medium → Low