Bug #1832915 “numad crashes while running kvm guest” : Bugs : numad package : Ubuntu

Revision history for this message

bugproxy (bugproxy) wrote on 2019-06-14: sosreport on host

#1

sosreport on host Edit (9.4 MiB, application/x-xz)

Default Comment by Bridge

tags:

added: architecture-ppc64le bugnameltc-175673 severity-high targetmilestone-inin---

Revision history for this message

bugproxy (bugproxy) wrote on 2019-06-14: gdb output

#2

gdb output Edit (8.8 KiB, text/x-log)

Default Comment by Bridge

Changed in ubuntu:
assignee:	nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects:	ubuntu → numad (Ubuntu)

Manoj Iyer (manjo) on 2019-06-15

Changed in ubuntu-power-systems:
assignee:	nobody → Manoj Iyer (manjo)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#3

On a fresh Bionic running with the latest 4.15.0-51-generic I did the following trying to reproduce this issues.
Note: My Host has 128G mem and 40 cores (SMT off)

1. installed numad
2. started the numad service and verified it runs fine
3. I spawned two Guests with 20 cores and 50G each (since there was no particular guest config mentioned I didn't configure anything special)
   I used uvtool to get the latest cloud image
4. cloned stressapptest from git [1] in the guests
   and installed build-essential
   (my guetss are Bionic and that didn't have stressapptest packaged yet)
   Built and installed the tool
5. ran the stress in both guests as mentioned
     $ stressapptest -s 200

Well actually I was just about to start that load (not yet happened) when I realized my numad process has already died:

● numad.service - numad - The NUMA daemon that manages application locality.
   Loaded: loaded (/lib/systemd/system/numad.service; enabled; vendor preset: enabled)
   Active: failed (Result: core-dump) since Mon 2019-06-17 06:12:31 UTC; 2min 23s ago
     Docs: man:numad
  Process: 119546 ExecStart=/usr/bin/numad $DAEMON_ARGS -i 15 (code=exited, status=0/SUCCESS)
Main PID: 119547 (code=dumped, signal=SEGV)

Jun 17 06:00:28 dradis systemd[1]: Starting numad - The NUMA daemon that manages application locality....
Jun 17 06:00:28 dradis systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 17 06:12:31 dradis systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Jun 17 06:12:31 dradis systemd[1]: numad.service: Failed with result 'core-dump'.

So the mem-stress load might help to trigger it, but isn't necessarily required.
After restarting the numad daemon I started the guest load and got the crash again.

While I have no idea yet what exactly is going on lets set this to confirmed at least.

[1]: https://github.com/stressapptest/stressapptest

Initially had PID 119547 and no odd entries in the log.

On a fresh Bionic running with the latest 4.15.0-51-generic I did the following trying to reproduce this issues.
Note: My Host has 128G mem and 40 cores (SMT off)

1. installed numad
2. started the numad service and verified it runs fine
3. I spawned two Guests with 20 cores and 50G each (since there was no particular guest config mentioned I didn't configure anything special)
   I used uvtool to get the latest cloud image 
4. cloned stressapptest from git [1] in the guests
   and installed build-essential
   (my guetss are Bionic and that didn't have stressapptest packaged yet)
   Built and installed the tool
5. ran the stress in both guests as mentioned
     $ stressapptest -s 200

Well actually I was just about to start that load (not yet happened) when I realized my numad process has already died:

● numad.service - numad - The NUMA daemon that manages application locality.
   Loaded: loaded (/lib/systemd/system/numad.service; enabled; vendor preset: enabled)
   Active: failed (Result: core-dump) since Mon 2019-06-17 06:12:31 UTC; 2min 23s ago
     Docs: man:numad
  Process: 119546 ExecStart=/usr/bin/numad $DAEMON_ARGS -i 15 (code=exited, status=0/SUCCESS)
 Main PID: 119547 (code=dumped, signal=SEGV)

Jun 17 06:00:28 dradis systemd[1]: Starting numad - The NUMA daemon that manages application locality....
Jun 17 06:00:28 dradis systemd[1]: Started numad - The NUMA daemon that manages application locality..
Jun 17 06:12:31 dradis systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Jun 17 06:12:31 dradis systemd[1]: numad.service: Failed with result 'core-dump'.

So the mem-stress load might help to trigger it, but isn't necessarily required.
After restarting the numad daemon I started the guest load and got the crash again.

While I have no idea yet what exactly is going on lets set this to confirmed at least.

[1]: https://github.com/stressapptest/stressapptest

Initially had PID 119547 and no odd entries in the log.

Changed in numad (Ubuntu):
status:	New → Confirmed

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#4

Download full text (4.9 KiB)

With verbose my numad log file is:

Mon Jun 17 06:22:53 2019: Nodes: 2
Min CPUs free: 1416, Max CPUs: 1423, Avg CPUs: 1419, StdDev: 3.53553
Min MBs free: 12869, Max MBs: 13756, Avg MBs: 13312, StdDev: 443.5
Node 0: MBs_total 65266, MBs_free 12869, CPUs_total 2000, CPUs_free 1416, Distance: 10 40 CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free 13756, CPUs_total 2000, CPUs_free 1423, Distance: 40 10 CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:22:53 2019: Processes: 1563
Mon Jun 17 06:22:53 2019: Candidates: 2
101867853: PID 120072: (qemu-system-ppc), Threads 23, MBs_size 55763, MBs_used 50509, CPUs_used 876, Magnitude 44245884, Nodes: 0,8
101867853: PID 120206: (qemu-system-ppc), Threads 23, MBs_size 55821, MBs_used 23699, CPUs_used 279, Magnitude 6612021, Nodes: 0,8
Mon Jun 17 06:22:53 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

With debug the dying message looked like:

Another run #2:
Mon Jun 17 06:25:08 2019: Nodes: 2
Min CPUs free: 302, Max CPUs: 439, Avg CPUs: 370, StdDev: 68.5018
Min MBs free: 1597, Max MBs: 4548, Avg MBs: 3072, StdDev: 1475.5
Node 0: MBs_total 65266, MBs_free 1597, CPUs_total 2000, CPUs_free 302, Distance: 10 40 CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free 4548, CPUs_total 2000, CPUs_free 439, Distance: 40 10 CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:25:08 2019: Processes: 1572
Mon Jun 17 06:25:08 2019: Candidates: 2
101881395: PID 120072: (qemu-system-ppc), Threads 25, MBs_size 55763, MBs_used 50523, CPUs_used 1995, Magnitude 100793385, Nodes: 0,8
101881395: PID 120206: (qemu-system-ppc), Threads 25, MBs_size 55821, MBs_used 45916, CPUs_used 830, Magnitude 38110280, Nodes: 0,8
Mon Jun 17 06:25:08 2019: PICK NODES FOR: PID: 120072, CPUs 2347, MBs 59438
Mon Jun 17 06:25:08 2019: PROCESS_MBs[0]: 17481
Mon Jun 17 06:25:08 2019: Node[0]: mem: 201700 cpu: 5952
Mon Jun 17 06:25:08 2019: Node[1]: mem: 45480 cpu: 2634
Mon Jun 17 06:25:08 2019: Totmag[0]: 12080055
Mon Jun 17 06:25:08 2019: Totmag[1]: 1948267
Mon Jun 17 06:25:08 2019: best_node_ix: 0
Mon Jun 17 06:25:08 2019: Node: 0 Dist: 10 Magnitude: 1200518400
Mon Jun 17 06:25:08 2019: Node: 8 Dist: 40 Magnitude: 119794320
Mon Jun 17 06:25:08 2019: MBs: 59438, CPUs: 2347
Mon Jun 17 06:25:08 2019: Assigning resources from node 0
Mon Jun 17 06:25:08 2019: Node[0]: mem: 1000 cpu: 0
Mon Jun 17 06:25:08 2019: MBs: 39368, CPUs: 1355
Mon Jun 17 06:25:08 2019: Assigning resources from node 1
Mon Jun 17 06:25:08 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

Another run #3:
Mon Jun 17 06:26:46 2019: Nodes: 2
Min CPUs free: 889, Max CPUs: 1048, Avg CPUs: 968, StdDev: 79.5016
Min MBs free: 1291, Max MBs: 3484, Avg MBs: 2387, StdDev: 1096.5
Node 0: MBs_total 65266, MBs_free 1291, CPUs_total 2000, CPUs_free 889, Distance: 10 40 CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free 3484, CPUs_total 2000, CPUs_free 104...

With verbose my numad log file is:

Mon Jun 17 06:22:53 2019: Nodes: 2
Min CPUs free: 1416, Max CPUs: 1423, Avg CPUs: 1419, StdDev: 3.53553
Min MBs free: 12869, Max MBs: 13756, Avg MBs: 13312, StdDev: 443.5
Node 0: MBs_total 65266, MBs_free  12869, CPUs_total 2000, CPUs_free 1416,  Distance: 10 40  CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free  13756, CPUs_total 2000, CPUs_free 1423,  Distance: 40 10  CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:22:53 2019: Processes: 1563
Mon Jun 17 06:22:53 2019: Candidates: 2
101867853: PID 120072: (qemu-system-ppc), Threads 23, MBs_size  55763, MBs_used  50509, CPUs_used  876, Magnitude 44245884, Nodes: 0,8
101867853: PID 120206: (qemu-system-ppc), Threads 23, MBs_size  55821, MBs_used  23699, CPUs_used  279, Magnitude 6612021, Nodes: 0,8
Mon Jun 17 06:22:53 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

With debug the dying message looked like:

Another run #2:
Mon Jun 17 06:25:08 2019: Nodes: 2
Min CPUs free: 302, Max CPUs: 439, Avg CPUs: 370, StdDev: 68.5018
Min MBs free: 1597, Max MBs: 4548, Avg MBs: 3072, StdDev: 1475.5
Node 0: MBs_total 65266, MBs_free   1597, CPUs_total 2000, CPUs_free  302,  Distance: 10 40  CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free   4548, CPUs_total 2000, CPUs_free  439,  Distance: 40 10  CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:25:08 2019: Processes: 1572
Mon Jun 17 06:25:08 2019: Candidates: 2
101881395: PID 120072: (qemu-system-ppc), Threads 25, MBs_size  55763, MBs_used  50523, CPUs_used 1995, Magnitude 100793385, Nodes: 0,8
101881395: PID 120206: (qemu-system-ppc), Threads 25, MBs_size  55821, MBs_used  45916, CPUs_used  830, Magnitude 38110280, Nodes: 0,8
Mon Jun 17 06:25:08 2019: PICK NODES FOR:  PID: 120072,  CPUs 2347,  MBs 59438
Mon Jun 17 06:25:08 2019: PROCESS_MBs[0]: 17481
Mon Jun 17 06:25:08 2019:     Node[0]: mem: 201700  cpu: 5952
Mon Jun 17 06:25:08 2019:     Node[1]: mem: 45480  cpu: 2634
Mon Jun 17 06:25:08 2019: Totmag[0]: 12080055
Mon Jun 17 06:25:08 2019: Totmag[1]: 1948267
Mon Jun 17 06:25:08 2019: best_node_ix: 0
Mon Jun 17 06:25:08 2019: Node: 0  Dist: 10  Magnitude: 1200518400
Mon Jun 17 06:25:08 2019: Node: 8  Dist: 40  Magnitude: 119794320
Mon Jun 17 06:25:08 2019: MBs: 59438,  CPUs: 2347
Mon Jun 17 06:25:08 2019: Assigning resources from node 0
Mon Jun 17 06:25:08 2019:     Node[0]: mem: 1000  cpu: 0
Mon Jun 17 06:25:08 2019: MBs: 39368,  CPUs: 1355
Mon Jun 17 06:25:08 2019: Assigning resources from node 1
Mon Jun 17 06:25:08 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

Another run #3:
Mon Jun 17 06:26:46 2019: Nodes: 2
Min CPUs free: 889, Max CPUs: 1048, Avg CPUs: 968, StdDev: 79.5016
Min MBs free: 1291, Max MBs: 3484, Avg MBs: 2387, StdDev: 1096.5
Node 0: MBs_total 65266, MBs_free   1291, CPUs_total 2000, CPUs_free  889,  Distance: 10 40  CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free   3484, CPUs_total 2000, CPUs_free 1048,  Distance: 40 10  CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:26:46 2019: Processes: 1546
Mon Jun 17 06:26:46 2019: Candidates: 2
101891156: PID 120072: (qemu-system-ppc), Threads 23, MBs_size  55763, MBs_used  50593, CPUs_used 1437, Magnitude 72702141, Nodes: 0,8
101891156: PID 120206: (qemu-system-ppc), Threads 23, MBs_size  55821, MBs_used  48065, CPUs_used  613, Magnitude 29463845, Nodes: 0,8
Mon Jun 17 06:26:46 2019: PICK NODES FOR:  PID: 120072,  CPUs 1690,  MBs 59521
Mon Jun 17 06:26:46 2019: PROCESS_MBs[0]: 17527
Mon Jun 17 06:26:46 2019:     Node[0]: mem: 199130  cpu: 8316
Mon Jun 17 06:26:46 2019:     Node[1]: mem: 34840  cpu: 6288
Mon Jun 17 06:26:46 2019: Totmag[0]: 16559650
Mon Jun 17 06:26:46 2019: Totmag[1]: 2190739
Mon Jun 17 06:26:46 2019: best_node_ix: 0
Mon Jun 17 06:26:46 2019: Node: 0  Dist: 10  Magnitude: 1655965080
Mon Jun 17 06:26:46 2019: Node: 8  Dist: 40  Magnitude: 219073920
Mon Jun 17 06:26:46 2019: MBs: 59521,  CPUs: 1690
Mon Jun 17 06:26:46 2019: Assigning resources from node 0
Mon Jun 17 06:26:46 2019:     Node[0]: mem: 1000  cpu: 0
Mon Jun 17 06:26:46 2019: MBs: 39708,  CPUs: 304
Mon Jun 17 06:26:46 2019: Assigning resources from node 1
Mon Jun 17 06:26:46 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

Your crash was around:
Thu Feb 21 00:12:10 2019: Assigning resources from node 5
Thu Feb 21 00:12:10 2019: Assigning resources from node 2
Thu Feb 21 00:12:10 2019: Process 88781 already 100 percent localized to target nodes.

Mine seems to be as soon as it hits "Assigning resources" as well.
This is something the daemon will do anyway, but obviously more often with actual memory load.
So far all fits together, lets try to find what it accesses when failing.

Frank Heimes (fheimes) on 2019-06-17

Changed in ubuntu-power-systems:
status:	New → Confirmed

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#5

# get debug symbols and gdb
$ sudo apt install numad-dbgsym gdb dpkg-dev
# get source as used in the package
$ apt source numad
# I found that we will also need glibc source, so:
$ apt source glibc

It helps to add paths to gdb
(gdb) directory /home/ubuntu/numad-0.5+20150602:/home/ubuntu/glibc-2.27:/home/ubuntu/glibc-2.27/malloc

I found these backtraces:

Thread 1 "numad" received signal SIGSEGV, Segmentation fault.
tcache_get (tc_idx=<optimized out>) at malloc.c:2943
2943 malloc.c: No such file or directory.
(gdb) bt
#0 tcache_get (tc_idx=<optimized out>) at malloc.c:2943
#1 __GI___libc_malloc (bytes=16) at malloc.c:3050
#2 0x00000d9b7ec780dc in bind_process_and_migrate_memory (p=0xd9b843b0f70) at numad.c:993
#3 0x00000d9b7ec7d148 in manage_loads () at numad.c:2225
#4 0x00000d9b7ec734dc in main (argc=<optimized out>, argv=<optimized out>) at numad.c:2654

(gdb) bt
#0 0x00000fb6cd2779f4 in bind_process_and_migrate_memory (p=0xfb6fc1e0f70) at numad.c:998
#1 0x00000fb6cd27d148 in manage_loads () at numad.c:2225
#2 0x00000fb6cd2734dc in main (argc=<optimized out>, argv=<optimized out>) at numad.c:2654

(gdb) bt
#0 0x000001c757da79f4 in bind_process_and_migrate_memory (p=0x1c758a60f70) at numad.c:998
#1 0x000001c757dad148 in manage_loads () at numad.c:2225
#2 0x000001c757da34dc in main (argc=<optimized out>, argv=<optimized out>) at numad.c:2654

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#6

One fail was at:
CLEAR_CPU_LIST(cpu_bind_list_p);
The next two at:
OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[node_id].cpu_list_p);

The common denominator here is cpu_bind_list_p but that is a static local:
static id_list_p cpu_bind_list_p;

The function is defined as:
#define OR_LISTS( or_list_p, list_1_p, list_2_p) CPU_OR_S( or_list_p->bytes, or_list_p->set_p, list_1_p->set_p, list_2_p->set_p)

That translates into
CPU_OR_S( cpu_bind_list_p->bytes, cpu_bind_list_p->set_p, cpu_bind_list_p->set_p, node[node_id].cpu_list_p->set_p)

CPU_OR_S is from sched.h and will make it to:
- operate on the dynamically allocated CPU set(s) whose size is setsize bytes. (due to _S)
- Store the union of the sets cpu_bind_list_p->set_p and node[node_id].cpu_list_p->set_p in destset
- explicitly says dest "may be one of the source sets"

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#7

(gdb) p cpu_bind_list_p->bytes
$5 = 24
(gdb) p *(cpu_bind_list_p->set_p)
$7 = {__bits = {1229782938247303441, 4369, 0, 49, 1955697441360, 274, 303148778372988952, 139284342967816, 48, 337, 1955697440432, 1955697400112, 1955697400080, 1955697400048,
1955697400016, 1955697400512}}
(gdb) p sizeof(*(cpu_bind_list_p->set_p))
$8 = 128

See the size mismatch?
It will allocate 24 bytes and needs 128.

I think this is a bad intialization.
We have these operations in the code on cpu_bind_list_p.

static id_list_p cpu_bind_list_p;
CLEAR_CPU_LIST(cpu_bind_list_p);
OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[node_id].cpu_list_p);

Now CLEAR_CPU_LIST has some init code, but only if == NULL.
#define CLEAR_CPU_LIST(list_p) \
    if (list_p == NULL) { \
        INIT_ID_LIST(list_p, num_cpus); \
    } \
    CPU_ZERO_S(list_p->bytes, list_p->set_p)

Since we can't rely on data in that static var "by accident" it might have stale old data.

Note: The other chance of errors is the 40 active CPUs vs the 160 potential CPUs (SMT off) that I have in my system.
The size is from num_cpu - if that detection is off then it might fail as well.
But at least in all my crashes that was ok.
(gdb) p num_cpus
$10 = 160

So lets assume it is the lack of (re)initialization for now.
Other structures of type "id_list_p" are all initialized with NULL btw.
Like:
  id_list_p all_cpus_list_p = NULL;
  id_list_p all_nodes_list_p = NULL;
  id_list_p reserved_cpu_mask_list_p = NULL;

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#8

Note:
- numad is "only" in universe in all releases
- nothing depends on it
- it is on 0.5+20150602-5 which seems rather old
- But upstream commits [1] since 2015 are minimal

TL;DR no (somewhat dead) upstream fix for that available to cherry-pick

Hmm, the above might have been a red herring.
I get 24 bytes size even in cases where the initial value was null.

Maybe the pure size of the set_p isn't waht matters.
In any case lets do a non-optimized build to get around debugging issues like:

(gdb) p node[node_id].cpu_list_pnode[node_id].cpu_list_p->bytes
value has been optimized out

Also I realized that this is a static var not only to be local but in the common sense to keep content across function calls. So initializing on init will be wrong :-)

[1] https://pagure.io/numad

Frank Heimes (fheimes) on 2019-06-17

tags:

added: universe

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#9

While I built a proper PPA in [1] this seems so trivial that we can rebuild locally with just

$ cc -std=gnu99 -I. -D__thread="" -c -o numad.o numad.c
$ cc numad.o -lpthread -lrt -lm -o numad
$ mv numad /usr/bin/numad

That should allow quick iterations.

With debug enabled I found that the second set is actually null.
So the red herring assumption above was correct.

996 while (nodes) {
997 if (ID_IS_IN_LIST(node_id, p->node_list_p)) {
998 OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[node_id].cpu_list_p);
(gdb) p node[node_id].cpu_list_p
$4 = (id_list_p) 0x0

The arg is
(gdb) p *(p->node_list_p)
$7 = {set_p = 0x304a3a418d0, bytes = 8}

This delivers "1"
int nodes = NUM_IDS_IN_LIST(p->node_list_p);
(gdb) p nodes
$5 = 1

That is:
#define NUM_IDS_IN_LIST(list_p) CPU_COUNT_S(list_p->bytes, list_p->set_p)

Per [2] this counts the cpus in the cpu_set.

So the TL;DR of this loop
while (nodes) {
nodes -= 1;
is that it iterates over all CPUs

On the each iteration it checks
if (ID_IS_IN_LIST(node_id, p->node_list_p)) {

node_id starts at zero and is incremented each iteration.

I must admit the usage of the term "node" for cpus here is very misleading.

"node" is a global data structure

typedef struct node_data {
    uint64_t node_id;
    uint64_t MBs_total;
    uint64_t MBs_free;
    uint64_t CPUs_total; // scaled * ONE_HUNDRED
    uint64_t CPUs_free; // scaled * ONE_HUNDRED
    uint64_t magnitude; // hack: MBs * CPUs
    uint8_t *distance;
    id_list_p cpu_list_p;
} node_data_t, *node_data_p;
node_data_p node = NULL;

Due to the misperception of "node" actually being CPUs the indexing here is off IMHO.

(gdb) p node[0]
$13 = {node_id = 0, MBs_total = 65266, MBs_free = 1510, CPUs_total = 2000, CPUs_free = 1144, magnitude = 1727440, distance = 0x304a3a41850 "\n(\032\n\244~", cpu_list_p = 0x304a3a41810}
(gdb) p node[1]
$14 = {node_id = 8, MBs_total = 65337, MBs_free = 1734, CPUs_total = 2000, CPUs_free = 1049, magnitude = 1818966, distance = 0x304a3a418b0 "(\n\032\n\244~", cpu_list_p = 0x304a3a41870}

My CPUs are 0,4,8,... and so is the indexing here as despite the node name it is actually based on CPUs.

Summary:
- The code checks for each CPU as counted by NUM_IDS_IN_LIST
- It will increase the ID until it found a hit in ID_IS_IN_LIST(node_id, p->node_list_p)
- that will skip empty CPUs as in my SMT case
- Once it found a cpu that is in the set it will OR_LISTS
node[node_id].cpu_list_p

While I built a proper PPA in [1] this seems so trivial that we can rebuild locally with just

$ cc -std=gnu99  -I. -D__thread=""   -c -o numad.o numad.c
$ cc   numad.o  -lpthread -lrt -lm -o numad
$ mv numad /usr/bin/numad

That should allow quick iterations.

With debug enabled I found that the second set is actually null.
So the red herring assumption above was correct.

996         while (nodes) {
997             if (ID_IS_IN_LIST(node_id, p->node_list_p)) {
998                 OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[node_id].cpu_list_p);
(gdb) p node[node_id].cpu_list_p
$4 = (id_list_p) 0x0

The arg is
(gdb) p *(p->node_list_p)
$7 = {set_p = 0x304a3a418d0, bytes = 8}

This delivers "1"
int nodes = NUM_IDS_IN_LIST(p->node_list_p);
(gdb) p nodes
$5 = 1

That is:
#define NUM_IDS_IN_LIST(list_p)     CPU_COUNT_S(list_p->bytes, list_p->set_p)

Per [2] this counts the cpus in the cpu_set.

So the TL;DR of this loop
while (nodes) {
  nodes -= 1;
is that it iterates over all CPUs

On the each iteration it checks
  if (ID_IS_IN_LIST(node_id, p->node_list_p)) {

node_id starts at zero and is incremented each iteration.

I must admit the usage of the term "node" for cpus here is very misleading.

"node" is a global data structure

typedef struct node_data {
    uint64_t node_id;
    uint64_t MBs_total;
    uint64_t MBs_free;
    uint64_t CPUs_total; // scaled * ONE_HUNDRED
    uint64_t CPUs_free;  // scaled * ONE_HUNDRED
    uint64_t magnitude;  // hack: MBs * CPUs
    uint8_t *distance;
    id_list_p cpu_list_p;  
} node_data_t, *node_data_p; 
node_data_p node = NULL;

Due to the misperception of "node" actually being CPUs the indexing here is off IMHO.

(gdb) p node[0]
$13 = {node_id = 0, MBs_total = 65266, MBs_free = 1510, CPUs_total = 2000, CPUs_free = 1144, magnitude = 1727440, distance = 0x304a3a41850 "\n(\032\n\244~", cpu_list_p = 0x304a3a41810}
(gdb) p node[1]
$14 = {node_id = 8, MBs_total = 65337, MBs_free = 1734, CPUs_total = 2000, CPUs_free = 1049, magnitude = 1818966, distance = 0x304a3a418b0 "(\n\032\n\244~", cpu_list_p = 0x304a3a41870}

My CPUs are 0,4,8,... and so is the indexing here as despite the node name it is actually based on CPUs.

Summary:
- The code checks for each CPU as counted by NUM_IDS_IN_LIST
- It will increase the ID until it found a hit in ID_IS_IN_LIST(node_id, p->node_list_p)
- that will skip empty CPUs as in my SMT case
- Once it found a cpu that is in the set it will OR_LISTS
  node[node_id].cpu_list_p

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#10

[1]: https://launchpad.net/~paelzer/+archive/ubuntu/bug-1832915-numad-debugging
[2]: https://linux.die.net/man/3/cpu_count_s

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#11

The problem is that node[node_id].cpu_list_p is wrong.
When you look at the array again it has two real entries and nothing more:

(gdb) p node[0]
$20 = {node_id = 0, MBs_total = 65266, MBs_free = 1510, CPUs_total = 2000, CPUs_free = 1144, magnitude = 1727440, distance = 0x304a3a41850 "\n(\032\n\244~", cpu_list_p = 0x304a3a41810}
(gdb) p node[1]
$21 = {node_id = 8, MBs_total = 65337, MBs_free = 1734, CPUs_total = 2000, CPUs_free = 1049, magnitude = 1818966, distance = 0x304a3a418b0 "(\n\032\n\244~", cpu_list_p = 0x304a3a41870}
(gdb) p node[2]
$22 = {node_id = 1820693536, MBs_total = 33, MBs_free = 3318460192688, CPUs_total = 24, CPUs_free = 1839495593, magnitude = 33,
  distance = 0x1111111111111111 <error: Cannot access memory at address 0x1111111111111111>, cpu_list_p = 0x1111111111111111}
(gdb) p node[3]
$23 = {node_id = 286331153, MBs_total = 33, MBs_free = 3318460192752, CPUs_total = 8, CPUs_free = 1842299472, magnitude = 33,
  distance = 0x101 <error: Cannot access memory at address 0x101>, cpu_list_p = 0x7ea40a1a0e08 <main_arena+96>}
(gdb) p node[4]
$24 = {node_id = 1867659328, MBs_total = 33, MBs_free = 3318460192816, CPUs_total = 24, CPUs_free = 1882184320, magnitude = 33,
  distance = 0x1111111111111111 <error: Cannot access memory at address 0x1111111111111111>, cpu_list_p = 0x1111}
(gdb) p node[5]
$25 = {node_id = 0, MBs_total = 33, MBs_free = 139243009222666, CPUs_total = 139243009216008, CPUs_free = 1898775676, magnitude = 33, distance = 0x304a3a41890 "", cpu_list_p = 0x18}
(gdb) p node[6]
$26 = {node_id = 3689421645304561696, MBs_total = 33, MBs_free = 0, CPUs_total = 1229782938247299072, CPUs_free = 286331153, magnitude = 33,
  distance = 0x7ea40a1a0a28 <_IO_wide_data_2+264> "", cpu_list_p = 0x7ea40a1a0e08 <main_arena+96>}
(gdb) p node[7]
$27 = {node_id = 3546150882158837792, MBs_total = 33, MBs_free = 257, CPUs_total = 267, CPUs_free = 288230377091498008, magnitude = 33, distance = 0x304a3a41910 "\001\001",
  cpu_list_p = 0x8}
(gdb) p node[8]
$28 = {node_id = 303211223003168792, MBs_total = 33, MBs_free = 257, CPUs_total = 265, CPUs_free = 288230377024389144, magnitude = 33,
  distance = 0x2f69 <error: Cannot access memory at address 0x2f69>, cpu_list_p = 0x0}

We essentially do an out of bounds to the array at index [8] where cpu_list_p = 0x0 and that triggers the SEGV

We actually do NOT want node[node_id]

Instead we'd need to iterate the node array entries, and pick that entry which has nodes[x].node_id == node_id.

The problem is that node[node_id].cpu_list_p is wrong.
When you look at the array again it has two real entries and nothing more:

(gdb) p node[0]
$20 = {node_id = 0, MBs_total = 65266, MBs_free = 1510, CPUs_total = 2000, CPUs_free = 1144, magnitude = 1727440, distance = 0x304a3a41850 "\n(\032\n\244~", cpu_list_p = 0x304a3a41810}
(gdb) p node[1]
$21 = {node_id = 8, MBs_total = 65337, MBs_free = 1734, CPUs_total = 2000, CPUs_free = 1049, magnitude = 1818966, distance = 0x304a3a418b0 "(\n\032\n\244~", cpu_list_p = 0x304a3a41870}
(gdb) p node[2]
$22 = {node_id = 1820693536, MBs_total = 33, MBs_free = 3318460192688, CPUs_total = 24, CPUs_free = 1839495593, magnitude = 33, 
  distance = 0x1111111111111111 <error: Cannot access memory at address 0x1111111111111111>, cpu_list_p = 0x1111111111111111}
(gdb) p node[3]
$23 = {node_id = 286331153, MBs_total = 33, MBs_free = 3318460192752, CPUs_total = 8, CPUs_free = 1842299472, magnitude = 33, 
  distance = 0x101 <error: Cannot access memory at address 0x101>, cpu_list_p = 0x7ea40a1a0e08 <main_arena+96>}
(gdb) p node[4]
$24 = {node_id = 1867659328, MBs_total = 33, MBs_free = 3318460192816, CPUs_total = 24, CPUs_free = 1882184320, magnitude = 33, 
  distance = 0x1111111111111111 <error: Cannot access memory at address 0x1111111111111111>, cpu_list_p = 0x1111}
(gdb) p node[5]
$25 = {node_id = 0, MBs_total = 33, MBs_free = 139243009222666, CPUs_total = 139243009216008, CPUs_free = 1898775676, magnitude = 33, distance = 0x304a3a41890 "", cpu_list_p = 0x18}
(gdb) p node[6]
$26 = {node_id = 3689421645304561696, MBs_total = 33, MBs_free = 0, CPUs_total = 1229782938247299072, CPUs_free = 286331153, magnitude = 33, 
  distance = 0x7ea40a1a0a28 <_IO_wide_data_2+264> "", cpu_list_p = 0x7ea40a1a0e08 <main_arena+96>}
(gdb) p node[7]
$27 = {node_id = 3546150882158837792, MBs_total = 33, MBs_free = 257, CPUs_total = 267, CPUs_free = 288230377091498008, magnitude = 33, distance = 0x304a3a41910 "\001\001", 
  cpu_list_p = 0x8}
(gdb) p node[8]
$28 = {node_id = 303211223003168792, MBs_total = 33, MBs_free = 257, CPUs_total = 265, CPUs_free = 288230377024389144, magnitude = 33, 
  distance = 0x2f69 <error: Cannot access memory at address 0x2f69>, cpu_list_p = 0x0}

We essentially do an out of bounds to the array at index [8] where cpu_list_p = 0x0 and that triggers the SEGV

We actually do NOT want node[node_id]

Instead we'd need to iterate the node array entries, and pick that entry which has nodes[x].node_id == node_id.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#12

Chances are that without the odd SMT=off numbering on ppc things would work.
That might explain why this didn't fail more often or on other architectures so far.

But disabling subset of CPUs is allowed, so this needs to be fixed for all - no matter how "often" an issue occurs on one of the architectures.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#13

Rebuild via:
rm numad
cc -g -O0 -fstack-protector-strong -std=gnu99 -I. -D__thread="" -Wdate-time -D_FORTIFY_SOURCE=2 -c -o numad.o numad.c cc -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now numad.o -lpthread -lrt -lm -o numad
ls -laF numad
sudo mv numad /usr/bin/numad

My current config triggering this has a pretty common CPU list on ppc64el:

CPU(s): 160
On-line CPU(s) list: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Off-line CPU(s) list: 1-3,5-7,9-11,13-15,17-19,21-23,25-27,29-31,33-35,37-39,41-43,45-47,49-51,53-55,57-59,61-63,65-67,69-71,73-75,77-79,81-83,85-87,89-91,93-95,97-99,101-103,105-107,109-111,113-115,117-119,121-123,125-127,129-131,133-135,137-139,141-143,145-147,149-151,153-155,157-159

The assumption seems to be correct, it was due to that cpu/node mismatch assuming always linear CPUs with cpu-number == index-in-array.

With the following change the breakage no more happens in my setup:

--- numad.c.orig 2019-06-17 09:27:49.783712059 +0000
+++ numad.c 2019-06-17 10:11:00.619113441 +0000
@@ -995,7 +995,18 @@
     int node_id = 0;
     while (nodes) {
         if (ID_IS_IN_LIST(node_id, p->node_list_p)) {
- OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[node_id].cpu_list_p);
+ int id = -1;
+ for (int node_ix = 0; (node_ix < num_nodes); node_ix++) {
+ if (node[node_ix].node_id == node_id) {
+ id = node_ix;
+ break;
+ }
+ }
+ if (id == -1) {
+ numad_log(LOG_CRIT, "Node %d is requested, but unknown\n", node_id);
+ exit(EXIT_FAILURE);
+ }
+ OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[id].cpu_list_p);
             nodes -= 1;
         }
         node_id += 1;

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#14

While this numbering is pretty common at power (all non SMT systems) and s390x (scaling #cpus on load) it is uncommon on x86. Never the less in theory the issue should exist there as well
But I tried this for an hour and it didn't trigger (plenty of assigns happened)

Repro (x86)
1. Get a KVM guest with numa memory nodes
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cpu>
    <numa>
      <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>
    </numa>
  </cpu>
2. disable some cpus in the mid
  $ echo 0 | sudo tee /sys/bus/cpu/devices/cpu1/online
  $ echo 0 | sudo tee /sys/bus/cpu/devices/cpu2/online
  $ lscpu
  CPU(s): 4
  On-line CPU(s) list: 0,3
  Off-line CPU(s) list: 1,2
3. install, start and follow the log of numad
  $ sudo apt install numad
  $ sudo systemctl start numad
  $ journalctl -f -u numad
4. run some memory load that will make numad assign processes
  $ sudo apt install stress-ng
  $ stress-ng --vm 2 --vm-bytes 90% -t 5m

If we follow the log of numad with verbose enabled we will after a while see numa assignments like:
Mon Jun 17 10:32:05 2019: Advising pid 3416 (stress-ng-vm) move from nodes (0-1) to nodes (0)
Mon Jun 17 10:32:23 2019: Advising pid 3417 (stress-ng-vm) move from nodes (0-1) to nodes (1)

Maybe on ppc also the numa node numbering is non linear, I remember working on fixes for numactl in that regard - and maybe that is important as well.

While this numbering is pretty common at power (all non SMT systems) and s390x (scaling #cpus on load) it is uncommon on x86. Never the less in theory the issue should exist there as well
But I tried this for an hour and it didn't trigger (plenty of assigns happened)

Repro (x86)
1. Get a KVM guest with numa memory nodes
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cpu>
    <numa>
      <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>                                                                                                                                   
      <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>                                                                                                                                   
    </numa>                                                                                                                                                                                   
  </cpu> 
2. disable some cpus in the mid
  $ echo 0 | sudo tee  /sys/bus/cpu/devices/cpu1/online
  $ echo 0 | sudo tee  /sys/bus/cpu/devices/cpu2/online
  $ lscpu
  CPU(s):               4
  On-line CPU(s) list:  0,3
  Off-line CPU(s) list: 1,2
3. install, start and follow the log of numad
  $ sudo apt install numad
  $ sudo systemctl start numad
  $ journalctl -f -u numad
4. run some memory load that will make numad assign processes
  $ sudo apt install stress-ng
  $ stress-ng --vm 2 --vm-bytes 90% -t 5m

If we follow the log of numad with verbose enabled we will after a while see numa assignments like:
Mon Jun 17 10:32:05 2019: Advising pid 3416 (stress-ng-vm) move from nodes (0-1) to nodes (0)
Mon Jun 17 10:32:23 2019: Advising pid 3417 (stress-ng-vm) move from nodes (0-1) to nodes (1)

Maybe on ppc also the numa node numbering is non linear, I remember working on fixes for numactl in that regard - and maybe that is important as well.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#15

I have made a test build with the fix available at PPA [1]. It resolves the issue for me, but before going further please give that a try with your setups as well.

Further I opened a PR for upstream at [2] to discuss it there as well.
Feel free to chime in and give it a +1 there if it works well for you.

[1]: https://launchpad.net/~paelzer/+archive/ubuntu/bug-1832915-numad-debugging
[2]: https://pagure.io/numad/pull-request/3

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-17:

#16

@JFH/Manjo - the bug assignment is odd can you please set it up the way you need it to reflect that we are waiting on Upstream (ack on PR) and IBM (test PPA) ?

Manoj Iyer (manjo) on 2019-06-17

Changed in numad (Ubuntu):
assignee:	Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Server Team (canonical-server)
Changed in ubuntu-power-systems:
assignee:	Manoj Iyer (manjo) → Canonical Server Team (canonical-server)
status:	Confirmed → Incomplete
Changed in numad (Ubuntu):
status:	Confirmed → Incomplete
Changed in ubuntu-power-systems:
importance:	Undecided → High
Changed in numad (Ubuntu):
importance:	Undecided → High

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-19:

#17

Reported to Debian (linked above) and prepared an MP for Eoan for team review.

But still waiting for your ok @IBM that this solves your case.

Revision history for this message

bugproxy (bugproxy) wrote on 2019-06-20: Comment bridged from LTC Bugzilla

#18

------- Comment From <email address hidden> 2019-06-20 07:00 EDT-------
(In reply to comment #22)
> Reported to Debian (linked above) and prepared an MP for Eoan for team
> review.
>
> But still waiting for your ok @IBM that this solves your case.

It does not solve the issue.

> Updated numad from ppa:

# dpkg -l | grep numad
ii numad 0.5+20150602-5ubuntu0.1~ppa2 ppc64el User-level daemon that monitors NUMA topology and usage

# service numad status
? numad.service - numad - The NUMA daemon that manages application locality.
Loaded: loaded (/lib/systemd/system/numad.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2019-06-20 06:50:13 EDT; 1s ago
Docs: man:numad
Process: 13844 ExecStart=/usr/bin/numad $DAEMON_ARGS -i 15 (code=exited, status=0/SUCCESS)
Main PID: 13845 (numad)
CGroup: /system.slice/numad.service
??13845 /usr/bin/numad -i 15

> now started a KVM guest, ran `stress-ng -c 14 -vm 10` inside guest

> in few minutes numad crashed on host

Jun 20 06:56:15 localhost kernel: [ 2916.371332] numad[13845]: segfault (11) at 1b1ea0308 nip 7fffb56cac20 lr 7fffb56cf85c code 1 in libc-2.27.so[7fffb5620000+210000]
Jun 20 06:56:15 localhost kernel: [ 2916.371352] numad[13845]: code: 3be30058 38c30010 f821ffc1 91230008 38000000 7c2004ac 7d2030a9 7c0031ad
Jun 20 06:56:15 localhost kernel: [ 2916.371354] numad[13845]: code: 40c2fff8 4c00012c 2fa90000 419e0144 <e9090008> 550ae13e 394afffe 794a1f48
Jun 20 06:56:16 localhost systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Jun 20 06:56:16 localhost systemd[1]: numad.service: Failed with result 'core-dump'.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-06-21:

#19

Interesting, for me the issue was no more reproducible with the fix applied.

Maybe there is another bug in the same code that you hit now.
Could you tell me all details about the involved setup in triggering this crash still?

Further this should have created a crash dump /var/crash/.
Probably best is to clean this up, let it crash again and then attach the crash here so I can take a look at where/why it might crash still for you.

Bug Watch Updater (bug-watch-updater) on 2019-06-21

Changed in numad (Debian):
status:	Unknown → New

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-02:

#20

Hi,
any updates on this one?

All I could reproduce would be fixed with the suggested change, but since according to you that isn't sufficient I now need you to debug your case and/suggest add whatever change on top that you need.

After fixing the bug that I could identify I'd hate if this goes into "waiting forever" for some extra issue that you have with it.

This is incomplete until further info is proviede, you can
a) provide info how this might be reproducible for me as well
b) provide patches that fix your issue
c) like the fix that I have to at least fix some issues - we can push that and you can spawn another bug for the follow on issue that you have identified.

Please let me know what you need/prefer.

Revision history for this message

bugproxy (bugproxy) wrote on 2019-07-12:

#21

------- Comment From <email address hidden> 2019-07-12 04:50 EDT-------
Since the machine which I had is being used for other testing.. I setup another machine to test this again with all latest level + the numad from ppa..

# dpkg -l | grep numad
ii numad 0.5+20150602-5ubuntu0.1~ppa2 ppc64el User-level daemon that monitors NUMA topology and usage
# uname -r
4.15.0-54-generic

Numad crash issue is fixed. I am not able to hit any crashes now...

Revision history for this message

Frank Heimes (fheimes) wrote on 2019-07-12:

#22

@bssrikanth many thanks for testing and feedback!

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-12:

#23

Ok, thanks bssrikanth!

That means we can go on with the SRU.

I'm still sort of frightened by upstream numad seeming dead, but the fix seems clear and now is confirmed to work for you which allows us to go on.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-12:

#24

Uploaded to Eoan ...

Changed in numad (Ubuntu Cosmic):
status:	New → Won't Fix

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-12:

#25

Two new MPs for Bionic/Disco uploads:
- https://code.launchpad.net/~paelzer/ubuntu/+source/numad/+git/numad/+merge/370043
- https://code.launchpad.net/~paelzer/ubuntu/+source/numad/+git/numad/+merge/370044

Revision history for this message

Launchpad Janitor (janitor) wrote on 2019-07-12:

#26

This bug was fixed in the package numad - 0.5+20150602-5ubuntu1

---------------
numad (0.5+20150602-5ubuntu1) eoan; urgency=medium

* d/p/lp-1832915-fix-sparse-node-ids.patch: fix a crash on ppc64el
(LP: #1832915)

-- Christian Ehrhardt <email address hidden> Wed, 19 Jun 2019 13:05:33 +0200

Changed in numad (Ubuntu Eoan):
status:	Incomplete → Fix Released

Frank Heimes (fheimes) on 2019-07-12

Changed in ubuntu-power-systems:
status:	Incomplete → In Progress

Christian Ehrhardt  (paelzer) on 2019-07-16

description:	updated
Changed in numad (Ubuntu Bionic):
status:	New → In Progress
Changed in numad (Ubuntu Disco):
status:	New → In Progress

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-16:

#27

MP reviews complete, uploaded to Bionic/Disco unapproved

Revision history for this message

Brian Murray (brian-murray) wrote on 2019-07-16: Please test proposed package

#28

Hello bugproxy, or anyone else affected,

Accepted numad into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/numad/0.5+20150602-5ubuntu0.19.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in numad (Ubuntu Disco):
status:	In Progress → Fix Committed
tags:	added: verification-needed verification-needed-disco
Changed in numad (Ubuntu Bionic):
status:	In Progress → Fix Committed
tags:	added: verification-needed-bionic

Revision history for this message

Brian Murray (brian-murray) wrote on 2019-07-16:

#29

Hello bugproxy, or anyone else affected,

Accepted numad into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/numad/0.5+20150602-5ubuntu0.18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Frank Heimes (fheimes) on 2019-07-16

Changed in ubuntu-power-systems:
status:	In Progress → Fix Committed

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#30

Took a P9 system which has spares nodes:
$ ll /sys/bus/node/devices/node*
lrwxrwxrwx 1 root root 0 Jul 17 06:42 /sys/bus/node/devices/node0 -> ../../../devices/system/node/node0/
lrwxrwxrwx 1 root root 0 Jul 17 06:42 /sys/bus/node/devices/node8 -> ../../../devices/system/node/node8/

Install and start numad
$ apt install numad
$ systemctl start numad

Start a KVM guest with 100 CPUs and 64G memory
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=ppc64el label=daily release=eoan
$ uvt-kvm create --memory $((64*1024)) --cpu 100 --password ubuntu eoan arch=ppc64el release=eoan label=daily

Even without putting pressure on the memory we see the expected crash:

Jul 17 08:57:51 dradis kernel: numad[8341]: unhandled signal 11 at 0000712686320e90 nip 000071268451058c lr 00007126845132c0 code 1
Jul 17 08:57:52 dradis systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Jul 17 08:57:52 dradis systemd[1]: numad.service: Failed with result 'core-dump'.

Installing from proposed.
numad/bionic-proposed 0.5+20150602-5ubuntu0.18.04.1 ppc64el [upgradable from: 0.5+20150602-5]

Starting the numad service again and tracking the logs.

1. start guest
2. While that is going on putting some memory pressure on the guest with stressapptest

This time I was able to again trigger a crash with this setup despite using proposed.
Maybe I hit what you had when testing the PPA before.
It seems to occur more rarely but still reliable enough, but I'll try to collect debug data - maybe we find the further issue that is in here as well.

Lets call this verification failed for now, debug and potentially respin the fix to an extended one in Eoan and then reconsider.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#31

New crash that as found is:
#0 0x000002375f1bd2c4 in pick_numa_nodes (pid=<optimized out>, cpus=<optimized out>, mbs=<optimized out>, assume_enough_cpus=<optimized out>) at numad.c:1796
  1791: numad_log(LOG_DEBUG, "Interleaved MBs: %ld\n", ix, p->process_MBs[ix]);
  1792: } else {
  1793: numad_log(LOG_DEBUG, "PROCESS_MBs[%d]: %ld\n", ix, p->process_MBs[ix]);
  1794: }
  1795: }
  1796: if (ID_IS_IN_LIST(ix, p->node_list_p)) {
  1797: proc_avg_node_CPUs_free += node[ix].CPUs_free;
  1798: }
  1799: }
  1800: proc_avg_node_CPUs_free /= NUM_IDS_IN_LIST(p->node_list_p);
  1801: if ((process_has_interleaved_memory) && (keep_interleaved_memory)) {
#1 0x0000000000000000 in ?? ()

That already smells like a different symptom due to the same root cause (sparse node IDs)
Most likely the node[ix] access.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#32

Since this is constructed like:
ADD_ID_TO_LIST(node[0].node_id, target_node_list_p);
I guess this delivers 0 and then 8 in my system
== the node_id instead of the index.

1796 if (ID_IS_IN_LIST(ix, p->node_list_p)) {
1797 proc_avg_node_CPUs_free += node[ix].CPUs_free;
1798 }

While the indexes are 0, 1

I think we'd want to convert our node_id->array-index mapper to a function and use it in the place we fixed first and this one. To then retest this.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#33

Hmm, no this must be different.

This is doing:
for (int ix = 0; (ix <= num_nodes); ix++) {

which essentially is 0,1,2
The 2 is odd here, but it seems to break already at
1796 if (ID_IS_IN_LIST(ix, p->node_list_p)) {

and the latter array access would be fine as ix is currently zero
1797 proc_avg_node_CPUs_free += node[ix].CPUs_free;

I need to disable optimization again to make more sense of it ...

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#34

Hit another crash:
static id_list_p cpu_bind_list_p;
CLEAR_CPU_LIST(cpu_bind_list_p);

But this is a malloc.c(16) it seems this system currently is broken in general.
tcache_get really shouldn't fail here.
Also I have seen hang_checks in dmesg.

I'll redeploy and give all of this a new try.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#35

It seems on the former deployment I hit some memory bug which broke and stalled quite some allocations. While I haven't found what was causing that (would be an interesting bug report) the renewed systems seems good.

And in that environment I was able to verify the fix just as expected.
Sorry for the noise before, but I try to take these verification serious :-/

With the fix in place I see a correct movement to node 8 for example:
Wed Jul 17 12:33:16 2019: Advising pid 47693 (qemu-system-ppc) move from nodes (0,8) to nodes (8)
Wed Jul 17 12:33:16 2019: PID 47693 moved to node(s) 8 in 0.0 seconds
Wed Jul 17 12:38:21 2019: Advising pid 47693 (qemu-system-ppc) move from nodes (8) to nodes (8)
Wed Jul 17 12:38:21 2019: PID 47693 moved to node(s) 8 in 0.0 seconds
Wed Jul 17 12:43:26 2019: Advising pid 47693 (qemu-system-ppc) move from nodes (8) to nodes (8)
Wed Jul 17 12:43:26 2019: PID 47693 moved to node(s) 8 in 0.0 seconds

Set verification ok for Bionic, upgrading to Disco for the verification there

tags:

added: verification-done-bionic
removed: verification-needed-bionic

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#36

The service worked fine even through a full release upgrade from Bionic to Debian I saw it moving processes just fine.

When on Disco I pushed some load in the guest to get more movements but things worked fine still.

Setting verified for Disco.

P.S. I also think I have found the "other" crash that I have seen, it seems to be triggered by restarting numad while a numad-guided process is active. So in the scenario here get your guest first and then restart numad. I'm filing a bug 1836913 for it so that someone can take a look at that later ...
To be sure that this isn't caused by this update (already quite sure since the place is different) I downgraded to
sudo apt install numad=0.5+20150602-5
And there this issue is triggered as well on restart.

tags:

added: verification-done verification-done-disco
removed: verification-needed verification-needed-disco

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-07-17:

#37

Summarizing the state:
- numad is universe only and IMHO in a rather bad state
  - upstream seems dead for quite some time and does not respond to my patches
- the bug reported here is fixed and verified
- numad seems to have issues on service restart (unrelated to this update)
  -> the upgrades to numad in this SRU will trigger a service restart
  -> this might trigger bug 1836913 in the wild.

@SRU team: is the insight that restart is potentially bad (already before this update) and might be triggered a reason to stop the SRU?

Frank Heimes (fheimes) on 2019-07-22

Changed in ubuntu-power-systems:
status:	Fix Committed → In Progress

Revision history for this message

Łukasz Zemczak (sil2100) wrote on 2019-07-25:

#38

I had to sit down and think about this for a moment. The bug with the service restart seems to only happen on ppc64el, which means the issue the package upgrade might trigger might have limited impact. On the other hand, the main target of this bugfix are ppc64el platforms, as those were the most likely to exhibit the original bug.

Before we release this, I would probably feel safer if we know how reproducible bug LP: #1836913 is on ppc64el, i.e. if this is only limited to this one particular device? How frequently does this happen?
Also questions like: how hard would it be to fix it?

Revision history for this message

Łukasz Zemczak (sil2100) wrote on 2019-07-25:

#39

Because why I'm worried is that the original bug was only causing issues for numad under certain conditions, but the package upgrade will trigger a restart for *all* the instances of using numad. So if numad restart will cause trouble on all ppc64el cases, I'm worried we might cause more harm with the update than we have without it. Of course it all depends on my earlier questions, maybe it's not that bad.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-08-06:

#40

@Lukasz:
Thanks Lukasz for your thoughts - you confirm my concerns.

Trying to answer your question:
- reproducible
  - it failed on our P9 machine at 100%
  - I don't have another P9 to check if it is specific to "that" machine or P9 in general
  - I was deploying a P8 system to have some comparison
    - bug 1839065 blocked me from using the same workload, so the results are unreliable at
      best
- How frequently
  - in the affected system on every restart of the service (while huge guest was active)
- easy (or not) to fix:
  - from what I saw in the traces it looked like more out of bounds access.
    if that was right then it would be (too) much changes and realyl complex as that kind was in
    many places; but towards the end I got convinced that it might have been a red herring after
    all. Never the less unless it is further debugged the complexity is somewhere between rather-
    complex and unknown

IBM was reporting this bug here for P9, maybe they are the ones with the P9 machine park (different configurations). If it is important to them they can asses and let us know about details for bug 1836913 and we might hold back this SRU for now (being unsure how often we might trigger this).
OTOH any numad service restart will trigger it, it is not that we'd add the bug with this proposed update.

For now I'd say leave it in proposed and we wait if there is IBM feedback on bug 1839065 or bug 1836913.

Christian Ehrhardt  (paelzer) on 2019-08-07

tags:

added: block-proposed

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2019-08-19:

#41

Marking as incomplete while awaiting resolution to bug 1839065 or bug 1836913.

Changed in ubuntu-power-systems:
status:	In Progress → Incomplete

Revision history for this message

Manoj Iyer (manjo) wrote on 2019-09-16:

#42

Wichita was updated with the latest Power8 firmware from IBM and is ready for your testing needs.

Current firwmare version :
P side : FW860.70 (SV860_205)
T side : FW860.70 (SV860_205)
Boot side : FW860.70 (SV860_205)

Changed in ubuntu-power-systems:
status:	Incomplete → Triaged

Łukasz Zemczak (sil2100) on 2019-09-19

tags:

added: block-proposed-bionic block-proposed-disco
removed: block-proposed

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-09-24:

#43

Yeah this is still broken on both machines, sometimes faster sometimes slower to reproduce.
So to summarize we have bug 1832915 reported and a fix created.
But we also have bug 1836913 and potentially a whole set of bugs due to the same conceptual mismatches (assumption in code: numa zones would be linearly indexed, but that isn't true on power).

And all of this on a project that seems sort of dead upstream.
We will keep things as-is for there are systems not affected by this.
going forward we will carry the patches for this bug, but knowing that there is more that will affect power systems with their numa setup.

The SRUs will go to Incomplete - as we'd need to really fix the extended issues to make the backport worth anything.

To do so one would need to spend a significant upstream dev effort on bug 1836913.
That would (if anyone) in my POV be the HW-enablement Team of the ppc64 platform.
So that would be inside IBM I guess?

Changed in numad (Ubuntu Bionic):
status:	Fix Committed → Incomplete
Changed in numad (Ubuntu Disco):
status:	Fix Committed → Incomplete
Changed in numad (Ubuntu Eoan):
status:	Fix Released → Incomplete
assignee:	Canonical Server Team (canonical-server) → nobody
Changed in ubuntu-power-systems:
assignee:	Canonical Server Team (canonical-server) → nobody
no longer affects:	numad (Ubuntu Eoan)
Changed in numad (Ubuntu):
assignee:	nobody → bugproxy (bugproxy)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-09-24:

#44

@Frank - could you make sure in the next calls that the status on these two issues is clear?

Frank Heimes (fheimes) on 2019-09-30

Changed in ubuntu-power-systems:
status:	Triaged → Incomplete

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2019-09-30:

#45

Marking as incomplete while awaiting for numad upstream Power porting work.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2019-10-21:

#46

TBH, I'd not mark this as prio high from our POV.
It is high to "know if something will come back on this" but not the actual issue.

For the wider Ubuntu community this is just a rarely used universe package with a somewhat dead upstream - nothing to stress out for IMHO.

It is somewhat important to us, if it is important to IBM.
The reason it might be important is that all of this started with bugs reported against it by IBM.
The reasons could be
a) IBM uses it (or plans to) somewhere for production then it should be important to them
b) some odd testcase was run, maybe on an outdated test definition and actually nobody cares, then I guess everyone is fine to close this as won't fix.

Changed in numad (Ubuntu Focal):
importance:	High → Low

Andrew Cloke (andrew-cloke) on 2019-12-09

Changed in ubuntu-power-systems:
importance:	High → Low

Andrew Cloke (andrew-cloke) on 2019-12-16

Changed in numad (Ubuntu Eoan):
status:	New → Incomplete

Revision history for this message

bugproxy (bugproxy) wrote on 2020-02-24: Comment bridged from LTC Bugzilla

#47

------- Comment From <email address hidden> 2020-02-24 02:11 EDT-------
root@ws-g48-2d81-host:~# uname -a
Linux ws-g48-2d81-host 5.4.0-14-generic #17-Ubuntu SMP Thu Feb 6 22:47:13 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux

root@ws-g48-2d81-host:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu Focal Fossa (development branch)"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

root@ws-g48-2d81-host:~# service numad status
? numad.service - numad - The NUMA daemon that manages application locality.
Loaded: loaded (/lib/systemd/system/numad.service; enabled; vendor preset: enabled)
Active: failed (Result: core-dump) since Mon 2020-02-24 01:08:37 EST; 51min ago
Docs: man:numad
Process: 458646 ExecStart=/usr/bin/numad $DAEMON_ARGS -i 15 (code=exited, status=0/SUCCESS)
Main PID: 458655 (code=dumped, signal=SEGV)

Feb 23 13:13:09 ws-g48-2d81-host systemd[1]: Starting numad - The NUMA daemon that manages application locality....
Feb 23 13:13:09 ws-g48-2d81-host systemd[1]: Started numad - The NUMA daemon that manages application locality..
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Failed with result 'core-dump'.

Feb 24 01:08:37 ws-g48-2d81-host kernel: [420098.990410] numad[458655]: segfault (11) at 2cd49ad1990 nip 7bafaaf6253c lr 2ccc1308580 code 1 in libc-2.30.so[7bafaaeb0000+210000]
Feb 24 01:08:37 ws-g48-2d81-host kernel: [420098.990424] numad[458655]: code: 60420000 7ba70fa4 7d0a3a2e 2c280000 4182ff14 7ba91f24 3908ffff eba10028
Feb 24 01:08:37 ws-g48-2d81-host kernel: [420098.990427] numad[458655]: code: ebe10038 7d2a4a14 38c00000 ebc90080 <e8be0000> f8a90080 7d0a3b2e f8de0008
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Failed with result 'core-dump'.

------- Comment From harihare@in.ibm.com 2020-02-24 02:11 EDT-------
root@ws-g48-2d81-host:~# uname -a
Linux ws-g48-2d81-host 5.4.0-14-generic #17-Ubuntu SMP Thu Feb 6 22:47:13 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux

root@ws-g48-2d81-host:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu Focal Fossa (development branch)"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

root@ws-g48-2d81-host:~# service numad status
? numad.service - numad - The NUMA daemon that manages application locality.
Loaded: loaded (/lib/systemd/system/numad.service; enabled; vendor preset: enabled)
Active: failed (Result: core-dump) since Mon 2020-02-24 01:08:37 EST; 51min ago
Docs: man:numad
Process: 458646 ExecStart=/usr/bin/numad $DAEMON_ARGS -i 15 (code=exited, status=0/SUCCESS)
Main PID: 458655 (code=dumped, signal=SEGV)

Feb 23 13:13:09 ws-g48-2d81-host systemd[1]: Starting numad - The NUMA daemon that manages application locality....
Feb 23 13:13:09 ws-g48-2d81-host systemd[1]: Started numad - The NUMA daemon that manages application locality..
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Failed with result 'core-dump'.

Feb 24 01:08:37 ws-g48-2d81-host kernel: [420098.990410] numad[458655]: segfault (11) at 2cd49ad1990 nip 7bafaaf6253c lr 2ccc1308580 code 1 in libc-2.30.so[7bafaaeb0000+210000]
Feb 24 01:08:37 ws-g48-2d81-host kernel: [420098.990424] numad[458655]: code: 60420000 7ba70fa4 7d0a3a2e 2c280000 4182ff14 7ba91f24 3908ffff eba10028
Feb 24 01:08:37 ws-g48-2d81-host kernel: [420098.990427] numad[458655]: code: ebe10038 7d2a4a14 38c00000 ebc90080 <e8be0000> f8a90080 7d0a3b2e f8de0008
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Main process exited, code=dumped, status=11/SEGV
Feb 24 01:08:37 ws-g48-2d81-host systemd[1]: numad.service: Failed with result 'core-dump'.

Frank Heimes (fheimes) on 2020-02-24

Changed in numad (Ubuntu Disco):
status:	Incomplete → Won't Fix

Revision history for this message

bugproxy (bugproxy) wrote on 2020-03-20:

#48

------- Comment From <email address hidden> 2020-03-20 16:35 EDT-------
Hello Canonical,

So, this is still an issue in Ubuntu 20.04, as the last test results shows. Is this something you would be willing to fix?

tags:

added: targetmilestone-inin2004
removed: targetmilestone-inin---

Revision history for this message

Frank Heimes (fheimes) wrote on 2020-03-20:

#49

Hi, this bug has a 'sister' bug: LP 1836913
The outcome on the numad discussion in the interlock calls with IBM (based on these two bugs) was that proper upstream support and fixing from IBM is needed especially for Power.
Some structural issues where identified that can't be easily fixed, there is more to do.
Please see: https://bugs.launchpad.net/ubuntu-power-systems/+bug/1836913/comments/8 etc.
So we believed that there is already some upstream work going on by the IBM Power team.
An upstream accepted version or patches can again be considered to be SRUed back to Ubuntu.

Revision history for this message

bugproxy (bugproxy) wrote on 2020-04-06:

#50

------- Comment From <email address hidden> 2020-04-06 18:06 EDT-------
Reclassifying as P3/low to match 'numad' classification.

tags:

added: severity-low
removed: severity-high

Frank Heimes (fheimes) on 2020-04-20

tags:

added: hwe-long-running

Revision history for this message

Launchpad Janitor (janitor) wrote on 2020-11-23:

#51

This bug was fixed in the package numad - 0.5+20150602-6

---------------
numad (0.5+20150602-6) unstable; urgency=medium

  [ Christian Ehrhardt ]
  * d/p/lp-1832915-fix-sparse-node-ids.patch: fix a crash on ppc64el
    (LP: #1832915)(Closes: #930725)

  [ gustavo panizzo ]
  * [0b4115] add patch from upstream repo
  * [d97937] update homepage
  * [1b3223] update vcs-* urls to point to salsa.d.o
  * [767be9] do not require root to build
  * [b6b360] use the latest debhelper-compat
  * [2eee3c] guessing a debian/watch file
  * [cecd01] update the d/gbp.conf file
  * [953d8c] increase standards version to 4.5.0
  * [094901] remove trailing spaces and comments from d/rules
  * [a267dd] use a secure uri for the copyright format
  * [577311] install a logrotate file
  * [47d07d] no longer install upstream changelog

-- gustavo panizzo <email address hidden> Fri, 20 Nov 2020 22:22:20 +0000

Changed in numad (Ubuntu):
status:	Incomplete → Fix Released

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2021-01-11:

#52

Eoan is now EOL. Marking as "won't fix".

Changed in numad (Ubuntu Eoan):
status:	Incomplete → Invalid

Frank Heimes (fheimes) on 2021-01-11

Changed in numad (Ubuntu Eoan):
status:	Invalid → Won't Fix

Mathew Hodson (mhodson) on 2021-01-22

Changed in numad (Ubuntu Bionic):
importance:	Undecided → Low
Changed in numad (Ubuntu Cosmic):
importance:	Undecided → Low
Changed in numad (Ubuntu Disco):
importance:	Undecided → Low
Changed in numad (Ubuntu Eoan):
importance:	Undecided → Low
tags:	removed: block-proposed-disco verification-done verification-done-disco

Revision history for this message

Mathew Hodson (mhodson) wrote on 2021-01-24:

#53

Setting package status based on what was released.
---

numad (0.5+20150602-5ubuntu1) eoan; urgency=medium

* d/p/lp-1832915-fix-sparse-node-ids.patch: fix a crash on ppc64el
(LP: #1832915)

-- Christian Ehrhardt <email address hidden> Wed, 19 Jun 2019 13:05:33 +0200

Changed in numad (Ubuntu Focal):
status:	Incomplete → Fix Released
Changed in numad (Ubuntu Eoan):
status:	Won't Fix → Fix Released

Bug Watch Updater (bug-watch-updater) on 2021-03-08

Changed in numad (Debian):
status:	New → Fix Released

Frank Heimes (fheimes) on 2022-12-06

Changed in numad (Ubuntu):
assignee:	bugproxy (bugproxy) → nobody
Changed in numad (Ubuntu Focal):
assignee:	bugproxy (bugproxy) → nobody

Revision history for this message

Łukasz Zemczak (sil2100) wrote on 2023-06-01: Proposed package removed from archive

#54

The version of numad in the proposed pocket of Bionic that was purported to fix this bug report has been removed because the target series has reached its End of Life.

Frank Heimes (fheimes) on 2023-06-01

Changed in ubuntu-power-systems:
status:	Incomplete → Fix Released

Revision history for this message

bugproxy (bugproxy) wrote on 2023-06-01: sosreport on host

#60

sosreport on host Edit (10.4 KiB, application/x-xz)

Default Comment by Bridge

Revision history for this message

bugproxy (bugproxy) wrote on 2023-06-01:

#61

sosreport on host Edit (10.4 KiB, application/octet-stream)

Default Comment by Bridge

Revision history for this message

bugproxy (bugproxy) wrote on 2023-06-01:

#62

sosreport on host Edit (9.4 MiB, application/octet-stream)

Default Comment by Bridge

Frank Heimes (fheimes) on 2023-06-01

Changed in numad (Ubuntu Bionic):
status:	Incomplete → Won't Fix

Ubuntu
numad package

numad crashes while running kvm guest

Bug Description

Related branches

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to
The Ubuntu-power-systems project	Fix Released	Low	Unassigned
numad (Debian)	Fix Released	Unknown	debbugs #930725
numad (Ubuntu)	Fix Released	Low	Unassigned
Bionic	Won't Fix	Low	Unassigned
Cosmic	Won't Fix	Low	Unassigned
Disco	Won't Fix	Low	Unassigned
Eoan	Fix Released	Low	Unassigned
Focal	Fix Released	Low	Unassigned

Ubuntunumad package

numad crashes while running kvm guest

Bug Description

Related branches

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
numad package