Comment 7 for bug 1832915

Christian Ehrhardt  (paelzer) wrote :

(gdb) p cpu_bind_list_p->bytes
$5 = 24
(gdb) p *(cpu_bind_list_p->set_p)
$7 = {__bits = {1229782938247303441, 4369, 0, 49, 1955697441360, 274, 303148778372988952, 139284342967816, 48, 337, 1955697440432, 1955697400112, 1955697400080, 1955697400048,
    1955697400016, 1955697400512}}
(gdb) p sizeof(*(cpu_bind_list_p->set_p))
$8 = 128

See the size mismatch?
The list records an allocation of 24 bytes for the set, but a full cpu_set_t is 128 bytes.
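
As a rough cross-check of where those two numbers can come from, here is a minimal sketch of the size arithmetic. It assumes a 64-bit Linux system where the mask word is an unsigned long, and it mirrors the rounding that glibc's CPU_ALLOC_SIZE() performs rather than calling it. The function names are made up for illustration and are not part of numad.

```c
#include <assert.h>
#include <stddef.h>

/* Bits in one mask word; assumed to be unsigned long as in glibc. */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Bytes needed for a dynamically sized CPU set: the CPU count is rounded
 * up to whole mask words (same arithmetic as CPU_ALLOC_SIZE()). */
size_t dynamic_set_bytes(size_t num_cpus)
{
    assert(num_cpus > 0);
    return ((num_cpus + BITS_PER_WORD - 1) / BITS_PER_WORD)
           * sizeof(unsigned long);
}

/* Bytes in a fixed-size cpu_set_t, which holds 1024 CPU bits by default. */
size_t fixed_set_bytes(void)
{
    return 1024 / 8;
}
```

Under those assumptions dynamic_set_bytes(160) gives 24 and fixed_set_bytes() gives 128, which matches the gdb output above.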

I think this is a bad initialization.
We have these operations in the code on cpu_bind_list_p.

static id_list_p cpu_bind_list_p;
CLEAR_CPU_LIST(cpu_bind_list_p);
OR_LISTS(cpu_bind_list_p, cpu_bind_list_p, node[node_id].cpu_list_p);

Now CLEAR_CPU_LIST has some init code, but it only runs if the list pointer is NULL.
#define CLEAR_CPU_LIST(list_p) \
    if (list_p == NULL) { \
        INIT_ID_LIST(list_p, num_cpus); \
    } \
    CPU_ZERO_S(list_p->bytes, list_p->set_p)

If that static var is not NULL when we get here, for example because it still holds stale old data, the init code is skipped and we can't rely on its contents.
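
To illustrate that failure mode, here is a minimal sketch with simplified stand-ins. The struct layout, clear_list() and bytes_after_reuse() are hypothetical and only imitate the shape of CLEAR_CPU_LIST with plain malloc/memset. Whether numad actually reaches such a path is not confirmed here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for numad's id_list type. */
typedef struct id_list {
    unsigned long *set_p;  /* bit mask storage */
    size_t bytes;          /* size of that storage in bytes */
} id_list_t, *id_list_p;

/* Round a bit count up to whole unsigned long words, in bytes. */
static size_t mask_bytes(size_t num_ids)
{
    size_t bits = 8 * sizeof(unsigned long);
    return ((num_ids + bits - 1) / bits) * sizeof(unsigned long);
}

/* Same shape as CLEAR_CPU_LIST: allocate only if the pointer is NULL,
 * then zero whatever size the list already records. */
void clear_list(id_list_p *list_pp, size_t num_ids)
{
    if (*list_pp == NULL) {
        id_list_p l = malloc(sizeof(*l));
        assert(l != NULL);
        l->bytes = mask_bytes(num_ids);
        l->set_p = malloc(l->bytes);
        assert(l->set_p != NULL);
        *list_pp = l;
    }
    memset((*list_pp)->set_p, 0, (*list_pp)->bytes);
}

/* Clear a list for old_ids, then again for new_ids, and report the
 * size it ends up with. The second call skips the init. */
size_t bytes_after_reuse(size_t old_ids, size_t new_ids)
{
    id_list_p list = NULL;
    clear_list(&list, old_ids);
    clear_list(&list, new_ids);  /* not NULL anymore: old size is kept */
    size_t bytes = list->bytes;
    free(list->set_p);
    free(list);
    return bytes;
}
```

On a 64-bit system, bytes_after_reuse(40, 160) returns 8 instead of the 24 bytes that 160 ids would need, because the second clear never re-runs the init.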

Note: The other possible source of errors is the mismatch between the 40 active CPUs and the 160 potential CPUs (SMT off) that I have in my system.
The size is derived from num_cpus, so if that detection is off it might fail as well.
But at least in all my crashes that value was ok.
(gdb) p num_cpus
$10 = 160

So let's assume it is the lack of (re)initialization for now.
Other variables of type "id_list_p" are all explicitly initialized with NULL btw.
Like:
  id_list_p all_cpus_list_p = NULL;
  id_list_p all_nodes_list_p = NULL;
  id_list_p reserved_cpu_mask_list_p = NULL;