Ubuntu
linux package

Bug #1461620
Comment #1

Comment 1 for bug 1461620

Revision history for this message

Rafael David Tinoco (rafaeldtinoco) wrote on 2015-06-03:

You can follow my comments in LKML:

https://lkml.org/lkml/2015/3/6/484

"""
Basically in kernel 3.13 we are getting the follow situation:

I have a core dump locked on the same place
(state machine for powering cpu down for the task swap) from a 3.13 (+
upstream patches) and this commit wasn't backported yet.

-> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
In my case, curstate is WAY different from enum containing MULTI_STOP_EXIT (4).

my case:

PID: 118 TASK: ffff883fd28ec7d0 CPU: 9 COMMAND: "migration/9"
...
    [exception RIP: multi_cpu_stop+0x64]
    RIP: ffffffff810f5944 RSP: ffff883fd2907d98 RFLAGS: 00000246
    RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
    RDX: ffff883fd2907d98 RSI: 0000000000000000 RDI: 0000000000000001
    RBP: ffffffff810f5944 R8: ffffffff810f5944 R9: 0000000000000000
    R10: ffff883fd2907d98 R11: 0000000000000246 R12: ffffffffffffffff
    R13: ffff883f55d01b48 R14: 0000000000000000 R15: 0000000000000001
    ORIG_RAX: 0000000000000001 CS: 0010 SS: 0000
--- <NMI exception stack> ---
#4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
208 } while (curstate != MULTI_STOP_EXIT);
       ---> RIP
RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
       ---> CHECKING FOR MULTI_STOP_EXIT
RDX: ffff883fd2907d98 -> does not make any sense
###

If i'm reading this right,

"""
CPU 05 - PID 14990

do_numa_page
task_numa_fault
numa_migrate_preferred
task_numa_migrate
migrate_swap (curr: 14990, task: 14996)
stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
wait_for_completion

14990 - CPU05
14996 - CPU00

stop_two_cpus:
    multi_stop_data (msdata->state = MULTI_STOP_PREPARE)
    smp_call_function_single (min=cpu2=00, irq_cpu_stop_queue_work, wait=1)
        smp_call_function_single (ran on lowest CPU, 00 for this case)
        irq_cpu_stop_queue_work
            cpu_stop_queue_work(cpu1=05(14996)) # add work
(multi_cpu_stop) to cpu 05 cpu_stopper queue
            cpu_stop_queue_work(cpu2=00(14990)) # add work
(multi_cpu_stop) to cpu 00 cpu_stopper queue
    wait_for_completion() --> HERE
"""

in my case, checking task structs for tasks scheduled when
"waiting_for_completion()":

PID 14990 CPU 05 -> PID 14996 CPU 00
PID 14991 CPU 30 -> PID 14998 CPU 01
PID 14992 CPU 30 -> PID 14998 CPU 01
PID 14996 CPU 00 -> PID 14992 CPU 30
PID 14998 CPU 01 -> PID 14990 CPU 05

AND

THEN

I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop just
scheduled (probably in the per cpu's queue of cpus 0,1,5,30), not
running yet.

AND

I don't have any "wait_for_completion" for those "OLDER" migration
threads (6, 9, 14, 15 and 16)
Probably wait_for_completion signaled done.completion before racing.

Looks like something messed up with curstate in the "multi_cpu_stop"
state machine.
"""

You can follow my comments in LKML:

https://lkml.org/lkml/2015/3/6/484

"""
Basically in kernel 3.13 we are getting the follow situation:

I have a core dump locked on the same place
(state machine for powering cpu down for the task swap) from a 3.13 (+
upstream patches) and this commit wasn't backported yet.

-> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
In my case, curstate is WAY different from enum containing MULTI_STOP_EXIT (4).

my case:

PID: 118    TASK: ffff883fd28ec7d0  CPU: 9   COMMAND: "migration/9"
...
    [exception RIP: multi_cpu_stop+0x64]
    RIP: ffffffff810f5944  RSP: ffff883fd2907d98  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
    RDX: ffff883fd2907d98  RSI: 0000000000000000  RDI: 0000000000000001
    RBP: ffffffff810f5944   R8: ffffffff810f5944   R9: 0000000000000000
    R10: ffff883fd2907d98  R11: 0000000000000246  R12: ffffffffffffffff
    R13: ffff883f55d01b48  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: 0000000000000001  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
208              } while (curstate != MULTI_STOP_EXIT);
       ---> RIP
RIP 0xffffffff810f5944 <+100>:   cmp    $0x4,%edx
       ---> CHECKING FOR MULTI_STOP_EXIT
RDX: ffff883fd2907d98 -> does not make any sense
###

If i'm reading this right,

"""
CPU 05 - PID 14990

do_numa_page
task_numa_fault
numa_migrate_preferred
task_numa_migrate
migrate_swap (curr: 14990, task: 14996)
stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
wait_for_completion

14990 - CPU05
14996 - CPU00

in my case, checking task structs for tasks scheduled when
"waiting_for_completion()":

PID 14990 CPU 05 -> PID 14996 CPU 00
PID 14991 CPU 30 -> PID 14998 CPU 01
PID 14992 CPU 30 -> PID 14998 CPU 01
PID 14996 CPU 00 -> PID 14992 CPU 30
PID 14998 CPU 01 -> PID 14990 CPU 05

AND

>   102      2   6  ffff881fd2ea97f0  RU   0.0       0      0  [migration/6]
>   118      2   9  ffff883fd28ec7d0  RU   0.0       0      0  [migration/9]
>   143      2  14  ffff883fd29d47d0  RU   0.0       0      0  [migration/14]
>   148      2  15  ffff883fd29fc7d0  RU   0.0       0      0  [migration/15]
>   153      2  16  ffff881fd2f517f0  RU   0.0       0      0  [migration/16]

THEN

I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop just
scheduled (probably in the per cpu's queue of cpus 0,1,5,30), not
running yet.

AND

I don't have any "wait_for_completion" for those "OLDER" migration
threads (6, 9, 14, 15 and 16)
Probably wait_for_completion signaled done.completion before racing.

Looks like something messed up with curstate in the "multi_cpu_stop"
state machine.
"""

Ubuntulinux package

Comment 1 for bug 1461620

Ubuntu
linux package