NUMA task migration race condition due to stop task not being checked when balancing happens
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Fix Released | Medium | Rafael David Tinoco |
Vivid | Fix Released | Undecided | Rafael David Tinoco |
Bug Description
SRU Justification:
Impact:
- Deadlock when migrating processes between NUMA domains.
- Reported with one kernel dump that was given to me.
- Hard to trigger.
Fix:
- Upstream development after upstream discussion.
- Discussion: https:/
- commit b17718d02f54b90
Testcase:
- Stress test in a virtual NUMA environment
- Wait indefinitely... Hard to trigger
- https:/
- Can, at least, make sure the logic did not introduce regression
----
The following kernel panic was brought to my attention:
"""
[3367068.076488] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
[3367068.092735] BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:153]
[3367068.100368] Modules linked in: iptable_raw xt_nat xt_REDIRECT veth openvswitch(OF) gre vxlan ip_tunnel libcrc32c dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ipmi_devintf 8021q garp stp mrp llc bonding x86_pkg_
[3367068.100409] libfc megaraid_sas pps_core mdio usbhid scsi_transport_fc hid enic scsi_tgt
[3367068.100415] CPU: 16 PID: 153 Comm: migration/16 Tainted: GF O 3.13.0-34-generic #60-Ubuntu
[3367068.100417] Hardware name: Cisco Systems Inc UCSC-C220-
[3367068.100419] task: ffff881fd2f517f0 ti: ffff881fd2f1c000 task.ti: ffff881fd2f1c000
[3367068.100420] RIP: 0010:[<
[3367068.100426] RSP: 0000:ffff881fd2
[3367068.100427] RAX: ffffffff8180af40 RBX: 0000000000000086 RCX: 000000000000a402
[3367068.100428] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff883e607edb48
[3367068.100430] RBP: ffff881fd2f1ddb8 R08: 0000000000000282 R09: 0000000000000001
[3367068.100431] R10: 000000000000b6d8 R11: ffff881fc374dc80 R12: 0000000000014440
[3367068.100432] R13: ffff881fd291ae00 R14: ffff881fd291ae08 R15: 0000000200000010
[3367068.100433] FS: 000000000000000
[3367068.100434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3367068.100435] CR2: 00007f6202134b98 CR3: 0000000001c0e000 CR4: 00000000001407e0
[3367068.100437] Stack:
[3367068.100438] ffff883e607edb70 ffff881fffd0ede0 ffff881fffd0ede8 ffff883e607edb48
[3367068.100441] ffff881fd2f1de78 ffffffff810f5b5e ffffffff8109dfc4 ffff881fffd14440
[3367068.100443] ffff881fd2f1de08 ffffffff81097508 0000000000000000 ffff881fffd14440
[3367068.100446] Call Trace:
[3367068.100450] [<ffffffff810f5
[3367068.100454] [<ffffffff8109d
[3367068.100458] [<ffffffff81097
[3367068.100462] [<ffffffff8171f
[3367068.100465] [<ffffffff81092
[3367068.100467] [<ffffffff81092
[3367068.100470] [<ffffffff8108b
[3367068.100473] [<ffffffff8108b
[3367068.100477] [<ffffffff8172c
[3367068.100479] [<ffffffff8108b
[3367068.100480] Code: db 85 db 41 0f 95 c5 31 f6 31 d2 eb 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da <83> fa 04 74 3d f3 90 41 8b 5c 24 20 39 d3 74 f0 83 fb 02 75 d7
"""
In the first comments I explain WHY this is happening and HOW to fix it.
Changed in linux (Ubuntu):
status: In Progress → Invalid
Changed in linux (Ubuntu Trusty):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Changed in linux (Ubuntu Vivid):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
description: updated
Changed in linux (Ubuntu):
status: Invalid → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed
tags: added: sts
tags: added: cts
You can follow my comments on LKML:
https://lkml.org/lkml/2015/3/6/484
"""
Basically, in kernel 3.13 we are getting the following situation:
I have a core dump locked up in the same place
(the state machine for powering the cpu down for the task swap) from a 3.13
kernel (+ upstream patches) to which this commit had not been backported yet.
-> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
In my case, curstate is WAY outside the enum that contains MULTI_STOP_EXIT (4).
The register is totally messed up (probably after cpu_relax(), right where
you were trapped -> after the pause instruction).
my case:
PID: 118 TASK: ffff883fd28ec7d0 CPU: 9 COMMAND: "migration/9"
...
[exception RIP: multi_cpu_stop+0x64]
RIP: ffffffff810f5944 RSP: ffff883fd2907d98 RFLAGS: 00000246
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
RDX: ffff883fd2907d98 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffffffff810f5944 R8: ffffffff810f5944 R9: 0000000000000000
R10: ffff883fd2907d98 R11: 0000000000000246 R12: ffffffffffffffff
R13: ffff883f55d01b48 R14: 0000000000000000 R15: 0000000000000001
ORIG_RAX: 0000000000000001 CS: 0010 SS: 0000
--- <NMI exception stack> ---
#4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
208 } while (curstate != MULTI_STOP_EXIT);
---> RIP
RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
---> CHECKING FOR MULTI_STOP_EXIT
RDX: ffff883fd2907d98 -> does not make any sense
###
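To make the loop above concrete, here is a minimal Python sketch of the acknowledgement scheme behind multi_cpu_stop, as I understand it: every participating stopper thread has to acknowledge the current state before the shared state advances, so a participant that never gets to run leaves the others spinning short of MULTI_STOP_EXIT. Only MULTI_STOP_PREPARE and MULTI_STOP_EXIT (4) are named in this report; the remaining state names, `MultiStopState` and `run_multi_stop` are assumptions made for illustration, not kernel code.

```python
from enum import IntEnum


class MultiStopState(IntEnum):
    # Only PREPARE and EXIT (4) are named in this report; the other
    # names and values are assumed for the sake of the sketch.
    NONE = 0
    PREPARE = 1
    DISABLE_IRQ = 2
    RUN = 3
    EXIT = 4


def run_multi_stop(num_threads, running, max_rounds=100):
    """Simulate the shared state of one multi-CPU stop.

    num_threads: how many stopper threads must acknowledge each state.
    running: set of thread ids that actually get scheduled.
    Returns the state reached after at most max_rounds polling rounds.
    """
    state = MultiStopState.PREPARE
    acked = set()
    for _ in range(max_rounds):
        if state == MultiStopState.EXIT:
            break
        # Every scheduled thread acknowledges the current state.
        acked |= set(running)
        if len(acked) == num_threads:
            # All participants acknowledged: advance to the next state.
            state = MultiStopState(state + 1)
            acked = set()
    return state
```

With both threads scheduled, `run_multi_stop(2, {0, 1})` reaches `MultiStopState.EXIT`; with only one of them scheduled, `run_multi_stop(2, {0})` stays at `MultiStopState.PREPARE`, which is the kind of endless spin a soft lockup detector would flag.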
If I'm reading this right,
"""
CPU 05 - PID 14990
do_numa_page
task_numa_fault
numa_migrate_preferred
task_numa_migrate
migrate_swap (curr: 14990, task: 14996)
stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
wait_for_completion
14990 - CPU05
14996 - CPU00
stop_two_cpus:
multi_stop_data (msdata->state = MULTI_STOP_PREPARE)
smp_call_function_single (min=cpu2=00, irq_cpu_stop_queue_work, wait=1)
smp_call_function_single (ran on lowest CPU, 00 for this case)
irq_cpu_stop_queue_work
cpu_stop_queue_work(cpu1=05(14996)) # add work (multi_cpu_stop) to cpu 05 cpu_stopper queue
cpu_stop_queue_work(cpu2=00(14990)) # add work (multi_cpu_stop) to cpu 00 cpu_stopper queue
wait_for_completion() --> HERE
"""
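The queueing steps in the trace above can be modelled with a small Python sketch. It assumes a simplified picture: each CPU has a FIFO stopper queue, each stop_two_cpus call adds one work item to two queues, and a work item can only finish when it is at the head of both of its queues at the same time. The function name `run_stoppers` and the work ids "A" and "B" are hypothetical, and this is one way such a hang can arise in principle, not a statement about what the dump proves.

```python
from collections import deque


def run_stoppers(queues, max_steps=100):
    """Simulate per-CPU stopper queues.

    queues: dict mapping cpu -> list of work ids in queue order.
    Each work id is expected to appear on exactly two CPUs.
    Returns (list of completed work ids, True if the system got stuck).
    """
    qs = {cpu: deque(items) for cpu, items in queues.items()}
    done = []
    for _ in range(max_steps):
        # Collect which CPUs currently have each work item at the head.
        heads = {}
        for cpu, q in qs.items():
            if q:
                heads.setdefault(q[0], []).append(cpu)
        if not heads:
            return done, False  # every queue drained
        # A work item can run only when both of its CPUs are on it.
        ready = [w for w, cpus in heads.items() if len(cpus) == 2]
        if not ready:
            return done, True  # each CPU waits for a partner that is busy
        for w in ready:
            done.append(w)
            for cpu in heads[w]:
                qs[cpu].popleft()
    return done, True
```

If two swaps are queued in the same order on both CPUs, `run_stoppers({0: ["A", "B"], 1: ["A", "B"]})` completes both. If the order differs per CPU, `run_stoppers({0: ["A", "B"], 1: ["B", "A"]})` reports a stuck system with nothing completed.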
In my case, checking the task structs for tasks scheduled when "waiting_for_completion()":
PID 14990 CPU 05 -> PID 14996 CPU 00
PID 14991 CPU 30 -> PID 14998 CPU 01
PID 14992 CPU 30 -> PID 14998 CPU 01
PID 14996 CPU 00 -> PID 14992 CPU 30
PID 14998 CPU 01 -> PID 14990 CPU 05
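A quick way to check whether the pairs listed above chain into a loop is to follow the arrows from one PID until a PID repeats. The Python sketch below assumes each arrow can be read as "this task's swap is waiting on that task"; that reading, along with the names `waits_on` and `find_cycle`, is my assumption for illustration.

```python
# PID -> PID pairs copied from the list above.
waits_on = {
    14990: 14996,
    14991: 14998,
    14992: 14998,
    14996: 14992,
    14998: 14990,
}


def find_cycle(graph, start):
    """Follow graph edges from start.

    Returns the list of nodes forming the cycle that is reached,
    or an empty list if the chain ends without repeating a node.
    """
    seen = {}  # node -> position in path
    path = []
    node = start
    while node in graph and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = graph[node]
    if node in seen:
        return path[seen[node]:]
    return []
```

Starting from 14990, `find_cycle(waits_on, 14990)` returns `[14990, 14996, 14992, 14998]`: under the assumed reading, four of the five tasks form a closed loop, and 14991 feeds into it.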
AND
> 102 2 6 ffff881fd2ea97f0 RU 0.0 0 0 [migration/6]
> 118 2 9 ffff883fd28ec7d0 RU 0.0 0 0 [migration/9]
> 143 2 14 ffff883fd29d47d0 RU 0.0 0 0 [migration/14]
> 148 2 15 ffff883fd29fc7d0 RU 0.0 0 0 [migration/15]
> 153 2 16 ffff881fd2f517f0 RU 0.0 0 0 [migration/16]
THEN
I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop works that were
just scheduled (probably sitting in the per-CPU queues of CPUs 0, 1, 5 and 30)
and are not running yet.
AND
I don't have any "wait_for_completion" for those "OLDER" migration
threads (6, 9, 14, 15 and 16)
Probably wait_for_completion s...