NUMA task migration race condition due to stop task not being checked when balancing happens
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | | Undecided | Unassigned | |
| Trusty | | Medium | Rafael David Tinoco | |
| Vivid | | Undecided | Rafael David Tinoco | |
Bug Description
SRU Justification:
Impact:
- Deadlock when migrating processes between NUMA domains.
- Came with 1 kernel dump given to me.
- Hard to trigger.
Fix:
- Upstream development after upstream discussion.
- Discussion: https:/
- commit b17718d02f54b90
Testcase:
- Stress test in a virtual NUMA environment
- Wait indefinitely... Hard to trigger
- https:/
- Can, at least, make sure the logic did not introduce regression
----
The following kernel panic was brought to my attention:
"""
[3367068.076488] Code: 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da 83 fa 04 74 3d f3 90 41 8b 5c 24 20 <39> d3 74 f0 83 fb 02 75 d7 fa 66 0f 1f 44 00 00 eb d8 66 0f 1f
[3367068.092735] BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:153]
[3367068.100368] Modules linked in: iptable_raw xt_nat xt_REDIRECT veth openvswitch(OF) gre vxlan ip_tunnel libcrc32c dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ipmi_devintf 8021q garp stp mrp llc bonding x86_pkg_
[3367068.100409] libfc megaraid_sas pps_core mdio usbhid scsi_transport_fc hid enic scsi_tgt
[3367068.100415] CPU: 16 PID: 153 Comm: migration/16 Tainted: GF O 3.13.0-34-generic #60-Ubuntu
[3367068.100417] Hardware name: Cisco Systems Inc UCSC-C220-
[3367068.100419] task: ffff881fd2f517f0 ti: ffff881fd2f1c000 task.ti: ffff881fd2f1c000
[3367068.100420] RIP: 0010:[<
[3367068.100426] RSP: 0000:ffff881fd2
[3367068.100427] RAX: ffffffff8180af40 RBX: 0000000000000086 RCX: 000000000000a402
[3367068.100428] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff883e607edb48
[3367068.100430] RBP: ffff881fd2f1ddb8 R08: 0000000000000282 R09: 0000000000000001
[3367068.100431] R10: 000000000000b6d8 R11: ffff881fc374dc80 R12: 0000000000014440
[3367068.100432] R13: ffff881fd291ae00 R14: ffff881fd291ae08 R15: 0000000200000010
[3367068.100433] FS: 000000000000000
[3367068.100434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3367068.100435] CR2: 00007f6202134b98 CR3: 0000000001c0e000 CR4: 00000000001407e0
[3367068.100437] Stack:
[3367068.100438] ffff883e607edb70 ffff881fffd0ede0 ffff881fffd0ede8 ffff883e607edb48
[3367068.100441] ffff881fd2f1de78 ffffffff810f5b5e ffffffff8109dfc4 ffff881fffd14440
[3367068.100443] ffff881fd2f1de08 ffffffff81097508 0000000000000000 ffff881fffd14440
[3367068.100446] Call Trace:
[3367068.100450] [<ffffffff810f5
[3367068.100454] [<ffffffff8109d
[3367068.100458] [<ffffffff81097
[3367068.100462] [<ffffffff8171f
[3367068.100465] [<ffffffff81092
[3367068.100467] [<ffffffff81092
[3367068.100470] [<ffffffff8108b
[3367068.100473] [<ffffffff8108b
[3367068.100477] [<ffffffff8172c
[3367068.100479] [<ffffffff8108b
[3367068.100480] Code: db 85 db 41 0f 95 c5 31 f6 31 d2 eb 23 66 2e 0f 1f 84 00 00 00 00 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da <83> fa 04 74 3d f3 90 41 8b 5c 24 20 39 d3 74 f0 83 fb 02 75 d7
"""
I explain WHY this is happening, and HOW to fix it, in the first comments.
| Rafael David Tinoco (inaddy) wrote : | #1 |
| Changed in linux (Ubuntu): | |
| status: | New → In Progress |
| assignee: | nobody → Rafael David Tinoco (inaddy) |
| Rafael David Tinoco (inaddy) wrote : | #2 |
Sasha pointed me to a fix for this particular behaviour, in between 3.16 and 3.17:
https:/
[PATCH] sched: Checking for stop task appearance when balancing happens
This indicates that my previous observation:
"""
--- <NMI exception stack> ---
#4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
208 } while (curstate != MULTI_STOP_EXIT);
---> RIP
RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
---> CHECKING FOR MULTI_STOP_EXIT
RDX: ffff883fd2907d98 -> does not make any sense
"""
was right, because a stop task was being picked by the scheduler when it should not have been.
And this commit is present in:
$ git tag --contains a1d9a3231eac411
v3.15-rc2
v3.15-rc3
...
v4.1-rc5
So only Trusty is affected.
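To make the loop in the trace above easier to follow, here is a minimal single-threaded model of the lockstep state machine that multi_cpu_stop spins in. It is only a sketch: the enum values mirror the MULTI_STOP_* states (with MULTI_STOP_EXIT being 4, as in the "cmp $0x4,%edx" above), but msdata_model, stopper_model, stopper_step and run_model are hypothetical names, and real stoppers run concurrently on separate CPUs rather than round-robin in one thread.

```c
#include <stdbool.h>

/* Same ordering as the kernel's multi-stop states; MULTI_STOP_EXIT is 4. */
enum multi_stop_state {
    MULTI_STOP_NONE,
    MULTI_STOP_PREPARE,
    MULTI_STOP_DISABLE_IRQ,
    MULTI_STOP_RUN,
    MULTI_STOP_EXIT,
};

/* Hypothetical stand-in for the data shared by all stoppers. */
struct msdata_model {
    int num_threads;              /* stoppers taking part */
    int thread_ack;               /* stoppers that still have to ack the state */
    enum multi_stop_state state;  /* state everyone has to reach */
};

/* Hypothetical per-CPU stopper; "scheduled" is false if it never gets to run. */
struct stopper_model {
    bool scheduled;
    enum multi_stop_state curstate;
};

void set_state(struct msdata_model *m, enum multi_stop_state newstate)
{
    m->thread_ack = m->num_threads;
    m->state = newstate;
}

/* The last stopper to ack moves everybody on to the next state. */
void ack_state(struct msdata_model *m)
{
    if (--m->thread_ack == 0 && m->state < MULTI_STOP_EXIT)
        set_state(m, (enum multi_stop_state)(m->state + 1));
}

/* One pass through the loop body; true once this stopper has seen EXIT. */
bool stopper_step(struct msdata_model *m, struct stopper_model *s)
{
    if (s->scheduled && m->state != s->curstate) {
        s->curstate = m->state;
        ack_state(m);
    }
    return s->curstate == MULTI_STOP_EXIT;
}

/* Returns the number of rounds needed for all stoppers to exit, or -1 if
 * max_rounds passes first (the model's version of a lockup). */
int run_model(struct stopper_model *stoppers, int n, int max_rounds)
{
    struct msdata_model m = { .num_threads = n };

    set_state(&m, MULTI_STOP_PREPARE);
    for (int round = 1; round <= max_rounds; round++) {
        int done = 0;

        for (int i = 0; i < n; i++)
            if (stopper_step(&m, &stoppers[i]))
                done++;
        if (done == n)
            return round;
    }
    return -1;
}
```

In this model, if every stopper gets to run, each state is acknowledged by all participants and everybody reaches MULTI_STOP_EXIT. If one stopper is never scheduled, the shared state stays at MULTI_STOP_PREPARE and the others keep spinning until the round limit, which is similar in spirit to a migration thread being reported as stuck.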
| Rafael David Tinoco (inaddy) wrote : | #3 |
The fix relies on checking whether a stop task has appeared, in which case task selection needs to re-start:
+ if (need_pull_
pull_
+ /*
+ * pull_rt_task() can drop (and re-acquire) rq->lock; this
+ * means a stop task can slip in, in which case we need to
+ * re-start task selection.
+ */
+ if (rq->stop && rq->stop->on_rq)
+ return RETRY_TASK;
And this is done by returning RETRY_TASK. This logic was not available in 3.13 AND I don't want to jeopardise our 3.13 scheduler.
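As a rough illustration of the retry idea described above, the sketch below models it in plain userspace C. It is not the kernel code: task_model, rq_model, RETRY_TASK_MODEL, the pick_* helpers and the stop_arrives_in_pull knob are all hypothetical names made up for this example. The only thing taken from the patch is the idea that a lower-priority class returns a "retry" marker when a stop task has become runnable while the lock was dropped, and that the core loop then re-starts selection from the highest-priority class.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified task and runqueue. */
struct task_model {
    const char *name;
};

struct rq_model {
    struct task_model *stop;    /* this CPU's stop task */
    bool stop_on_rq;            /* is the stop task queued to run? */
    struct task_model *rt;      /* next RT task, if any */
    struct task_model *fair;    /* next fair task, if any */
    bool stop_arrives_in_pull;  /* knob: stop task shows up during the pull */
};

/* Marker meaning "something changed, pick again from the top". */
struct task_model retry_sentinel;
#define RETRY_TASK_MODEL (&retry_sentinel)

struct task_model *pick_stop(struct rq_model *rq)
{
    return (rq->stop && rq->stop_on_rq) ? rq->stop : NULL;
}

struct task_model *pick_rt(struct rq_model *rq)
{
    /* Models the window in which the lock is dropped and re-acquired:
     * a stop task may have been queued in the meantime. */
    if (rq->stop_arrives_in_pull) {
        rq->stop_arrives_in_pull = false;
        rq->stop_on_rq = true;
    }
    if (rq->stop && rq->stop_on_rq)
        return RETRY_TASK_MODEL;
    return rq->rt;
}

struct task_model *pick_fair(struct rq_model *rq)
{
    return rq->fair;
}

/* Walk the classes from highest to lowest priority, re-starting on retry. */
struct task_model *pick_next_task_model(struct rq_model *rq)
{
    struct task_model *(*classes[])(struct rq_model *) = {
        pick_stop, pick_rt, pick_fair,
    };

again:
    for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
        struct task_model *p = classes[i](rq);

        if (p == RETRY_TASK_MODEL)
            goto again;
        if (p)
            return p;
    }
    return NULL;
}
```

When the stop task shows up during the pull, the second pass finds it in the stop class and returns it, instead of handing back an RT task while a stop task is waiting.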
| Rafael David Tinoco (inaddy) wrote : | #4 |
To better understand whether this bug is easy to trigger, I created the following test case:
I've been using a KVM guest emulating a NUMA environment with 32 different domains (1 for each vCPU):
root@numa:~# numactl -H
available: 32 nodes (0-31)
node 0 cpus: 0
node 0 size: 237 MB
node 0 free: 82 MB
node 1 cpus: 1
node 1 size: 251 MB
node 1 free: 15 MB
node 2 cpus: 2
node 2 size: 251 MB
node 2 free: 52 MB
node 3 cpus: 3
node 3 size: 251 MB
node 3 free: 240 MB
node 4 cpus: 4
node 4 size: 251 MB
node 4 free: 15 MB
node 5 cpus: 5
node 5 size: 251 MB
node 5 free: 15 MB
node 6 cpus: 6
node 6 size: 251 MB
node 6 free: 17 MB
node 7 cpus: 7
node 7 size: 251 MB
node 7 free: 15 MB
node 8 cpus: 8
node 8 size: 251 MB
node 8 free: 16 MB
node 9 cpus: 9
node 9 size: 251 MB
node 9 free: 16 MB
node 10 cpus: 10
node 10 size: 251 MB
node 10 free: 15 MB
node 11 cpus: 11
node 11 size: 187 MB
node 11 free: 13 MB
node 12 cpus: 12
node 12 size: 251 MB
node 12 free: 15 MB
node 13 cpus: 13
node 13 size: 251 MB
node 13 free: 17 MB
node 14 cpus: 14
node 14 size: 251 MB
node 14 free: 15 MB
node 15 cpus: 15
node 15 size: 251 MB
node 15 free: 16 MB
node 16 cpus: 16
node 16 size: 251 MB
node 16 free: 17 MB
node 17 cpus: 17
node 17 size: 251 MB
node 17 free: 17 MB
node 18 cpus: 18
node 18 size: 251 MB
node 18 free: 16 MB
node 19 cpus: 19
node 19 size: 251 MB
node 19 free: 15 MB
node 20 cpus: 20
node 20 size: 251 MB
node 20 free: 16 MB
node 21 cpus: 21
node 21 size: 251 MB
node 21 free: 17 MB
node 22 cpus: 22
node 22 size: 251 MB
node 22 free: 51 MB
node 23 cpus: 23
node 23 size: 251 MB
node 23 free: 37 MB
node 24 cpus: 24
node 24 size: 251 MB
node 24 free: 120 MB
node 25 cpus: 25
node 25 size: 251 MB
node 25 free: 115 MB
node 26 cpus: 26
node 26 size: 251 MB
node 26 free: 41 MB
node 27 cpus: 27
node 27 size: 251 MB
node 27 free: 15 MB
node 28 cpus: 28
node 28 size: 251 MB
node 28 free: 15 MB
node 29 cpus: 29
node 29 size: 251 MB
node 29 free: 17 MB
node 30 cpus: 30
node 30 size: 251 MB
node 30 free: 164 MB
node 31 cpus: 31
node 31 size: 251 MB
node 31 free: 228 MB
And stressing the environment (as you can see in the "free" memory of every NUMA node) with a specific tool that allocates a certain amount of memory and "touches" every 32 bytes of it (dirtying it at the end and then restarting the same behavior). Together with that, I'm creating enough kernel tasks, concurrent to these memory allocators, for them to compete for CPU, forcing the memory threads to migrate between CPUs (and between NUMA domains, since every CPU is inside a different NUMA domain).
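The tool itself is not shown in this comment, so the following is only a guess at the general technique: allocate a buffer and write one byte every "stride" bytes, so that the pages get faulted in and dirtied on whichever CPU the worker is running. touch_memory is a hypothetical helper name, and a real stress tool would repeat this in a loop across many worker processes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate `bytes` of memory and write one byte every `stride` bytes.
 * Returns the number of writes performed, or 0 on bad arguments or
 * allocation failure. */
size_t touch_memory(size_t bytes, size_t stride)
{
    if (bytes == 0 || stride == 0)
        return 0;

    unsigned char *buf = malloc(bytes);
    if (!buf)
        return 0;

    /* volatile keeps the compiler from dropping the stores to a buffer
     * that is freed right afterwards. */
    volatile unsigned char *p = buf;
    size_t touched = 0;

    for (size_t off = 0; off < bytes; off += stride) {
        p[off] = 1;
        touched++;
    }

    free(buf);
    return touched;
}
```

A stride of 32 matches the "every 32 bytes" description; a stride equal to the page size would still dirty every page while doing far fewer writes.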
| Rafael David Tinoco (inaddy) wrote : | #5 |
Using ftrace I can confirm that we are frequently triggering the logic responsible for the deadlock, but so far without managing to make the deadlock itself happen.
root@numa:~# trace-cmd record -p function -l numa_migrate_
...
stress-1547 [012] 136.309393: function: numa_migrate_
stress-1547 [012] 136.309394: function: task_numa_migrate
stress-1547 [012] 136.309414: function: migrate_swap
stress-1547 [012] 136.309414: function: stop_two_cpus
stress-1539 [017] 136.309519: function: numa_migrate_
stress-1539 [017] 136.309519: function: task_numa_migrate
stress-1539 [017] 136.309528: function: migrate_swap
stress-1539 [017] 136.309528: function: stop_two_cpus
stress-1563 [006] 136.313389: function: numa_migrate_
stress-1563 [006] 136.313391: function: task_numa_migrate
stress-1428 [004] 136.313415: function: numa_migrate_
stress-1428 [004] 136.313416: function: task_numa_migrate
stress-1428 [004] 136.313434: function: migrate_swap
stress-1428 [004] 136.313434: function: stop_two_cpus
stress-1421 [016] 136.325398: function: numa_migrate_
stress-1464 [025] 136.386219: function: numa_migrate_
stress-1464 [025] 136.386221: function: task_numa_migrate
stress-1464 [025] 136.386240: function: migrate_swap
stress-1464 [025] 136.386241: function: stop_two_cpus
stress-1435 [014] 136.400792: function: numa_migrate_
stress-1435 [014] 136.400793: function: task_numa_migrate
<...>-1513 [023] 136.401345: function: numa_migrate_
stress-1447 [019] 136.410245: function: numa_migrate_
stress-1447 [019] 136.410246: function: task_numa_migrate
stress-1517 [012] 136.413338: function: numa_migrate_
stress-1554 [024] 136.417383: function: numa_migrate_
stress-1554 [024] 136.417384: function: task_numa_migrate
stress-1554 [024] 136.417407: function: migrate_swap
stress-1554 [024] 136.417408: function: stop_two_cpus
<...>-1507 [023] 136.421348: function: numa_migrate_
stress-1500 [018] 136.445321: function: numa_migrate_
stress-1525 [025] 136.473330: function: numa_migrate_
stress-1472 [029] 136.502245: function: numa_migrate_
stress-1472 [029] 136.502247: function: task_numa_migrate
stress-1472 [029] 136.502270: function: migrate_swap
stress-1472 [029] 136.502270: function: stop_two_cpus
stress-1496 [004] 136.569273: function: numa_migrate_
stress-1496 [004] 136.569275: function: task_numa_migrate
...
root@ttwcnuma:~# trace-cmd report | grep stop_two_cpus | wc -l
475
Meaning that I caused a task to be migrated between NUMA domains 475 times in less than 3 seconds.
| Rafael David Tinoco (inaddy) wrote : | #6 |
But unfortunately I could not reproduce the issue (although I know it is in there). I'll write a small piece of logic similar to:
Commit a1d9a3231eac411
Author: Kirill Tkhai <email address hidden>
Date: Thu Apr 10 17:38:36 2014 +0400
sched: Check for stop task appearance when balancing happens
We need to do it like we do for the other higher priority classes..
Signed-off-by: Kirill Tkhai <email address hidden>
Cc: Michael wang <email address hidden>
Cc: Sasha Levin <email address hidden>
Signed-off-by: Peter Zijlstra <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
Where I'll just "bypass" task selection instead of returning RETRY_TASK. Since the 3.13 scheduler does not have the RETRY_TASK logic, it will just be a question of not choosing the stop worker (kthread) to run under the same conditions (since the rest is pretty much the same).
Asking for kernel team review while I work on this.
| Changed in linux (Ubuntu): | |
| status: | In Progress → Invalid |
| Changed in linux (Ubuntu Trusty): | |
| status: | New → In Progress |
| assignee: | nobody → Rafael David Tinoco (inaddy) |
| Changed in linux (Ubuntu): | |
| assignee: | Rafael David Tinoco (inaddy) → nobody |
| Changed in linux (Ubuntu Trusty): | |
| importance: | Undecided → Medium |
| Rafael David Tinoco (inaddy) wrote : | #7 |
Just got an update from Peter:
https:/
asking for feedback on a patch:
Subject: stop_machine: Fix deadlock between multiple stop_two_cpus()
From: Peter Zijlstra <email address hidden>
Date: Fri, 5 Jun 2015 17:30:23 +0200
Will try to test the latest builds + this patch with the NUMA migration test. Unfortunately it is REALLY hard to reproduce the issue, so I cannot know whether the patch fixed anything; I can only test whether it looks good or not.
| Rafael David Tinoco (inaddy) wrote : | #8 |
I've been running the NUMA tests on 3.13 for some time now and it looks like the change did not introduce any regression...
$ uname -a
Linux sf00079894trusty 3.13.11-
I'm using a virtualized 16 Domains / 16 CPUs NUMA environment with the stress test tool:
$ sudo numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0
node 0 size: 363 MB
node 0 free: 23 MB
node 1 cpus: 1
node 1 size: 121 MB
node 1 free: 7 MB
node 2 cpus: 2
node 2 size: 377 MB
node 2 free: 23 MB
node 3 cpus: 3
node 3 size: 377 MB
node 3 free: 23 MB
node 4 cpus: 4
node 4 size: 377 MB
node 4 free: 23 MB
node 5 cpus: 5
node 5 size: 377 MB
node 5 free: 23 MB
node 6 cpus: 6
node 6 size: 377 MB
node 6 free: 35 MB
node 7 cpus: 7
node 7 size: 313 MB
node 7 free: 19 MB
node 8 cpus: 8
node 8 size: 377 MB
node 8 free: 61 MB
node 9 cpus: 9
node 9 size: 377 MB
node 9 free: 57 MB
node 10 cpus: 10
node 10 size: 377 MB
node 10 free: 63 MB
node 11 cpus: 11
node 11 size: 377 MB
node 11 free: 30 MB
node 12 cpus: 12
node 12 size: 377 MB
node 12 free: 67 MB
node 13 cpus: 13
node 13 size: 377 MB
node 13 free: 68 MB
node 14 cpus: 14
node 14 size: 377 MB
node 14 free: 68 MB
node 15 cpus: 15
node 15 size: 377 MB
node 15 free: 64 MB
$ sudo stress --vm 16 --vm-bytes 314572800 --vm-stride 1 --vm-keep &
Causing memory allocations of around 300MB on each node and "touching" every byte of the allocation (causing all the pages to be "hot" on the CPU running).
And generating concurrency:
$ sudo stress --cpu 16 &
So the kernel scheduler has to migrate tasks, exercising the fixed logic. I can confirm the logic is being triggered by using ftrace:
$ sudo trace-cmd record -p function -l numa_migrate_
$ sudo trace-cmd report | grep stop_two_cpus | wc -l
162
And can't find any regression.
I'll let the tests run a bit more and will suggest the fix to our kernel team to merge it as a Stable Release Update for Trusty, Utopic and Vivid.
| Changed in linux (Ubuntu Vivid): | |
| status: | New → In Progress |
| assignee: | nobody → Rafael David Tinoco (inaddy) |
| description: | updated |
| description: | updated |
| description: | updated |
| Changed in linux (Ubuntu): | |
| status: | Invalid → Fix Committed |
| Changed in linux (Ubuntu Trusty): | |
| status: | In Progress → Fix Committed |
| Changed in linux (Ubuntu Vivid): | |
| status: | In Progress → Fix Committed |
| Launchpad Janitor (janitor) wrote : | #9 |
This bug was fixed in the package linux - 4.1.0-3.3
---------------
linux (4.1.0-3.3) wily; urgency=low
[ Andy Whitcroft ]
* Release Tracking Bug
- LP: #1478897
[ Colin Ian King ]
* SAUCE: KEYS: ensure we free the assoc array edit if edit is valid
- CVE-2015-1333
[ Seth Forshee ]
* SAUCE: overlayfs: Enable user namespace mounts for the "overlay" fstype
- LP: #1478578
[ Upstream Kernel Changes ]
* sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
- LP: #1461620
* x86/nmi: Enable nested do_nmi() handling for 64-bit kernels
* x86/nmi/64: Remove asm code that saves cr2
* x86/nmi/64: Switch stacks on userspace NMI entry
* x86/nmi/64: Reorder nested NMI checks
* x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI
detection
-- Andy Whitcroft <email address hidden> Tue, 28 Jul 2015 11:59:03 +0100
| Changed in linux (Ubuntu): | |
| status: | Fix Committed → Fix Released |
| Brad Figg (brad-figg) wrote : | #10 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-trusty |
| Brad Figg (brad-figg) wrote : | #11 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-vivid |
| Rafael David Tinoco (inaddy) wrote : | #12 |
Started verifying the fix; will provide results soon.
| Rafael David Tinoco (inaddy) wrote : | #13 |
Trusty verification:
inaddy@
Linux sf00079894trusty 3.13.0-62-generic #101-Ubuntu SMP Thu Jul 30 09:01:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
inaddy@
74
In 5 seconds the logic was executed 74 times. I kept it running for quite some time and it does not look like there is a regression. Marking this as verification-
| tags: |
added: verification-done-trusty removed: verification-needed-trusty |
| Rafael David Tinoco (inaddy) wrote : | #14 |
Vivid verification:
inaddy@
Linux sf00079894vivid 3.19.0-26-generic #27-Ubuntu SMP Tue Jul 28 18:27:31 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
inaddy@
46
In 5 seconds the logic was executed 46 times. I kept it running for quite some time and it does not look like there is a regression. Marking this as verification-
Thank you
| tags: |
added: verification-done removed: verification-done-trusty verification-needed-vivid |
| tags: | added: sts |
| tags: | added: cts |
| Launchpad Janitor (janitor) wrote : | #15 |
This bug was fixed in the package linux - 3.19.0-26.28
---------------
linux (3.19.0-26.28) vivid; urgency=low
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1483630
[ Upstream Kernel Changes ]
* Revert "Bluetooth: ath3k: Add support of 04ca:300d AR3012 device"
linux (3.19.0-26.27) vivid; urgency=low
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1479055
* [Config] updateconfigs for 3.19.8-ckt4 stable update
[ Chris J Arges ]
* [Config] Add MTD_POWERNV_FLASH and OPAL_PRD
- LP: #1464560
[ Mika Kuoppala ]
* SAUCE: i915_bpo: drm/i915: Fix divide by zero on watermark update
- LP: #1473175
[ Tim Gardner ]
* [Config] ACORN_PARTITION=n
- LP: #1453117
* [Config] Add i40e[vf] to d-i
- LP: #1476393
[ Timo Aaltonen ]
* SAUCE: i915_bpo: Rebase to v4.2-rc3
- LP: #1473175
* SAUCE: i915_bpo: Revert "mm/fault, drm/i915: Use pagefault_
to check for disabled pagefaults"
- LP: #1473175
* SAUCE: i915_bpo: Revert "drm: i915: Port to new backlight interface
selection API"
- LP: #1473175
[ Upstream Kernel Changes ]
* Revert "tools/vm: fix page-flags build"
- LP: #1473547
* Revert "ALSA: hda - Add mute-LED mode control to Thinkpad"
- LP: #1473547
* Revert "drm/radeon: adjust pll when audio is not enabled"
- LP: #1473547
* Revert "crypto: talitos - convert to use be16_add_cpu()"
- LP: #1479048
* module: Call module notifier on failure after complete_
- LP: #1473547
* gpio: gpio-kempld: Fix get_direction return value
- LP: #1473547
* ARM: dts: imx27: only map 4 Kbyte for fec registers
- LP: #1473547
* ARM: 8356/1: mm: handle non-pmd-aligned end of RAM
- LP: #1473547
* x86/mce: Fix MCE severity messages
- LP: #1473547
* mac80211: don't use napi_gro_receive() outside NAPI context
- LP: #1473547
* iwlwifi: mvm: Free fw_status after use to avoid memory leak
- LP: #1473547
* iwlwifi: mvm: clean net-detect info if device was reset during suspend
- LP: #1473547
* drm/plane-helper: Adapt cursor hack to transitional helpers
- LP: #1473547
* ARM: dts: set display clock correctly for exynos4412-trats2
- LP: #1473547
* hwmon: (ntc_thermistor) Ensure iio channel is of type IIO_VOLTAGE
- LP: #1473547
* mfd: da9052: Fix broken regulator probe
- LP: #1473547
* ALSA: hda - Fix noise on AMD radeon 290x controller
- LP: #1473547
* lguest: fix out-by-one error in address checking.
- LP: #1473547
* xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
- LP: #1473547
* xfs: xfs_iozero can return positive errno
- LP: #1473547
* fs, omfs: add NULL terminator in the end up the token list
- LP: #1473547
* omfs: fix sign confusion for bitmap loop counter
- LP: #1473547
* d_walk() might skip too much
- LP: #1473547
* dm: fix casting bug in dm_merge_bvec()
- LP: #1473547
* hwmon: (nct6775) Add missing sysfs attribute initialization
- LP: #1473547
* hwmon: (nct6683) Add missing sysfs attribute initialization
- LP: #1473547
* target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST
- LP: #1473547
* net...
| Changed in linux (Ubuntu Vivid): | |
| status: | Fix Committed → Fix Released |
| Rafael David Tinoco (inaddy) wrote : | #16 |
inaddy@mylinux ~/Work/
Ubuntu-3.13.0-60.99
Ubuntu-
Ubuntu-
Ubuntu-
Ubuntu-
Ubuntu-
linux-
linux-
linux-
This is already fixed. Updating case status.
| Changed in linux (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |


You can follow my comments in LKML:
https://lkml.org/lkml/2015/3/6/484
"""
Basically in kernel 3.13 we are getting the following situation:
I have a core dump locked on the same place
(state machine for powering cpu down for the task swap) from a 3.13 (+
upstream patches) and this commit wasn't backported yet.
-> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
In my case, curstate is WAY different from enum containing MULTI_STOP_EXIT (4).
Register totally messed up (probably after cpu_relax(), right where
you were trapped -> after the pause instruction).
my case:
PID: 118 TASK: ffff883fd28ec7d0 CPU: 9 COMMAND: "migration/9"
...
[exception RIP: multi_cpu_stop+0x64]
RIP: ffffffff810f5944 RSP: ffff883fd2907d98 RFLAGS: 00000246
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
RDX: ffff883fd2907d98 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffffffff810f5944 R8: ffffffff810f5944 R9: 0000000000000000
R10: ffff883fd2907d98 R11: 0000000000000246 R12: ffffffffffffffff
R13: ffff883f55d01b48 R14: 0000000000000000 R15: 0000000000000001
ORIG_RAX: 0000000000000001 CS: 0010 SS: 0000
--- <NMI exception stack> ---
#4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
208 } while (curstate != MULTI_STOP_EXIT);
---> RIP
RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
---> CHECKING FOR MULTI_STOP_EXIT
RDX: ffff883fd2907d98 -> does not make any sense
###
If I'm reading this right,
"""
CPU 05 - PID 14990
do_numa_page
task_numa_fault
numa_migrate_preferred
task_numa_migrate
migrate_swap (curr: 14990, task: 14996)
stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
wait_for_completion
14990 - CPU05
14996 - CPU00
stop_two_cpus:
smp_call_function_single (ran on lowest CPU, 00 for this case)
irq_cpu_stop_queue_work
cpu_stop_queue_work(cpu1=05(14996)) # add work
cpu_stop_queue_work(cpu2=00(14990)) # add work
multi_stop_data (msdata->state = MULTI_STOP_PREPARE)
smp_call_function_single (min=cpu2=00, irq_cpu_stop_queue_work, wait=1)
(multi_cpu_stop) to cpu 05 cpu_stopper queue
(multi_cpu_stop) to cpu 00 cpu_stopper queue
wait_for_completion() --> HERE
"""
in my case, checking task structs for tasks scheduled when "waiting_for_completion()":
PID 14990 CPU 05 -> PID 14996 CPU 00
PID 14991 CPU 30 -> PID 14998 CPU 01
PID 14992 CPU 30 -> PID 14998 CPU 01
PID 14996 CPU 00 -> PID 14992 CPU 30
PID 14998 CPU 01 -> PID 14990 CPU 05
AND
> 102 2 6 ffff881fd2ea97f0 RU 0.0 0 0 [migration/6]
> 118 2 9 ffff883fd28ec7d0 RU 0.0 0 0 [migration/9]
> 143 2 14 ffff883fd29d47d0 RU 0.0 0 0 [migration/14]
> 148 2 15 ffff883fd29fc7d0 RU 0.0 0 0 [migration/15]
> 153 2 16 ffff881fd2f517f0 RU 0.0 0 0 [migration/16]
THEN
I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop just
scheduled (probably in the per cpu's queue of cpus 0,1,5,30), not
running yet.
AND
I don't have any "wait_for_completion" for those "OLDER" migration
threads (6, 9, 14, 15 and 16)
Probably wait_for_completion s...