smp_call_function_single/many core hangs with stop4 alone
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | Critical | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Critical | Unassigned |
Bionic | Fix Released | Critical | Unassigned |
Bug Description
== SRU Justification ==
IBM reports that this bug occurs with stop4 alone and results in soft lockups and RCU stalls.
This is a kernel synchronization issue leading to a deadlock.
This bug was introduced by commit 7bc54b652f13 in v4.8-rc1. This
regression is fixed by mainline commit c0f7f5b6c6910.
== Fix ==
c0f7f5b6c6910 ("cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer interrupt")
== Regression Potential ==
Low. Fixes a current regression. The fix was Cc'd to upstream stable, so it has had
additional upstream review.
== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
We recently discovered that this bug occurs with stop4 alone, resulting in soft lockups and RCU stalls.
```
root@ltc-
[15523.619508] systemd[1]: systemd-
[15523.619769] systemd[1]: Failed to start Journal Service.
[15523.620618] systemd[1]: systemd-
[15523.620774] systemd[1]: systemd-
[15523.621462] systemd[1]: Stopped Journal Service.
[15523.621635] systemd[1]: systemd-
[15523.621756] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[15523.621888] systemd[1]: systemd-
[15523.622029] systemd[1]: This usually indica[
[15541.629958] 60-....: (2 GPs behind) idle=146/
[15541.630046] (t=2415546 jiffies g=184827 c=184826 q=57111)
[15541.630101] NMI backtrace for cpu 60
[15541.630135] CPU: 60 PID: 4810 Comm: tlbie_test Tainted: G L 4.15.0-15-generic #16-Ubuntu
[15541.630207] Call Trace:
[15541.630232] [c000201a1da96b00] [c000000000ceb35c] dump_stack+
[15541.630298] [c000201a1da96b40] [c000000000cf4d48] nmi_cpu_
[15541.630363] [c000201a1da96bd0] [c000000000cf4ee8] nmi_trigger_
[15541.630429] [c000201a1da96c60] [c00000000002f2d8] arch_trigger_
[15541.630495] [c000201a1da96c80] [c0000000001a913c] rcu_dump_
[15541.630560] [c000201a1da96cd0] [c0000000001a81e8] rcu_check_
[15541.630625] [c000201a1da96e00] [c0000000001b64a8] update_
[15541.630689] [c000201a1da96e30] [c0000000001ce1f4] tick_sched_
[15541.630753] [c000201a1da96e60] [c0000000001ce2f0] tick_sched_
[15541.630818] [c000201a1da96ea0] [c0000000001b7054] __hrtimer_
[15541.630883] [c000201a1da96f20] [c0000000001b7fac] hrtimer_
[15541.630948] [c000201a1da96ff0] [c0000000000248f0] __timer_
[15541.631013] [c000201a1da97040] [c000000000024d08] timer_interrupt
[15541.631069] [c000201a1da97070] [c000000000009014] decrementer_
[15541.631135] --- interrupt: 901 at smp_call_
[15541.631135] LR = smp_call_
[15541.631230] [c000201a1da973d0] [c0000000001d55e0] smp_call_
[15541.631294] [c000201a1da97430] [c000000000acd3e8] gpstate_
[15541.631359] [c000201a1da974e0] [c0000000001b46b0] call_timer_
[15541.631433] [c000201a1da97560] [c0000000001b4958] expire_
[15541.631488] [c000201a1da975d0] [c0000000001b4bf8] run_timer_
[15541.631553] [c000201a1da97670] [c000000000d0d6c8] __do_softirq+
[15541.631608] [c000201a1da97750] [c000000000114be8] irq_exit+0xe8/0x120
[15541.631663] [c000201a1da97770] [c000000000024d0c] timer_interrupt
[15541.631718] [c000201a1da977a0] [c000000000009014] decrementer_
[15541.631784] --- interrupt: 901 at smp_call_
[15541.631784] LR = smp_call_
[15541.631879] [c000201a1da97b00] [c000000000075f18] pmdp_invalidate
[15541.631935] [c000201a1da97b30] [c0000000003a1120] change_
[15541.632000] [c000201a1da97ba0] [c000000000349278] change_
[15541.632065] [c000201a1da97cf0] [c0000000003496c0] mprotect_
[15541.632129] [c000201a1da97db0] [c000000000349a74] SyS_mprotect+
[15541.632185] [c000201a1da97e30] [c00000000000b184] system_
[15579.001651] watchdog: BUG: soft lockup - CPU#52 stuck for 23s! [grep:69263]
[15579.001738] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_
[15579.002363] i2c_algo_bit hid_generic ttm drm_kms_helper mpt3sas syscopyarea sysfillrect usbhid sysimgblt fb_sys_fops hid raid_class crct10dif_vpmsum crc32c_vpmsum drm i40e aacraid scsi_transport_sas
[15579.002524] CPU: 52 PID: 69263 Comm: grep Tainted: G L 4.15.0-15-generic #16-Ubuntu
[15579.002598] NIP: c0000000001d5368 LR: c0000000001d5340 CTR: c000000000acc7f0
[15579.002664] REGS: c000003e84eff7e0 TRAP: 0901 Tainted: G L (4.15.0-15-generic)
[15579.002735] MSR: 9000000000009033 <SF,HV,
[15579.002810] CFAR: c01721ed8
[15579.002810] GPR08: c000000001721ed8 0000000000000001 c009e006592e0960 0000000000000000
[15579.002810] GPR12: c000000000acc7f0 c00000000faa3c00
[15579.003084] NIP [c0000000001d5368] smp_call_
[15579.003139] LR [c0000000001d5340] smp_call_
[15579.003191] Call Trace:
[15579.003217] [c000003e84effa60] [c0000000001d5340] smp_call_
[15579.003298] [c000003e84effad0] [c0000000001d55e0] smp_call_
[15579.003381] [c000003e84effb30] [c000000000acc840] powernv_
[15579.003447] [c000003e84effb60] [c000000000ac2b8c] __cpufreq_
[15579.003503] [c000003e84effba0] [c000000000ac2d18] cpufreq_
[15579.003560] [c000003e84effbe0] [c00000000009da50] pnv_get_
[15579.003625] [c000003e84effc00] [c0000000000283bc] show_cpuinfo+
[15579.003680] [c000003e84effca0] [c00000000040c738] seq_read+
[15579.003737] [c000003e84effd40] [c00000000047fa38] proc_reg_
[15579.003794] [c000003e84effd70] [c0000000003d293c] __vfs_read+
[15579.003849] [c000003e84effd90] [c0000000003d2a2c] vfs_read+0xbc/0x1b0
[15579.003905] [c000003e84effde0] [c0000000003d3028] SyS_read+0x68/0x110
[15579.003962] [c000003e84effe30] [c00000000000b184] system_
[15579.004016] Instruction dump:
[15579.004051] 7fe4fb78 4bfffd4d 813f0018 71290001 4182002c 48000014 60000000 60000000
[15579.004121] 60000000 60420000 7c210b78 7c421378 <813f0018> 71290001 4082fff0 7c2004ac
[15604.648202] INFO: rcu_sched self-detected stall on CPU
[15604.648260] 60-....: (2 GPs behind) idle=146/
[15604.648332] (t=2431300 jiffies g=184827 c=184826 q=57308)
[15604.648385] NMI backtrace for cpu 60
[15604.648419] CPU: 60 PID: 4810 Comm: tlbie_test Tainted: G L 4.15.0-15-generic #16-Ubuntu
[15604.648491] Call Trace:
[15604.648515] [c000201a1da96b00] [c000000000ceb35c] dump_stack+
[15604.648581] [c000201a1da96b40] [c000000000cf4d48] nmi_cpu_
[15604.648647] [c000201a1da96bd0] [c000000000cf4ee8] nmi_trigger_
[15604.648728] [c000201a1da96c60] [c00000000002f2d8] arch_trigger_
[15604.648793] [c000201a1da96c80] [c0000000001a913c] rcu_dump_
[15604.648858] [c000201a1da96cd0] [c0000000001a81e8] rcu_check_
[15604.648924] [c000201a1da96e00] [c0000000001b64a8] update_
[15604.648988] [c000201a1da96e30] [c0000000001ce1f4] tick_sched_
[15604.649052] [c000201a1da96e60] [c0000000001ce2f0] tick_sched_
[15604.649118] [c000201a1da96ea0] [c0000000001b7054] __hrtimer_
[15604.649183] [c000201a1da96f20] [c0000000001b7fac] hrtimer_
[15604.649248] [c000201a1da96ff0] [c0000000000248f0] __timer_
[15604.649313] [c000201a1da97040] [c000000000024d08] timer_interrupt
[15604.649369] [c000201a1da97070] [c000000000009014] decrementer_
[15604.649435] --- interrupt: 901 at smp_call_
[15604.649435] LR = smp_call_
[15604.649530] [c000201a1da973d0] [c0000000001d55e0] smp_call_
[15604.649595] [c000201a1da97430] [c000000000acd3e8] gpstate_
[15604.649660] [c000201a1da974e0] [c0000000001b46b0] call_timer_
[15604.649715] [c000201a1da97560] [c0000000001b4958] expire_
[15604.649770] [c000201a1da975d0] [c0000000001b4bf8] run_timer_
[15604.649835] [c000201a1da97670] [c000000000d0d6c8] __do_softirq+
[15604.649891] [c000201a1da97750] [c000000000114be8] irq_exit+0xe8/0x120
[15604.649946] [c000201a1da97770] [c000000000024d0c] timer_interrupt
[15604.650002] [c000201a1da977a0] [c000000000009014] decrementer_
[15604.650084] --- interrupt: 901 at smp_call_
[15604.650084] LR = smp_call_
[15604.650179] [c000201a1da97b00] [c000000000075f18] pmdp_invalidate
[15604.650235] [c000201a1da97b30] [c0000000003a1120] change_
[15604.650301] [c000201a1da97ba0] [c000000000349278] change_
[15604.650366] [c000201a1da97cf0] [c0000000003496c0] mprotect_
[15604.650430] [c000201a1da97db0] [c000000000349a74] SyS_mprotect+
[15604.650486] [c000201a1da97e30] [c00000000000b184] system_
[15667.666494] INFO: rcu_sched self-detected stall on CPU
[15667.666550] 60-....: (2 GPs behind) idle=146/
[15667.666622] (t=2447054 jiffies g=184827 c=184826 q=57457)
[15667.666675] NMI backtrace for cpu 60
[15667.666709] CPU: 60 PID: 4810 Comm: tlbie_test Tainted: G L 4.15.0-15-generic #16-Ubuntu
[15667.666781] Call Trace:
[15667.666805] [c000201a1da96b00] [c000000000ceb35c] dump_stack+
[15667.666871] [c000201a1da96b40] [c000000000cf4d48] nmi_cpu_
[15667.666937] [c000201a1da96bd0] [c000000000cf4ee8] nmi_trigger_
[15667.667002] [c000201a1da96c60] [c00000000002f2d8] arch_trigger_
[15667.667086] [c000201a1da96c80] [c0000000001a913c] rcu_dump_
[15667.667151] [c000201a1da96cd0] [c0000000001a81e8] rcu_check_
[15667.667216] [c000201a1da96e00] [c0000000001b64a8] update_
[15667.667280] [c000201a1da96e30] [c0000000001ce1f4] tick_sched_
[15667.667344] [c000201a1da96e60] [c0000000001ce2f0] tick_sched_
[15667.667409] [c000201a1da96ea0] [c0000000001b7054] __hrtimer_
[15667.667474] [c000201a1da96f20] [c0000000001b7fac] hrtimer_
[15667.667539] [c000201a1da96ff0] [c0000000000248f0] __timer_
[15667.667604] [c000201a1da97040] [c000000000024d08] timer_interrupt
[15667.667660] [c000201a1da97070] [c000000000009014] decrementer_
[15667.667727] --- interrupt: 901 at smp_call_
[15667.667727] LR = smp_call_
[15667.667821] [c000201a1da973d0] [c0000000001d55e0] smp_call_
[15667.667886] [c000201a1da97430] [c000000000acd3e8] gpstate_
[15667.667951] [c000201a1da974e0] [c0000000001b46b0] call_timer_
[15667.668006] [c000201a1da97560] [c0000000001b4958] expire_
[15667.668061] [c000201a1da975d0] [c0000000001b4bf8] run_timer_
[15667.668126] [c000201a1da97670] [c000000000d0d6c8] __do_softirq+
[15667.668181] [c000201a1da97750] [c000000000114be8] irq_exit+0xe8/0x120
[15667.668236] [c000201a1da97770] [c000000000024d0c] timer_interrupt
[15667.668292] [c000201a1da977a0] [c000000000009014] decrementer_
[15667.668358] --- interrupt: 901 at smp_call_
[15667.668358] LR = smp_call_
[15667.668469] [c000201a1da97b00] [c000000000075f18] pmdp_invalidate
[15667.668524] [c000201a1da97b30] [c0000000003a1120] change_
[15667.668589] [c000201a1da97ba0] [c000000000349278] change_
[15667.668654] [c000201a1da97cf0] [c0000000003496c0] mprotect_
[15667.668719] [c000201a1da97db0] [c000000000349a74] SyS_mprotect+
[15667.668775] [c000201a1da97e30] [c00000000000b184] system_
```
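For triaging a long capture like the one above, a small script can tally which CPUs report soft lockups and RCU stalls. The following is a minimal sketch, assuming the message formats match the lines visible in this log (`watchdog: BUG: soft lockup - CPU#N`, `INFO: rcu_sched self-detected stall on CPU`, `NMI backtrace for cpu N`). The function name `summarize` is hypothetical and not part of any kernel tooling.

```python
import re
from collections import Counter

# Patterns modeled on the message formats visible in the log above (assumption:
# other kernel versions may word these messages differently).
SOFT_LOCKUP = re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")
RCU_STALL = re.compile(r"INFO: rcu_sched self-detected stall on CPU")
NMI_CPU = re.compile(r"NMI backtrace for cpu (\d+)")


def summarize(lines):
    """Count soft-lockup and RCU-stall reports per CPU in an iterable of log lines."""
    soft = Counter()
    stalls = Counter()
    pending_stall = False  # True after a stall header, until its CPU is identified
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            soft[int(m.group(1))] += 1
            continue
        if RCU_STALL.search(line):
            pending_stall = True
            continue
        m = NMI_CPU.search(line)
        if m and pending_stall:
            # Attribute the stall to the CPU named in the backtrace that follows it.
            stalls[int(m.group(1))] += 1
            pending_stall = False
    return {"soft_lockups": dict(soft), "rcu_stalls": dict(stalls)}
```

Passing an open file object (for example a saved `dmesg` capture) to `summarize` returns two dictionaries keyed by CPU number. A stall whose header line was truncated in the capture is not counted, because the sketch only attributes a backtrace to a stall when the header precedes it.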
Per feedback from Vaidy, this does not currently appear to be a firmware problem. It seems to be a kernel synchronization issue leading to a deadlock.
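To illustrate what a synchronization deadlock of this kind looks like in general, here is a toy model, not kernel code. A synchronous cross-CPU call makes the caller spin until the target has run the requested function; if the target is itself spinning on a call that depends on the caller, neither makes progress. The log does not establish the exact wait chain on the affected machine, so the CPU names and the `waits_on` mapping below are hypothetical.

```python
def find_wait_cycle(waits_on):
    """Return a list of CPUs forming a wait cycle, or None if there is none.

    waits_on maps each CPU to the single CPU it is synchronously waiting for.
    """
    for start in waits_on:
        seen = []
        cpu = start
        # Follow the chain of waits until it ends or revisits a CPU.
        while cpu in waits_on and cpu not in seen:
            seen.append(cpu)
            cpu = waits_on[cpu]
        if cpu in seen:
            # The chain looped back: everything from that CPU onward is the cycle.
            return seen[seen.index(cpu):]
    return None


# Hypothetical scenario: each CPU waits for the other to acknowledge a call.
deadlocked = {"cpu_a": "cpu_b", "cpu_b": "cpu_a"}
# Hypothetical scenario: cpu_b is not waiting on anyone, so cpu_a's call completes.
healthy = {"cpu_a": "cpu_b"}
```

Calling `find_wait_cycle(deadlocked)` returns the two CPUs in the cycle, while `find_wait_cycle(healthy)` returns `None`. Removing one edge from the cycle, for instance by no longer waiting synchronously in a context that cannot safely wait, is the general way such a deadlock is broken.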
-------
Fix identified by Shilpa as per Nick Piggin's recommendation. Kernel fix is currently being tested.
-------
Fix upstream in 4.17-rc3
https:/
cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer interrupt
Posted to stable as well.
Mirroring to Launchpad for Canonical to pull in commit.
tags: added: architecture-ppc64le bugnameltc-166937 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in linux (Ubuntu):
importance: Undecided → Critical
status: New → Triaged
Changed in linux (Ubuntu Bionic):
importance: Undecided → Critical
status: New → Triaged
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
------- Comment From <email address hidden> 2018-05-05 10:31 EDT-------
Yesterday, at Padma's daily KVM meeting, the decision was made to track only System Firmware Mustfix issues with the LC GA1 Mustfix label, since that is all that applies to the Supermicro team. The OS Kernel/KVM issues will be managed in a spreadsheet tracked by the KVM team and in the internal Slack channel. Removing the Mustfix label.