Ubuntu18.04:pKVM - Host in hung state and out of network after few hours of stress run on all guests
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
The Ubuntu-power-systems project |
Fix Released
|
Critical
|
Canonical Kernel Team | ||
linux (Ubuntu) |
Fix Released
|
Critical
|
Canonical Kernel Team | ||
Artful |
Won't Fix
|
High
|
Canonical Kernel Team | ||
Bionic |
Fix Released
|
Critical
|
Canonical Kernel Team |
Bug Description
== Comment: #0 - INDIRA P. JOGA <email address hidden> - 2018-02-11 12:37:25 ==
Problem Description:
===================
After few hours of run system is in hung state with, "rcu_sched detected stalls on CPUs/tasks" messages on the host IPMI console and host is out of network .
Steps to re-create:
==================
> Installed Ubuntu1804 on boslcp3 host.
root@boslcp3:~# uname -a
Linux boslcp3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# uname -r
4.13.0-25-generic
> root@boslcp3:~# ppc64_cpu --smt
SMT is off
> Hugepage set up
echo 8500 > /proc/sys/
> Defined the guests from host machine
Id Name State
-------
2 boslcp3g2 shut off
3 boslcp3g3 shut off
4 boslcp3g4 shut off
6 boslcp3g1 shut off
7 boslcp3g5 shut off
> Started and installed ubuntu1804 daily build on all the guests.
root@boslcp3:~# virsh list --all
Id Name State
-------
2 boslcp3g2 running
3 boslcp3g3 running
4 boslcp3g4 running
6 boslcp3g1 running
7 boslcp3g5 running
> Started regression run (IO_BASE_TCP_NFS) tests on all 5 guests.
NOTE: Removed madvise test case from BASE focus areas.
> Run went fine for few hours on all guests.
> After few hours of run ,Host system is in hung state and host console dumps CPU stall messages as below
[SOL Session operational. Use ~? for help]
[250867.133429] INFO: rcu_sched detected stalls on CPUs/tasks:
[250867.133499] (detected by 86, t=62711832 jiffies, g=497, c=496, q=31987857)
[250867.133554] All QSes seen, last rcu_sched kthread activity 62711828 (4357609080-
[250867.133690] rcu_sched kthread starved for 62711828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
[250931.133433] INFO: rcu_sched detected stalls on CPUs/tasks:
[250931.133494] (detected by 3, t=62727832 jiffies, g=497, c=496, q=31995625)
[250931.133572] All QSes seen, last rcu_sched kthread activity 62727828 (4357625080-
[250931.133741] rcu_sched kthread starved for 62727828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
[250995.133432] INFO: rcu_sched detected stalls on CPUs/tasks:
[250995.133480] (detected by 54, t=62743832 jiffies, g=497, c=496, q=32004479)
[250995.133526] All QSes seen, last rcu_sched kthread activity 62743828 (4357641080-
[250995.133645] rcu_sched kthread starved for 62743828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
> Not able to get the prompt
> Ping /shh to boslcp3 also fails
[ipjoga@kte ~]$ ping boslcp3
PING boslcp3.
From kte.isst.
From kte.isst.
From kte.isst.
[ipjoga@kte ~]$ ssh root@boslcp3
ssh: connect to host boslcp3 port 22: No route to host
> boslcp3 is not reachable
> Attached boslcp3 host console logs
== Comment: #1 - INDIRA P. JOGA <email address hidden> - 2018-02-11 12:39:29 ==
Added Host console logs
== Comment: #24 - VIPIN K. PARASHAR <email address hidden> - 2018-02-16 05:46:13 ==
From Linux logs
===========
[72072.290071] watchdog: BUG: soft lockup - CPU#132 stuck for 22s! [CPU 12/KVM:15579]
[72072.290218] CPU: 132 PID: 15579 Comm: CPU 12/KVM Tainted: G W L 4.13.0-32-generic #35-Ubuntu
[72072.290220] task: c000200debf82e00 task.stack: c000200e140f8000
[72072.290221] NIP: c000000000c779e0 LR: c0080000166893a0 CTR: c000000000c77980
[72072.290223] REGS: c000200e140fb790 TRAP: 0901 Tainted: G W L (4.13.0-32-generic)
[72072.290224] MSR: 900000000280b033 <SF,HV,
[72072.290235] CR: 28024224 XER: 00000000
[72072.290236] CFAR: c000000000c779fc SOFTE: 1
[72072.290256] NIP [c000000000c779e0] _raw_spin_
[72072.290261] LR [c0080000166893a0] kvmppc_
[72072.290262] Call Trace:
[72072.290263] [c000200e140fba10] [00000000ffffffff] 0xffffffff (unreliable)
[72072.290270] [c000200e140fba40] [c008000016689334] kvmppc_
[72072.290276] [c000200e140fbaa0] [c00800001668ab0c] kvmppc_
[72072.290287] [c000200e140fbb10] [c0080000162559cc] kvmppc_
[72072.290299] [c000200e140fbb30] [c008000016251a80] kvm_arch_
[72072.290309] [c000200e140fbbd0] [c008000016245018] kvm_vcpu_
[72072.290313] [c000200e140fbd40] [c0000000003c1a64] do_vfs_
[72072.290316] [c000200e140fbde0] [c0000000003c2454] SyS_ioctl+
[72072.290320] [c000200e140fbe30] [c00000000000b184] system_
[72072.290321] Instruction dump:
[72072.290323] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 409e001c 38210030 ebe1fff8
[72072.290331] 4e800020 60000000 60000000 60420000 <7c210b78> e92d0000 89290009 71290002
[72084.110070] watchdog: BUG: soft lockup - CPU#80 stuck for 23s! [CPU 11/KVM:15578]
[72084.110223] CPU: 80 PID: 15578 Comm: CPU 11/KVM Tainted: G W L 4.13.0-32-generic #35-Ubuntu
[72084.110225] task: c000200debfa3f00 task.stack: c000200e14080000
[72084.110226] NIP: c000000000c779e4 LR: c0080000166893a0 CTR: c000000000c77980
[72084.110229] REGS: c000200e14083790 TRAP: 0901 Tainted: G W L (4.13.0-32-generic)
[72084.110230] MSR: 900000000280b033 <SF,HV,
[72084.110240] CR: 28024824 XER: 00000000
[72084.110241] CFAR: c000000000c779fc SOFTE: 1
[72084.110260] NIP [c000000000c779e4] _raw_spin_
[72084.110266] LR [c0080000166893a0] kvmppc_
[72084.110267] Call Trace:
[72084.110269] [c000200e14083a10] [00000000ffffffff] 0xffffffff (unreliable)
[72084.110275] [c000200e14083a40] [c008000016689334] kvmppc_
[72084.110280] [c000200e14083aa0] [c00800001668ab0c] kvmppc_
[72084.110292] [c000200e14083b10] [c0080000162559cc] kvmppc_
[72084.110303] [c000200e14083b30] [c008000016251a80] kvm_arch_
[72084.110314] [c000200e14083bd0] [c008000016245018] kvm_vcpu_
[72084.110318] [c000200e14083d40] [c0000000003c1a64] do_vfs_
[72084.110321] [c000200e14083de0] [c0000000003c2454] SyS_ioctl+
[72084.110325] [c000200e14083e30] [c00000000000b184] system_
[72084.110326] Instruction dump:
[72084.110328] 7d40192d 40c2fff0 7c2004ac 2fa90000 409e001c 38210030 ebe1fff8 4e800020
[72084.110335] 60000000 60000000 60420000 7c210b78 <e92d0000> 89290009 71290002 40820050
From kernel logs, i see that most of CPUs reporting soft-lockup are busy executing
kvmppc_
From xmon, i also examined all CPUs and most of them were either busy executing
kvmppc_
but it didn't complete. Soft-lockup got logged upon reboot post crash and OS never made back
to login prompt. Leaving machine is same state.
== Comment: #38 - Paul Mackerras <email address hidden> - 2018-03-01 00:11:04 ==
I looked at this system today. It was in a state where characters typed on the console would echo but it did not otherwise respond. Fortunately, ^O x was able to get me into xmon.
CPU 0x54 is in _raw_spin_lock called from kvmppc_
I have left the system in xmon for now.
== Comment: #40 - Paul Mackerras <email address hidden> - 2018-03-02 05:39:03 ==
CPU 0x60 was stuck in the spin_lock call inside kvm_arch_
Did you have the indep_threads_mode parameter set to N on this machine when the problem occurred?
== Comment: #45 - INDIRA P. JOGA <email address hidden> - 2018-03-06 01:33:26 ==
From host machine, while tests are running on guests
root@boslcp3:~# cat /sys/module/
Y
root@boslcp3:
== Comment: #47 - Paul Mackerras <email address hidden> - 2018-03-07 01:10:42 ==
This patch fixes a bug in the assembly code in the KVM guest exit path which could cause the trap number in r12 to get overwritten. The effect of this would be that if the exit is due to a doorbell interrupt from another CPU, we would effectively discard the doorbell interrupt and not handle it after returning to host context. I think this is why we are seeing CPUs locked up in smp_call_
== Comment: #116 - INDIRA P. JOGA <email address hidden> - 2018-03-21 04:42:09 ==
(In reply to comment #115)
> (In reply to comment #113)
> > After 2.2 upgrade, run completed 62 hours with 2 guests running. It looks
> > the system is stable now.
>
> Hi Indira/Chethan,
>
> Can you please let know test run final status ?
Hi Vipin,
After 2.2 upgrade , 2 guests completed 72 hours of run. Host system is doing fine along with guests without crash/hang.
Regards,
Indira
== Comment: #119 - VIPIN K. PARASHAR <email address hidden> - 2018-03-21 05:33:19 ==
(In reply to comment #47)
> Created attachment 125084 [details]
> Patch to return trap number correctly from __kvmppc_
>
> This patch fixes a bug in the assembly code in the KVM guest exit path which
> could cause the trap number in r12 to get overwritten. The effect of this
> would be that if the exit is due to a doorbell interrupt from another CPU,
> we would effectively discard the doorbell interrupt and not handle it after
> returning to host context. I think this is why we are seeing CPUs locked up
> in smp_call_
> in a KVM guest, hence that other CPU never ran the requested function, hence
> the CPU doing smp_call_
> that CPU was holding spinlocks and then other CPUs would wait forever in
> raw_spin_lock().
Commit a8b48a4dcc contains the fix with upstream kernel.
It reads..
KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_
break-fixes: fd7bacbca47a86a
Changed in ubuntu-power-systems: | |
importance: | Undecided → Critical |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
tags: | added: triage-g |
Changed in linux (Ubuntu): | |
status: | New → Triaged |
importance: | Undecided → Critical |
Changed in linux (Ubuntu Bionic): | |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team) |
Changed in ubuntu-power-systems: | |
status: | New → Triaged |
description: | updated |
tags: | added: kernel-bug-break-fix |
Changed in linux (Ubuntu Bionic): | |
status: | Triaged → Fix Committed |
Changed in linux (Ubuntu Artful): | |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → High |
Changed in linux (Ubuntu Artful): | |
status: | New → Incomplete |
Changed in linux (Ubuntu Bionic): | |
status: | Fix Committed → Fix Released |
Changed in ubuntu-power-systems: | |
status: | Triaged → Incomplete |
Changed in linux (Ubuntu Artful): | |
status: | Incomplete → Won't Fix |
Changed in ubuntu-power-systems: | |
status: | Incomplete → Fix Released |
Default Comment by Bridge