ISST-LTE:KVM:Ubuntu18.04:BostonLC:boslcp3:boslcp3g3:Guest console hangs after hotplug CPU add operation.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | Critical | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Critical | Canonical Kernel Team |
Bionic | Fix Released | Critical | Canonical Kernel Team |
Bug Description
Problem Description:
===================
Performed a CPU hotplug attach operation on the guest, after which the guest console became unresponsive.
Steps to re-create:
==================
1. updated boslcp3 host BMC :116 & PNOR: 20180302 levels
2. Installed Ubuntu1804 on boslcp3 host & guests with trap issue fixes
root@boslcp3:/home# uname -a
Linux boslcp3 4.15.0-12-generic #13+leo20180320 SMP Tue Mar 20 13:10:42 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:/home# uname -r
4.15.0-12-generic
root@boslcp3g3:
Linux boslcp3g3 4.15.0-12-generic #13+leo20180320 SMP Tue Mar 20 13:10:42 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3g3:
4.15.0-12-generic
3. Started HTX & stress-ng on the guest for 10-15 min
4. Cleaned up the tests before the hot-plug and ensured enough memory and CPU were available (killed all processes using kill)
5. Performed CPU hot-plug and the guest went into a hung state
Before Hotplug:
root@boslcp3:~# virsh dumpxml boslcp3g3 | grep vcpu
<vcpu placemen
Hotplug add CPU:
root@boslcp3:~# virsh setvcpus boslcp3g3 48 --live
dumpxml:
root@boslcp3:~# virsh dumpxml boslcp3g3 | grep cpu
<vcpu placement='static' current=
<vcpus>
<vcpu id='0' enabled='yes' hotpluggable='no' order='1'/>
<vcpu id='1' enabled='yes' hotpluggable='no' order='1'/>
<vcpu id='2' enabled='yes' hotpluggable='no' order='1'/>
<vcpu id='3' enabled='yes' hotpluggable='no' order='1'/>
<vcpu id='4' enabled='yes' hotpluggable='no' order='2'/>
<vcpu id='5' enabled='yes' hotpluggable='no' order='2'/>
<vcpu id='6' enabled='yes' hotpluggable='no' order='2'/>
<vcpu id='7' enabled='yes' hotpluggable='no' order='2'/>
<vcpu id='8' enabled='yes' hotpluggable='no' order='3'/>
<vcpu id='9' enabled='yes' hotpluggable='no' order='3'/>
<vcpu id='10' enabled='yes' hotpluggable='no' order='3'/>
<vcpu id='11' enabled='yes' hotpluggable='no' order='3'/>
<vcpu id='12' enabled='yes' hotpluggable='no' order='4'/>
<vcpu id='13' enabled='yes' hotpluggable='no' order='4'/>
<vcpu id='14' enabled='yes' hotpluggable='no' order='4'/>
<vcpu id='15' enabled='yes' hotpluggable='no' order='4'/>
<vcpu id='16' enabled='yes' hotpluggable='no' order='5'/>
<vcpu id='17' enabled='yes' hotpluggable='no' order='5'/>
<vcpu id='18' enabled='yes' hotpluggable='no' order='5'/>
<vcpu id='19' enabled='yes' hotpluggable='no' order='5'/>
<vcpu id='20' enabled='yes' hotpluggable='no' order='6'/>
<vcpu id='21' enabled='yes' hotpluggable='no' order='6'/>
<vcpu id='22' enabled='yes' hotpluggable='no' order='6'/>
<vcpu id='23' enabled='yes' hotpluggable='no' order='6'/>
<vcpu id='24' enabled='yes' hotpluggable='no' order='7'/>
<vcpu id='25' enabled='yes' hotpluggable='no' order='7'/>
<vcpu id='26' enabled='yes' hotpluggable='no' order='7'/>
<vcpu id='27' enabled='yes' hotpluggable='no' order='7'/>
<vcpu id='28' enabled='yes' hotpluggable='no' order='8'/>
<vcpu id='29' enabled='yes' hotpluggable='no' order='8'/>
<vcpu id='30' enabled='yes' hotpluggable='no' order='8'/>
<vcpu id='31' enabled='yes' hotpluggable='no' order='8'/>
<vcpu id='32' enabled='yes' hotpluggable='yes' order='9'/>
<vcpu id='33' enabled='yes' hotpluggable='yes' order='9'/>
<vcpu id='34' enabled='yes' hotpluggable='yes' order='9'/>
<vcpu id='35' enabled='yes' hotpluggable='yes' order='9'/>
<vcpu id='36' enabled='yes' hotpluggable='yes' order='10'/>
<vcpu id='37' enabled='yes' hotpluggable='yes' order='10'/>
<vcpu id='38' enabled='yes' hotpluggable='yes' order='10'/>
<vcpu id='39' enabled='yes' hotpluggable='yes' order='10'/>
<vcpu id='40' enabled='yes' hotpluggable='yes' order='11'/>
<vcpu id='41' enabled='yes' hotpluggable='yes' order='11'/>
<vcpu id='42' enabled='yes' hotpluggable='yes' order='11'/>
<vcpu id='43' enabled='yes' hotpluggable='yes' order='11'/>
<vcpu id='44' enabled='yes' hotpluggable='yes' order='12'/>
<vcpu id='45' enabled='yes' hotpluggable='yes' order='12'/>
<vcpu id='46' enabled='yes' hotpluggable='yes' order='12'/>
<vcpu id='47' enabled='yes' hotpluggable='yes' order='12'/>
<vcpu id='48' enabled='no' hotpluggable=
<vcpu id='49' enabled='no' hotpluggable=
<vcpu id='50' enabled='no' hotpluggable=
<vcpu id='51' enabled='no' hotpluggable=
<vcpu id='52' enabled='no' hotpluggable=
<vcpu id='53' enabled='no' hotpluggable=
<vcpu id='54' enabled='no' hotpluggable=
<vcpu id='55' enabled='no' hotpluggable=
<vcpu id='56' enabled='no' hotpluggable=
<vcpu id='57' enabled='no' hotpluggable=
<vcpu id='58' enabled='no' hotpluggable=
<vcpu id='59' enabled='no' hotpluggable=
<vcpu id='60' enabled='no' hotpluggable=
<vcpu id='61' enabled='no' hotpluggable=
<vcpu id='62' enabled='no' hotpluggable=
<vcpu id='63' enabled='no' hotpluggable=
</vcpus>
<cpu mode='host-model' check='partial'>
</cpu>
root@boslcp3:~#
6. After this operation, the guest becomes unresponsive, as shown below
root@boslcp3g3:~# [ 3626.140773] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.
[ 3626.146375] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.146457] "echo 0 > /proc/sys/
[ 3626.146624] INFO: task systemd-journal:665 blocked for more than 120 seconds.
[ 3626.146699] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.146768] "echo 0 > /proc/sys/
[ 3626.146939] INFO: task rs:main Q:Reg:1995 blocked for more than 120 seconds.
[ 3626.147016] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.147088] "echo 0 > /proc/sys/
[ 3626.147285] INFO: task kworker/
[ 3626.147361] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.147434] "echo 0 > /proc/sys/
[ 3626.147622] INFO: task smbd:1449 blocked for more than 120 seconds.
[ 3626.147686] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.147760] "echo 0 > /proc/sys/
[ 3626.147875] INFO: task smbd:1452 blocked for more than 120 seconds.
[ 3626.147937] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.148010] "echo 0 > /proc/sys/
[ 3626.148110] INFO: task smbd:1454 blocked for more than 120 seconds.
[ 3626.148173] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.148245] "echo 0 > /proc/sys/
[ 3626.148344] INFO: task cron:1461 blocked for more than 120 seconds.
[ 3626.148406] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3626.148488] "echo 0 > /proc/sys/
root@boslcp3g3:~#
root@boslcp3g3:~# ps -ef | grep stress-ng
[ 3746.978098] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.
[ 3746.978221] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3746.978301] "echo 0 > /proc/sys/
[ 3746.978447] INFO: task systemd-journal:665 blocked for more than 120 seconds.
[ 3746.978534] Tainted: G W 4.15.0-12-generic #13+leo20180320
[ 3746.978607] "echo 0 > /proc/sys/
[ 4446.361899] systemd[1]: Failed to start Journal Service.
[ 4897.632142] systemd[1]: Failed to start Journal Service.
^Z
^X
^C
^Z
^X
^C
7. ping to boslcp3g3 is fine but the guest console is not responding
[ipjoga@kte (AUS) ~]$ ping boslcp3g3
PING boslcp3g3.
64 bytes from boslcp3g3.
64 bytes from boslcp3g3.
^C
8. Took a dump of the guest; attached vmcore & other logs.
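When triaging a console log like the one in step 6, it can help to collapse the repeated hung-task warnings into one entry per task. The following is a minimal sketch, not part of the original report: the function name `summarize_hung_tasks` and the regular expression are assumptions for illustration, written against the "INFO: task NAME:PID blocked for more than N seconds." lines shown above. Task names may themselves contain colons and spaces (for example "rs:main Q:Reg"), so the PID is taken as the last colon-separated number before "blocked".

```python
import re
from collections import Counter

# Matches lines such as:
# [ 3626.140773] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.
# The greedy (.+) keeps colons inside the task name; the PID is the final :digits.
HUNG_TASK = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+INFO: task (.+):(\d+) blocked for more than (\d+) seconds\."
)

def summarize_hung_tasks(log_text):
    """Count hung-task reports per (task name, pid) in a console or dmesg log."""
    reports = Counter()
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            reports[(match.group(2), int(match.group(3)))] += 1
    return reports

if __name__ == "__main__":
    sample = (
        "[ 3626.140773] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.\n"
        "[ 3626.146939] INFO: task rs:main Q:Reg:1995 blocked for more than 120 seconds.\n"
        "[ 3746.978098] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.\n"
    )
    for (name, pid), count in summarize_hung_tasks(sample).items():
        print(f"{name} (pid {pid}): {count} report(s)")
```

Lines that are truncated in the log (such as the kworker entry above) simply do not match and are skipped.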
Thanks to the Linux block community, I'm now aware of two commits that should fix this issue.
https:/
blk-mq: simplify queue mapping & schedule with each possisble CPU
The previous patch assigns interrupt vectors to all possible CPUs, so
now hctx can be mapped to possible CPUs, this patch applies this fact
to simplify queue mapping & schedule so that we don't need to handle
CPU hotplug for dealing with physical CPU plug & unplug. With this
simplication, we can work well on physical CPU plug & unplug, which
is a normal use case for VM at least.
Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
set hctx->numa_node for possible CPUs which are mapped to this hctx. And
only choose the online CPUs for schedule.
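To make the idea in this commit message concrete, here is a toy model in Python. It is an illustration under stated assumptions, not the kernel's algorithm: the round-robin `cpu % nr_queues` assignment and the names `map_queues` and `schedulable_cpus` are hypothetical. What it tries to show is the property the commit describes: the CPU-to-queue map is built once over all possible CPUs, and only the online subset is consulted when choosing where to run a queue, so bringing a CPU online does not require remapping.

```python
def map_queues(possible_cpus, nr_queues):
    """Assign every *possible* CPU to a hardware queue (toy round-robin mapping)."""
    mapping = {queue: [] for queue in range(nr_queues)}
    for cpu in sorted(possible_cpus):
        mapping[cpu % nr_queues].append(cpu)
    return mapping

def schedulable_cpus(mapping, online_cpus):
    """For each queue, keep only the mapped CPUs that are currently online."""
    return {
        queue: [cpu for cpu in cpus if cpu in online_cpus]
        for queue, cpus in mapping.items()
    }

if __name__ == "__main__":
    # 64 possible vCPUs and 4 queues; the queue count is an assumed value.
    mapping = map_queues(range(64), 4)
    before = schedulable_cpus(mapping, set(range(32)))  # 32 vCPUs online
    after = schedulable_cpus(mapping, set(range(48)))   # after hotplug to 48
    # The mapping itself is untouched; only the online filter changes.
    print(len(before[0]), len(after[0]))
```

In this model a newly onlined CPU such as 32 already has a queue assigned before it comes online, which is the situation the commit aims for.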
https:/
genirq/affinity: assign vectors to all possible CPUs
Currently we assign managed interrupt vectors to all present CPUs. This
works fine for systems were we only online/offline CPUs. But in case of
systems that support physical CPU hotplug (or the virtualized version of
it) this means the additional CPUs covered for in the ACPI tables or on
the command line are not catered for. To fix this we'd either need to
introduce new hotplug CPU states just for this case, or we can start
assining vectors to possible but not present CPUs.
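The distinction between possible, present, and online CPUs can be inspected on a running Linux system through the sysfs files `/sys/devices/system/cpu/possible`, `present`, and `online`, which hold CPU lists such as `0-47`. The sketch below is an illustration rather than anything from the original report; the helper names `parse_cpu_list` and `read_cpu_set` are assumptions. It parses that list format and reports CPUs that are possible but not yet present, which is the group the commit above starts assigning vectors to.

```python
from pathlib import Path

def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-3,8,10-11' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file yields an empty set
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus

def read_cpu_set(name, base="/sys/devices/system/cpu"):
    """Read one of 'possible', 'present' or 'online'; return None if unreadable."""
    try:
        return parse_cpu_list((Path(base) / name).read_text())
    except OSError:
        return None

if __name__ == "__main__":
    possible = read_cpu_set("possible")
    present = read_cpu_set("present")
    online = read_cpu_set("online")
    if None in (possible, present, online):
        print("CPU sysfs files are not available on this system")
    else:
        print("possible but not present:", sorted(possible - present))
        print("present but offline:", sorted(present - online))
```

Running this inside the guest before and after a hotplug operation would show how the sets change, assuming the guest is still responsive enough to run it.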
Changed in ubuntu-power-systems: | |
status: | New → Triaged |
importance: | Undecided → Critical |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
tags: | added: triage-g |
Changed in linux (Ubuntu): | |
status: | New → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | Triaged → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
Changed in linux (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in ubuntu-power-systems: | |
status: | Fix Released → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
tags: | added: verification-done-bionic removed: verification-needed-bionic |
Changed in linux (Ubuntu): | |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team) |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
Changed in linux (Ubuntu): | |
importance: | Undecided → Critical |
Changed in linux (Ubuntu Bionic): | |
importance: | Undecided → Critical |
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
tags: | added: cscc |