cpu hotplug crashes the guest!

Bug #2076587 reported by bugproxy
This bug affects 1 person
Affects                           Status        Importance  Assigned to                             Milestone
The Ubuntu-power-systems project  In Progress   High        Ubuntu on IBM Power Systems Bug Triage
qemu (Ubuntu)                     Status tracked in Oracular
  Noble                           Triaged       High        Sergio Durigan Junior
  Oracular                        Fix Released  High        Sergio Durigan Junior

Bug Description

SRU Justification:

[ Impact ]

 * While running a (nested) KVM guest on Power 10 (with PowerVM)
   and performing a CPU hotplug, trying to set to 68 vCPUs,
   the KVM guest crashes.

 * In the failure case the KVM guest has maxvcpus 128,
   and it starts fine with an initial value of 4 vCPUs,
   but fails after a larger increase (here to 68 vCPUs).

 * The error reported is:
   [ 662.102542] KVM: Create Guest vcpu hcall failed, rc=-44
   error: Unable to read from monitor: Connection reset by peer

 * This seems to happen especially on memory-constrained systems.

 * This can be avoided by pre-creating and parking the vCPUs: on success
   the hotplug proceeds, otherwise an error is returned. A vCPU hotplug
   failure then produces a graceful error while the guest keeps running.

[ Fix ]

 * 08c3286822 ("accel/kvm: Extract common KVM vCPU {creation,parking} code") [pre-req]

 * c6a3d7bc9e ("accel/kvm: Introduce kvm_create_and_park_vcpu() helper")

 * 18530e7c57 ("cpu-common.c: export cpu_get_free_index to be reused later")

 * cfb52d07f5 ("target/ppc: handle vcpu hotplug failure gracefully")

[ Test Plan ]

 * Setup an IBM Power10 system (with firmware FW1060 or newer,
   that comes with nested KVM support), running Ubuntu Server 24.04.

 * Install and configure KVM on this system with a (higher)
   maxvcpus value of 128, but have a (smaller) initial value of 4 vCPUs.
   $ virsh define ubu2404.xml
   (https://launchpadlibrarian.net/748483993/check.xml)

 * Now after successful definition, start the VM:
   $ virsh start ubu2404 --console

 * Once the VM is up and running, increase the vCPU count to a larger
   value, here 68:
   $ virsh setvcpus ubu2404 68

 * A system with an unpatched qemu will crash, showing:
   [ 662.102542] KVM: Create Guest vcpu hcall failed, rc=-44
   error: Unable to read from monitor: Connection reset by peer

 * A patched environment will:
   - either successfully hotplug the new number (68) of vCPUs
     without further messages
   - or (if memory is very constrained) print a graceful error
     message that the hotplug couldn't be performed,
     while the guest stays up and running:
     error: internal error: unable to execute QEMU command 'device_add': \
     kvmppc_cpu_realize: vcpu hotplug failed with -12

 * Since certain firmware is required, IBM is doing the test and validation
   (and already successfully verified based on the PPA test builds).
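
   The three outcomes described in the test plan above (hard crash,
   successful hotplug, graceful failure) can be told apart mechanically.
   The following is only a minimal sketch: the function names and the
   result labels are invented for illustration and are not part of qemu
   or libvirt. It assumes that "virsh domstate" reports "running" for a
   guest that survived the attempt, and that the graceful error contains
   the text "vcpu hotplug failed" as shown above.

```shell
# classify_hotplug EXIT_CODE VIRSH_OUTPUT DOMAIN_STATE
# Hypothetical helper: maps one hotplug attempt to a result label.
classify_hotplug() {
    rc=$1
    output=$2
    state=$3

    if [ "$state" != "running" ]; then
        # Unpatched behaviour: the guest is gone after the attempt.
        echo "crashed"
    elif [ "$rc" -eq 0 ]; then
        # The requested number of vCPUs was hotplugged.
        echo "hotplugged"
    else
        case $output in
            # Patched behaviour under memory pressure: error, guest survives.
            *"vcpu hotplug failed"*) echo "graceful-failure" ;;
            *) echo "other-error" ;;
        esac
    fi
}

# run_hotplug_check DOMAIN COUNT
# Hypothetical wrapper: needs virsh and a running domain.
run_hotplug_check() {
    domain=$1
    count=$2
    output=$(virsh setvcpus "$domain" "$count" 2>&1)
    rc=$?
    state=$(virsh domstate "$domain" 2>/dev/null)
    classify_hotplug "$rc" "$output" "$state"
}
```

   With the domain from this test plan, "run_hotplug_check ubu2404 68"
   would be expected to print "crashed" on an unpatched qemu, and either
   "hotplugged" or "graceful-failure" on a patched one.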

[ Where problems could occur ]

 * All modifications were made in target/ppc/kvm.c
   and are therefore limited to the IBM Power platform;
   they will not affect other architectures.

 * The implementation of the pre-creation of vCPUs (init cpu_target_realize)
   may lead to early failures when a user doesn't expect to have such an
   amount of vCPUs yet.

 * And the pre-creation and especially parking (kvm_create_and_park_vcpu)
   will probably consume more resources than before.

 * Hence a patched system might run with a reduced maximum number of vCPUs,
   but instead of crashing hard it will fail gracefully on lack of resources.

 * This case and the patch(es) are also discussed in more detail here:
   https://<email address hidden>/T/#t
   and here:
   https://bugzilla.redhat.com/show_bug.cgi?id=2304078

[ Other Info ]

 * The code was accepted upstream with qemu v9.1.0(-rc0),
   and the upload to oracular is done,
   so now only noble is affected.

 * Ubuntu releases older than noble are not affected,
   since (nested) KVM virtualization on P10
   was introduced starting with noble.
__________

== Comment: #0 - SEETEENA THOUFEEK <email address hidden> - 2024-08-12 03:47:06 ==
+++ This bug was initially created as a clone of Bug #205620 +++

---Problem Description---
cpu hotplug crashes the guest!

---Steps to Reproduce---
I have been trying CPU hotplug on a guest with maxvcpus set to 128 and a current value of 4. When I try to hotplug up to 68 vCPUs, the guest crashes and we get this error message:
[ 303.808494] KVM: Create Guest vcpu hcall failed, rc=-44
error: Unable to read from monitor: Connection reset by peer

Steps to reproduce:

1) virsh define bug.xml

2) virsh start Fedora39 --console

3) virsh setvcpus Fedora39 68

Output :
[ 662.102542] KVM: Create Guest vcpu hcall failed, rc=-44
error: Unable to read from monitor: Connection reset by peer

If resources are insufficient, I think it should fail gracefully!
I am attaching the XML file that I used, and I will post the observations from the MDC system, where I saw the same failure at a higher vCPU count.

fixed with upstream commit

https://github.com/qemu/qemu/commit/cfb52d07f53aa916003d43f69c945c2b42bc6374

Machine Type = na

---Debugger---
A debugger is not configured

Contact Information = <email address hidden>

---uname output---
NA

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-208538 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → qemu (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
Changed in qemu (Ubuntu Oracular):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → nobody
Changed in ubuntu-power-systems:
status: New → Triaged
Changed in qemu (Ubuntu Noble):
status: New → Triaged
Changed in qemu (Ubuntu Oracular):
status: New → Triaged
importance: Undecided → High
Changed in qemu (Ubuntu Noble):
importance: Undecided → High
Changed in ubuntu-power-systems:
importance: Undecided → Medium
importance: Medium → High
Sergio Durigan Junior (sergiodj) wrote :

Hello Seeteena,

Thanks for your bug report.

While working on backporting the patch mentioned above to QEMU 8.2.2 (which is the version we have on Ubuntu Noble), I noticed that there are extra commits that are required for this one to work. Based on what I've seen, at least the following commits are necessary:

08c328682231b64878fc052a11091bea39577a6f
c6a3d7bc9e3acf2431ac23ae6dbeb28aa92f873c

The second one is simple, but depends on the first one, which is a bit more involved; it moves code around but also changes things (even though, according to its description, no functional changes were intended). Because we're talking about an LTS release, I'm a bit more hesitant to proceed. Also, it's worth mentioning that we may very well have to backport even more patches; the first patch does not apply cleanly either and I haven't checked the reason.

I will continue looking tomorrow, but it would be great to have IBM's assistance with this backport.

Thank you!

bugproxy (bugproxy) wrote : backported-patch-0001

------- Comment on attachment From <email address hidden> 2024-08-14 07:08 EDT-------

backported-patch-0001

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2024-08-14 07:15 EDT-------
All 4 required upstream patches (mentioned below) have been backported for Qemu v8.2.2 and attached to this BZ.

Pre-req patch:
08c3286822 accel/kvm: Extract common KVM vCPU {creation,parking} code

Fix patches:
c6a3d7bc9e accel/kvm: Introduce kvm_create_and_park_vcpu() helper
18530e7c57 cpu-common.c: export cpu_get_free_index to be reused later
cfb52d07f5 target/ppc: handle vcpu hotplug failure gracefully

Hope this helps.

Thanks
Harsh

bugproxy (bugproxy) wrote : backported-patch-0002

------- Comment on attachment From <email address hidden> 2024-08-14 07:10 EDT-------

backported-patch-0002

bugproxy (bugproxy) wrote : backported-patch-0003

------- Comment on attachment From <email address hidden> 2024-08-14 07:11 EDT-------

backported-patch-0003

bugproxy (bugproxy) wrote : backported-patch-0004

------- Comment on attachment From <email address hidden> 2024-08-14 07:12 EDT-------

backported-patch-0004

Sergio Durigan Junior (sergiodj) wrote :

Thank you.

I will analyze the patches and proceed with the uploads.

Changed in qemu (Ubuntu Noble):
assignee: nobody → Sergio Durigan Junior (sergiodj)
Changed in qemu (Ubuntu Oracular):
assignee: nobody → Sergio Durigan Junior (sergiodj)
tags: added: server-todo
Sergio Durigan Junior (sergiodj) wrote :

Hi,

I have patched the QEMU package from Noble and built it in the following PPA:

https://launchpad.net/~sergiodj/+archive/ubuntu/qemu-bug2076587

Given the fact that we'll be relying on IBM to do the verification of this bug for us, would you be able to give this package a try and let me know if it fixes the issue?

Thank you.

Sergio Durigan Junior (sergiodj) wrote :

You will also find an Oracular QEMU package in that same PPA which contains the same fix. Could you please give it a try as well?

Thank you.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2024-08-30 02:58 EDT-------
Due to limited system availability I was not able to test this patch; I will do so by this evening and update the Bugzilla!

Sorry for the delay!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-09-03 04:19 EDT-------
Hi sergiodj,
I tried validating the qemu package available at the provided location: https://launchpad.net/~sergiodj/+archive/ubuntu/qemu-bug2076587
However, the issue is still seen, and I do not see the fix patches in the diff available at the above package location:
http://launchpadlibrarian.net/745431988/qemu_1%3A8.2.1+ds-1ubuntu1_1%3A8.2.2+ds-0ubuntu1.2.diff.gz

Could you please provide an updated package that includes the backported fixes?

Thanks
Anushree Mathur

Sergio Durigan Junior (sergiodj) wrote :

Hi Anushree,

Thanks for taking the time to test the package. It does contain the backported patch, but I think I know what happened: the Security team uploaded a fix in the meantime, so the package from my PPA became obsolete and won't be installed automatically even after you enable the PPA on your system.

I rebased my changes on top of the latest QEMU version from Noble and uploaded a new package, whose version is 1:8.2.2+ds-0ubuntu1.3~ppa1. It's building now; please take a look when it's available and let me know if it fixes the issue.

Thanks again!
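
The mix-up above came down to which qemu build was actually installed. As a minimal sketch (not something posted in this thread), the Debian/Ubuntu package version can be pulled out of the "qemu-system-ppc64 --version" banner and checked for a PPA suffix before re-testing. The helper names are hypothetical, the banner format is assumed to match the one quoted elsewhere in this report, and the "~ppa" check only reflects the naming of the PPA build mentioned here.

```shell
# debian_version_of: reads `qemu-system-ppc64 --version` output on stdin
# and prints the package version found in the first line's "(Debian ...)".
# Hypothetical helper; prints nothing if the banner has another format.
debian_version_of() {
    sed -n '1s/^QEMU emulator version .*(Debian \(.*\))$/\1/p'
}

# is_ppa_build VERSION: succeeds if the version string carries a "~ppa"
# suffix, as the test build in this report does (assumed convention).
is_ppa_build() {
    case $1 in
        *"~ppa"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, 'ver=$(qemu-system-ppc64 --version | debian_version_of)' followed by 'is_ppa_build "$ver" || echo "PPA build not installed: $ver"' would flag a system that is still running the archive package.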

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-09-06 02:50 EDT-------
Hi sergiodj,
Thanks for the update. I have validated the patch now and it is working fine.

Analysis:
~# qemu-system-ppc64 --version
QEMU emulator version 8.2.2 (Debian 1:8.2.2+ds-0ubuntu1.3~ppa1)
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers

~# virsh setvcpus check_hotplug 700
error: internal error: unable to execute QEMU command 'device_add': kvmppc_cpu_realize: vcpu hotplug failed with -12

L2 keeps running!

Thanks
Anushree Mathur

Sergio Durigan Junior (sergiodj) wrote :

Hi Anushree,

Thanks for the verification. I'll upload the package to Oracular and start preparing the SRU to Noble.

Changed in qemu (Ubuntu Oracular):
status: Triaged → Fix Committed
Frank Heimes (fheimes) wrote :

Hello Anushree, would you mind sharing the bug.xml file that you mentioned and that you use for testing, for further reference (for us knowing how the VM was defined)?

Frank Heimes (fheimes) wrote :

And Anushree, would you also agree that bug LP: 2067383 / Bugzilla: 206641 (https://bugs.launchpad.net/bugs/2067383)
can be considered a duplicate of this one (LP: 2076587 / Bugzilla: 208538)?
They seem to be suspiciously similar ...

description: updated
Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote : guest xml for cpu hotplug test

------- Comment on attachment From <email address hidden> 2024-09-11 04:14 EDT-------

Hi fheimes,
Yes, I agree, these bugs are all the same issue!
I validated this on my system by updating it with "apt upgrade"; the updated qemu version includes the fix.

qemu-system-ppc64 -version
QEMU emulator version 8.2.2 (Debian 1:8.2.2+ds-0ubuntu1.3~ppa1)
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers

Attaching the guest XML for future reference!

Thanks
Anushree Mathur

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:9.0.2+ds-4ubuntu5

---------------
qemu (1:9.0.2+ds-4ubuntu5) oracular; urgency=medium

  * d/rules: Revert move of helper binaries (qemu-bridge-helper,
    virtfs-proxy-helper, vhost-user-gpu) from /usr/lib/qemu/ to
    /usr/libexec/qemu/. This was starting to cause breakages on other
    packages (libvirt, for example), and Debian went the same
    route. This change can be dropped next cycle when QEMU is merged
    again.
    See https://salsa.debian.org/qemu-team/qemu/-/commit/f265f4788f
    for Debian's counterpart. (LP: #2079870)

qemu (1:9.0.2+ds-4ubuntu4) oracular; urgency=medium

  * Fail gracefully when hotplugging a vCPU fails on PPC. (LP: #2076587)
    - d/p/u/lp2076587-cpu-hotplug-crashes-guest-*.patch: Backport
      patches for upstream fix.

 -- Sergio Durigan Junior <email address hidden> Tue, 10 Sep 2024 13:35:33 -0400

Changed in qemu (Ubuntu Oracular):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
description: updated