Fix broken PCIe device after migration for Q35 machines

Bug #2033193 reported by Matthew Heler
This bug affects 1 person
Affects         Status         Importance   Assigned to   Milestone
qemu (Ubuntu)   Fix Released   Undecided    Unassigned
Jammy           Incomplete     Undecided    Unassigned

Bug Description

A bug introduced in the QEMU 6.1 and 6.2 Q35 machine types causes hot-plugged PCIe devices to go offline during a migration. It was fixed in QEMU 7.0, and other vendors have backported the fix in their downstream releases.

The bug is here, https://<email address hidden>/msg872211.html

I can hit this bug currently when running OpenStack on top of Ubuntu 22.04, and using the provided libvirt and qemu-kvm packages from that release.

Is it possible we can get this fix backported to Jammy?

Thank you,

Christian Ehrhardt  (paelzer) wrote :

Thanks for the report.
Given the versions, the only affected release is Jammy; Lunar and later are already fixed.

The patches you referred:
- #1 & #2 got agreed
- #3 got challenged and not accepted (https://<email address hidden>/msg872376.html)
- none of them ever landed upstream; they seem to have only ever made it into the downstreams

Given that, I'm not so sure about "fixed in qemu 7.0". Maybe it went through a lot of changes and I couldn't spot it easily. Since you mentioned "fixed in Qemu 7.0", do you have a pointer to the change that actually went into qemu 7.0?

I've spotted it in the downstream though:
https://git.centos.org/rpms/qemu-kvm/blob/24c15060b5a0a922f6472e49b2087ad174ffb63d/f/SOURCES/kvm-pci-expose-TYPE_XIO3130_DOWNSTREAM-name.patch
https://git.centos.org/rpms/qemu-kvm/blob/24c15060b5a0a922f6472e49b2087ad174ffb63d/f/SOURCES/kvm-acpi-pcihp-pcie-set-power-on-cap-on-parent-slot.patch

And I agree that it was dropped there in the move to 7.0, but there was no reference to what replaced it.

I have not hit the same issue in Jammy when migrating instances.
Therefore I think "can hit this bug currently when running OpenStack on top of Ubuntu 22.04" is a bit too generic.
Here are a few questions that should help with the later SRU [1] processing:
1. Is there any special device, or anything else, that you need to expose to the guest in order to see the issue?
2. What does your guest XML look like?
3. Bug https://bugzilla.redhat.com/show_bug.cgi?id=2053584 speaks of a virtio-serial-port, is that something you are using as well?
4. Is it then also a guest soft lockup or some other symptom for you?
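
As a quick way to answer the guest XML question, the machine type can be read from the domain XML. The following is only a minimal sketch: it assumes the standard libvirt layout where the machine type is the `machine` attribute of `<os><type>`, and it only matches the upstream names `pc-q35-6.1` and `pc-q35-6.2` mentioned in the description. Distro-specific machine type aliases are not covered, and the example XML and function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Upstream Q35 machine-type versions named in the bug description.
# Assumption: distro-specific aliases are not matched by this pattern.
AFFECTED = re.compile(r"^pc-q35-6\.[12]$")


def machine_type(domain_xml: str) -> Optional[str]:
    """Return the machine attribute of <os><type> in a libvirt domain XML."""
    root = ET.fromstring(domain_xml)
    os_type = root.find("./os/type")
    return None if os_type is None else os_type.get("machine")


def looks_affected(domain_xml: str) -> bool:
    """True if the guest uses one of the machine types named in the report."""
    machine = machine_type(domain_xml)
    return bool(machine and AFFECTED.match(machine))


# Hypothetical, heavily trimmed domain XML used only for illustration.
EXAMPLE_XML = """
<domain type='kvm'>
  <name>example-guest</name>
  <os>
    <type arch='x86_64' machine='pc-q35-6.2'>hvm</type>
  </os>
</domain>
"""

if __name__ == "__main__":
    print(machine_type(EXAMPLE_XML), looks_affected(EXAMPLE_XML))
```

The XML itself would come from `virsh dumpxml` on the host; a match only indicates that the guest uses one of the named machine types, not that the bug will trigger.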

Note: The upstream bug https://bugzilla.redhat.com/show_bug.cgi?id=2053584 has repro ideas that could be useful as well.

A further question: if someone provided you with a trial PPA of Jammy plus those fixes, could you easily test it in your environment?

[1]: https://wiki.ubuntu.com/StableReleaseUpdates

Changed in qemu (Ubuntu):
status: New → Incomplete
status: Incomplete → Fix Released
Changed in qemu (Ubuntu Jammy):
status: New → Incomplete

Matthew Heler (mheler) wrote:

Hi Christian,

The fix landed in QEMU 7.0, and it is listed in the public changelog on qemu.org:
https://wiki.qemu.org/ChangeLog/7.0#PCI/PCIe

The issue only occurs with VMs that use the Q35 machine type. To trigger it, add any type of hot-plugged device (in my case a disk drive), then perform I/O on that device while simultaneously doing a live migration. After that I see the same errors shown in the RH upstream bug (a guest CPU soft lockup, then all I/O to the device stalls).
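
The steps described above could be scripted roughly as follows. This is only a sketch of the sequence, not a verified reproducer: the domain name, image path, target device and destination URI are hypothetical placeholders, and the `dd` read load is just one assumed way to keep I/O running on the hot-plugged disk. The script only builds and prints the commands; the guest-side command has to keep running while the migration is in progress.

```python
import shlex


def repro_commands(domain, disk_source, disk_target, dest_uri):
    """Build the command sequence: hot-plug a disk, run I/O, live-migrate."""
    # Host: hot-plug a disk into the running Q35 guest.
    attach = ["virsh", "attach-disk", domain, disk_source, disk_target, "--live"]
    # Guest: keep reading from the new disk (assumed I/O load, any would do).
    io_load = ["dd", f"if=/dev/{disk_target}", "of=/dev/null",
               "bs=1M", "iflag=direct"]
    # Host: live-migrate while the I/O above is still running.
    migrate = ["virsh", "migrate", "--live", domain, dest_uri]
    return [("host", attach), ("guest", io_load), ("host", migrate)]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    steps = repro_commands(
        domain="example-guest",
        disk_source="/var/lib/libvirt/images/extra.img",
        disk_target="vdb",
        dest_uri="qemu+ssh://dest.example/system",
    )
    for where, cmd in steps:
        print(f"[{where}] {shlex.join(cmd)}")
```

After the migration completes, the guest kernel log would be the place to look for the soft lockup and the stalled I/O described in this comment.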

A PPA would be fine, but I think the community and others may benefit from having this pushed as a Jammy update.

Thank you,

Sven Kieske (s-kieske) wrote :

The changelog specifically mentions this patch as the fix:

https://github.com/qemu/qemu/commit/6b0969f1ec825984cd74619f0730be421b0c46fb

HTH

Athos Ribeiro (athos-ribeiro) wrote :

Thanks Matthew and Sven for all the additional information (and for the potential fix).

To keep moving this forward, do we have a minimal reproducer for this one so we can verify the mentioned patch indeed fixes the issue?

BTW, I suppose Christian's PPA proposal would be just a first step towards an SRU: we'd verify that the bug is reproducible, ensure that we can indeed fix the issue with no regressions by providing a PPA for you to test, and then proceed with an SRU if the tests are successful.
