virtio-balloon change breaks migration from qemu prior to 4.0

Bug #1848497 reported by Christian Ehrhardt  on 2019-10-17
This bug affects 1 person
Affects               | Status                  | Importance | Assigned to        | Milestone
Ubuntu Cloud Archive  |                         | Undecided  | Unassigned         |
qemu (Ubuntu)         | Status tracked in Focal |            |                    |
  Eoan                |                         | High       | Christian Ehrhardt |
  Focal               |                         | High       | Christian Ehrhardt |

Bug Description

[Impact]

 * Due to a bug in qemu 4.0, the config size of the virtio-balloon device
   changed.
 * This breaks migration from qemu versions prior to 4.0, because the
   config size affects the size of the device's PCI BAR.

 * Upstream has realized this and fixed it in 4.1; this backports the fix
   to qemu 4.0 in Ubuntu Eoan.
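To illustrate why a config-size change matters for migration, here is a minimal Python sketch. It assumes that a legacy virtio-pci I/O BAR holds a 20-byte common header (24 bytes with MSI-X) followed by the device config space, rounded up to a power of two, and that the balloon config grew from 8 bytes before 4.0 to 16 bytes in unfixed 4.0. These numbers are assumptions for illustration, not taken from this report. Under those assumptions the BAR doubles from 32 to 64 bytes, so fewer address bits in its first byte remain writable by the guest.

```python
def pow2ceil(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def legacy_bar_size(config_size: int, header: int = 20) -> int:
    """Assumed legacy virtio-pci I/O BAR size: header + config, rounded up."""
    return pow2ceil(header + config_size)


def bar_wmask_low_byte(bar_size: int) -> int:
    """Writable address bits in byte 0 of a BAR of the given size.

    Bits below the BAR size are hardwired, so the guest cannot change them.
    """
    return ~(bar_size - 1) & 0xFF


if __name__ == "__main__":
    # Assumed config sizes: 8 bytes before 4.0, 16 bytes in unfixed 4.0.
    for label, config_size in (("pre-4.0", 8), ("4.0", 16)):
        for header in (20, 24):  # without / with MSI-X (assumption)
            size = legacy_bar_size(config_size, header)
            print(f"{label} header={header}: BAR {size} bytes, "
                  f"wmask byte0 {bar_wmask_low_byte(size):#04x}")
```

With either header size, the sketch gives a 32-byte BAR (writable mask 0xe0) before 4.0 and a 64-byte BAR (writable mask 0xc0) in 4.0; the latter matches the `wmask: c0` in the error message quoted in the test case.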

[Test Case]

 * Take a pre-Eoan (pre qemu 4.0) guest and check that your setup can
   migrate it back and forth with an Eoan/qemu 4.0 target.
   Note: always use a versioned machine type such as pc-i440fx-disco
   (which is also the default if you use Disco as the source).
   Then add a virtio-balloon device to the guest on pre-4.0 and migrate it
   again.
   Without the fix, the following error shows up:
   get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0

 * Migrations from an unfixed to a fixed qemu 4.0 should work as well.
   The other direction may also work (the size did not change), but there
   are no guarantees (the target has no logic for it).
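The error line can be read as a masked comparison of one PCI config byte. The following Python sketch is modeled on how I understand QEMU's `get_pci_config_device()` (hw/pci/pci.c) to check each incoming byte: bits that differ between the migrated value (`read`) and the destination's value (`device`), and that the destination checks (`cmask`) but treats as neither writable (`wmask`) nor write-1-to-clear (`w1cmask`), make the load fail. The regular expression and the helper names are hypothetical, written only for this illustration.

```python
import re

# Hypothetical pattern for the "Bad config data" line quoted in this report.
BAD_CONFIG_RE = re.compile(
    r"Bad config data: i=0x(?P<i>[0-9a-f]+) read: (?P<read>[0-9a-f]+) "
    r"device: (?P<device>[0-9a-f]+) cmask: (?P<cmask>[0-9a-f]+) "
    r"wmask: (?P<wmask>[0-9a-f]+) w1cmask:(?P<w1cmask>[0-9a-f]+)"
)


def parse_bad_config(line: str) -> dict[str, int]:
    """Extract the hex fields of the error line as integers."""
    m = BAD_CONFIG_RE.search(line)
    if m is None:
        raise ValueError("not a 'Bad config data' line")
    return {k: int(v, 16) for k, v in m.groupdict().items()}


def conflicting_bits(read: int, device: int, cmask: int,
                     wmask: int, w1cmask: int) -> int:
    """Bits that differ and are checked but not guest-writable.

    A non-zero result means the destination rejects the migrated byte.
    """
    return (read ^ device) & cmask & ~wmask & ~w1cmask & 0xFF


line = ("get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 "
        "cmask: ff wmask: c0 w1cmask:0")
fields = parse_bad_config(line)
offset = fields.pop("i")
print(f"offset {offset:#04x}: conflicting bits {conflicting_bits(**fields):#04x}")
```

This prints `offset 0x10: conflicting bits 0x20`. Offset 0x10 is BAR0 in the PCI configuration header, and 0x20 is an address bit the guest could set while the BAR was 32 bytes on the source, but which is read-only on the destination, where the BAR is 64 bytes (`wmask: c0`). The `read: e1` variant reported in a later comment yields the same bit.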

[Regression Potential]

 * Messing with machine types is always dangerous, because a mistake
   makes things even more complex. In this case, though, the change is
   rather straightforward: pre-4.0 behavior stays the same, only 4.0 gets
   the new attribute set, and later code has logic to handle dynamic
   sizes. Therefore I think we are safe from machine-type regressions.
 * Regarding the change in behavior: it affects migrations from pre-4.0
   qemu, which are currently broken if a virtio-balloon device is present.
   There is nothing more to break in that use case, and if no such device
   is present nothing should change. Therefore, IMHO, this is safe as well.
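The machine-type reasoning can be sketched as follows. This is a hypothetical Python model, not the actual patch: the table `COMPAT_4_0_CONFIG_SIZE`, the machine-type names in it and the 8/16-byte sizes are assumptions. The idea is that only 4.0 machine types pin the larger 4.0 config size, while older machine types fall back to the pre-4.0 size, so their BAR matches what an older qemu reports.

```python
# Hypothetical sizes: pre-4.0 layout vs. the full layout used by unfixed 4.0.
BASE_CONFIG_SIZE = 8
FULL_CONFIG_SIZE = 16

# Hypothetical table: which machine types set the "keep 4.0 size" compat flag.
COMPAT_4_0_CONFIG_SIZE = {
    "pc-i440fx-3.1": False,  # pre-4.0 machine type: behaves like old qemu
    "pc-i440fx-4.0": True,   # 4.0 machine type: keeps the size 4.0 shipped with
}


def balloon_config_size(machine: str, optional_features: bool = False) -> int:
    """Config size the device would expose for a given machine type (sketch)."""
    if COMPAT_4_0_CONFIG_SIZE.get(machine, False):
        return FULL_CONFIG_SIZE
    # Otherwise the size follows the features in use; none -> the old size.
    return FULL_CONFIG_SIZE if optional_features else BASE_CONFIG_SIZE


for machine in COMPAT_4_0_CONFIG_SIZE:
    print(machine, balloon_config_size(machine))
```

In this model a guest started on an older qemu with a pre-4.0 machine type keeps the small config size after migration, while 4.0 machine types keep exactly what unfixed 4.0 exposed, which is why unfixed-to-fixed 4.0 migrations are unaffected.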

[Other Info]

 * n/a

---

Related to, but not the same as, bug 1838569, which had two error signatures.
The first is covered there and the second is handled here.

--- ---
Quote from https://bugs.launchpad.net/cloud-archive/+bug/1838569/comments/4
Daniel 'f0o' Preussker (dpreussker) wrote 1 hour ago: #4
With recent release of OpenStack Train this issue reappears...

Upgrading from Stein to Train will require all VMs to be hard-rebooted to be migrated as a final step because Live Migration fails with:

Oct 17 10:28:43 h2.1.openstack.r0cket.net libvirtd[1545]: Unable to read from monitor: Connection reset by peer
Oct 17 10:28:43 h2.1.openstack.r0cket.net libvirtd[1545]: internal error: qemu unexpectedly closed the monitor: 2019-10-17T10:28:42.981201Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
                                                          2019-10-17T10:28:42.981250Z qemu-system-x86_64: Failed to load PCIDevice:config
                                                          2019-10-17T10:28:42.981263Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
                                                          2019-10-17T10:28:42.981272Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:05.0/virtio-balloon'
                                                          2019-10-17T10:28:42.981391Z qemu-system-x86_64: warning: TSC frequency mismatch between VM (2532609 kHz) and host (2532608 kHz), and TSC scaling unavailable
                                                          2019-10-17T10:28:42.983157Z qemu-system-x86_64: warning: TSC frequency mismatch between VM (2532609 kHz) and host (2532608 kHz), and TSC scaling unavailable
                                                          2019-10-17T10:28:42.983672Z qemu-system-x86_64: load of migration failed: Invalid argument

--- ---

Identified as:
Dr. David Alan Gilbert (dgilbert-h) wrote 1 hour ago: #5
Daniel: That's a different problem; 'Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0'; so should probably be a separate bug.

I'd bet on this being the one fixed by 2bbadb08ce272d65e1f78621002008b07d1e0f03

--- ---

That fix is only in qemu 4.1, so this is an open bug for Ubuntu and the Cloud Archive.


tags: added: server-next

I can confirm this with a Bionic to Eoan migration of a guest that has a balloon device.

Guest config:
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
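Only guests with a virtio balloon device are affected. As a minimal sketch, assuming the domain XML has already been obtained (for example with `virsh dumpxml <guest>`), Python's standard `xml.etree.ElementTree` can check for such a device; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_virtio_balloon(domain_xml: str) -> bool:
    """True if the libvirt domain XML contains <memballoon model='virtio'>."""
    root = ET.fromstring(domain_xml)
    return any(mb.get("model") == "virtio" for mb in root.iter("memballoon"))


sample = """<domain type='kvm'>
  <devices>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
</domain>"""

print(has_virtio_balloon(sample))  # True
```

A guest configured with `<memballoon model='none'/>` would return False and should not hit this migration failure.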

root@testkvm-bionic-from:~# virsh migrate --unsafe --live testguest qemu+ssh://10.192.69.27/system
error: internal error: qemu unexpectedly closed the monitor: 2019-10-21T13:44:16.155100Z qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
2019-10-21T13:44:18.530641Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: e1 device: 1 cmask: ff wmask: c0 w1cmask:0
2019-10-21T13:44:18.530657Z qemu-system-x86_64: Failed to load PCIDevice:config
2019-10-21T13:44:18.530660Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
2019-10-21T13:44:18.530663Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:06.0/virtio-balloon'
2019-10-21T13:44:18.530839Z qemu-system-x86_64: load of migration failed: Invalid argument

Changed in qemu (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Christian Ehrhardt  (paelzer)

FYI: The fix in my PPA worked.
I can start uploading as soon as Focal is open.

Changed in qemu (Ubuntu Eoan):
status: New → Triaged
Changed in qemu (Ubuntu Focal):
status: Confirmed → Triaged
Changed in qemu (Ubuntu Eoan):
assignee: nobody → Christian Ehrhardt  (paelzer)
importance: Undecided → High

Now that Focal is open, I have opened a proper Focal MP replacing the old one, and also an Eoan SRU MP right away.
=> https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/374770
=> https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/374771

description: updated
description: updated

FYI: uploaded to 20.04 Focal, considering SRUs (Eoan) after this completes

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.0+dfsg-0ubuntu10

---------------
qemu (1:4.0+dfsg-0ubuntu10) focal; urgency=medium

  * d/p/ubuntu/lp-1848556-curl-Handle-success-in-multi_check_completion.patch:
    fix a potential hang when qemu or qemu-img were accessing http backed
    disks via libcurl (LP: #1848556)
  * d/p/u/lp-1848497-virtio-balloon-fix-QEMU-4.0-config-size-migration-in.patch:
    fix migration issue from qemu <4.0 when using virtio-balloon (LP: #1848497)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Oct 2019 14:51:45 +0200

Changed in qemu (Ubuntu Focal):
status: Triaged → Fix Released

Focal is complete, the MPs are reviewed, the SRU templates are ready and pre-tests are done.
Uploading to eoan-unapproved for the SRU team's consideration.

Tonight this was first accepted and then immediately rejected, as it was superseded by a security fix.

=> Rebased and uploaded 1:4.0+dfsg-0ubuntu9.2 to eoan-unapproved again.

Hello Christian, or anyone else affected,

Accepted qemu into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:4.0+dfsg-0ubuntu9.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Eoan):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-eoan

All autopkgtests for the newly accepted qemu (1:4.0+dfsg-0ubuntu9.2) for eoan have finished running.
The following regressions have been reported in tests triggered by the package:

ganeti/2.16.0-5ubuntu1 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/eoan/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

$ virsh migrate --unsafe --live f-testmigrate qemu+ssh://10.253.194.110/system
(no messages)

With the update from -proposed, migration from Disco to Eoan now works fine.

Setting verified

tags: added: verification-done verification-done-eoan
removed: verification-needed verification-needed-eoan
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.0+dfsg-0ubuntu9.2

---------------
qemu (1:4.0+dfsg-0ubuntu9.2) eoan; urgency=medium

  * d/p/ubuntu/lp-1848556-curl-Handle-success-in-multi_check_completion.patch:
    fix a potential hang when qemu or qemu-img were accessing http backed
    disks via libcurl (LP: #1848556)
  * d/p/u/lp-1848497-virtio-balloon-fix-QEMU-4.0-config-size-migration-in.patch:
    fix migration issue from qemu <4.0 when using virtio-balloon (LP: #1848497)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Oct 2019 14:51:45 +0200
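To check whether a host already runs a fixed package, the installed version can be compared against the versions in the changelog entries (1:4.0+dfsg-0ubuntu9.2 for Eoan, 1:4.0+dfsg-0ubuntu10 for Focal). The Python sketch below re-implements Debian version ordering as I understand dpkg's algorithm (epoch, then upstream version, then revision; digit runs compared numerically, `~` sorting before everything, even the end of the string). It is an illustration only; on a real system `dpkg --compare-versions` is the authoritative tool. The `FIXED` table and `has_balloon_fix()` are hypothetical helpers.

```python
def _order(c: str) -> int:
    """Sort weight of one character in a non-digit run ('' = end of string)."""
    if c == "" or c.isdigit():
        return 0
    if c.isalpha():
        return ord(c)
    if c == "~":
        return -1
    return ord(c) + 256


def _compare_fragment(a: str, b: str) -> int:
    """Compare upstream or revision strings: alternating non-digit/digit runs."""
    def at(s: str, k: int) -> str:
        return s[k] if k < len(s) else ""

    i = j = 0
    while i < len(a) or j < len(b):
        # Non-digit runs, character by character.
        while (at(a, i) and not at(a, i).isdigit()) or (
                at(b, j) and not at(b, j).isdigit()):
            ac, bc = _order(at(a, i)), _order(at(b, j))
            if ac != bc:
                return ac - bc
            i += 1
            j += 1
        # Digit runs, numerically: skip leading zeros, longer run wins.
        while at(a, i) == "0":
            i += 1
        while at(b, j) == "0":
            j += 1
        first_diff = 0
        while at(a, i).isdigit() and at(b, j).isdigit():
            if not first_diff:
                first_diff = ord(a[i]) - ord(b[j])
            i += 1
            j += 1
        if at(a, i).isdigit():
            return 1
        if at(b, j).isdigit():
            return -1
        if first_diff:
            return first_diff
    return 0


def _split(version: str) -> tuple[int, str, str]:
    """Split 'epoch:upstream-revision'; epoch defaults to 0, revision to ''."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0 or 1, like a three-way comparison of Debian versions."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    result = _compare_fragment(ua, ub) or _compare_fragment(ra, rb)
    return (result > 0) - (result < 0)


# Hypothetical table built from the changelog entries in this report.
FIXED = {
    "eoan": "1:4.0+dfsg-0ubuntu9.2",
    "focal": "1:4.0+dfsg-0ubuntu10",
}


def has_balloon_fix(series: str, installed: str) -> bool:
    return compare_versions(installed, FIXED[series]) >= 0


print(has_balloon_fix("eoan", "1:4.0+dfsg-0ubuntu9.1"))  # False
print(has_balloon_fix("eoan", "1:4.0+dfsg-0ubuntu9.2"))  # True
```

Note that 0ubuntu9.2 sorts before 0ubuntu10 because the digit runs 9 and 10 are compared as numbers, not as strings.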

Changed in qemu (Ubuntu Eoan):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
