virtio-balloon change breaks migration from qemu prior to 4.0

Bug #1848497 reported by Christian Ehrhardt 
This bug affects 5 people
Affects                 Importance   Assigned to
Ubuntu Cloud Archive    Undecided    Unassigned
  Stein                 High         Unassigned
  Train                 Undecided    Unassigned
  Ussuri                Undecided    Unassigned
qemu (Ubuntu)           High         Christian Ehrhardt
  Eoan                  High         Christian Ehrhardt
  Focal                 High         Christian Ehrhardt

Bug Description

[Impact]

 * Due to a bug in qemu 4.0, the config size for virtio-balloon changed.
 * This breaks migration from pre-4.0 qemu because the PCI BAR size is
   affected.

 * Upstream has realized this and fixed it in 4.1; this backports the fix
   to qemu 4.0 in Ubuntu Eoan.
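
To illustrate why a larger device config can change the PCI BAR size, here is a minimal sketch. It assumes the legacy I/O BAR covers a fixed header plus the device config and is rounded up to a power of two. The byte counts (20, 8 and 16) are assumptions picked for illustration and are not quoted anywhere in this report.

```python
# Illustrative sketch only: the byte counts below are assumptions,
# not values quoted in this bug report.
LEGACY_HEADER_BYTES = 20   # assumed legacy virtio-pci header size (no MSI-X)
OLD_BALLOON_CONFIG = 8     # assumed pre-4.0 balloon config size
NEW_BALLOON_CONFIG = 16    # assumed qemu 4.0 balloon config size


def pow2ceil(n: int) -> int:
    """Round n up to the next power of two (PCI BAR sizes are powers of two)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def legacy_bar_size(config_bytes: int) -> int:
    """BAR size needed for the assumed header plus config_bytes of device config."""
    return pow2ceil(LEGACY_HEADER_BYTES + config_bytes)


old_bar = legacy_bar_size(OLD_BALLOON_CONFIG)
new_bar = legacy_bar_size(NEW_BALLOON_CONFIG)
print(old_bar, new_bar)
```

Under these assumptions a few extra config bytes are enough to cross a power-of-two boundary, so source and target would disagree about the BAR size.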

[Test Case]

 * Take a pre-eoan (pre qemu 4.0) guest and check that your setup can
   migrate it back and forth with an eoan/qemu-4.0 target.
   Note: always use a versioned machine type like pc-i440fx-disco (also
   the default if you use disco as source).
   Then add a virtio-balloon device to the guest on pre-4.0 and migrate it
   again.
   Without the fix, the following error will show up:
   get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0

 * Migrations from unfixed to fixed qemu 4.0 should work as well. The
   other way around might also work (the size didn't change), but there
   are no guarantees (no logic in the target).
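
The error line above can be read as a per-byte comparison of PCI config space. Offset 0x10 is where the first base address register lives. The sketch below assumes the target compares the incoming byte with its own on all bits that are checked (cmask) and not guest-writable (wmask, w1cmask). This is a reading of the message, not code quoted from qemu, and the function name is hypothetical.

```python
def config_byte_matches(read: int, device: int, cmask: int,
                        wmask: int, w1cmask: int) -> bool:
    """True if the incoming byte agrees with the target on every checked,
    non-writable bit (assumed semantics of the masks in the message)."""
    return ((read ^ device) & cmask & ~wmask & ~w1cmask & 0xFF) == 0


# Values from the error: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
print(config_byte_matches(0xA1, 0x01, 0xFF, 0xC0, 0x00))  # False

# Hypothetical target whose BAR is half the size, so one more bit is writable
print(config_byte_matches(0xA1, 0x01, 0xFF, 0xE0, 0x00))  # True
```

With wmask c0 the low six bits count as fixed, so an address byte of a0 that was valid for a smaller region no longer fits and the load is rejected.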

[Regression Potential]

 * Messing with machine types is always dangerous, as in case of a mistake
   things get even more complex. But in this case things seem rather
   straightforward. Pre-4.0 code all behaves the same, only 4.0 gets the
   new attribute set, and later code has logic to handle dynamic sizes.
   That way I think we are safe from machine-type regressions.
 * As for the change in behavior, it changes pre-4.0 migrations, which atm
   are broken if a virtio-balloon device is present. There is nothing to
   break further in that use case, and if such a device isn't present it
   shouldn't change anything. Therefore IMHO safe again.
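
The machine-type logic described above could be modelled roughly as follows. This is a sketch under assumptions, not the actual patch: the attribute name `qemu_4_0_config_size`, the `extra_config_features` flag and the sizes 8 and 16 are all hypothetical stand-ins.

```python
from dataclasses import dataclass

# Assumed sizes for illustration; not quoted in this report.
OLD_CONFIG_BYTES = 8
FULL_CONFIG_BYTES = 16


@dataclass
class BalloonDevice:
    # Hypothetical name for the attribute that only 4.0 machine types set.
    qemu_4_0_config_size: bool = False
    # Hypothetical flag: features that need the larger layout are enabled.
    extra_config_features: bool = False


def balloon_config_size(dev: BalloonDevice) -> int:
    """Pick the config size the way the description outlines it."""
    if dev.qemu_4_0_config_size:
        # 4.0 machine types keep the larger size they shipped with.
        return FULL_CONFIG_BYTES
    if dev.extra_config_features:
        # Later code sizes the config dynamically from enabled features.
        return FULL_CONFIG_BYTES
    # Pre-4.0 machine types keep the old size.
    return OLD_CONFIG_BYTES


print(balloon_config_size(BalloonDevice()))
print(balloon_config_size(BalloonDevice(qemu_4_0_config_size=True)))
```

Keeping the decision in one function means only the 4.0 machine types ever see the attribute, which is why the other machine types should be unaffected.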

[Other Info]

 * n/a

---

Related but not the same as bug 1838569 which had two error signatures.
The first being covered there and the second handled here.

--- ---
Quote from https://bugs.launchpad.net/cloud-archive/+bug/1838569/comments/4
Daniel 'f0o' Preussker (dpreussker) wrote 1 hour ago: #4
With recent release of OpenStack Train this issue reappears...

Upgrading from Stein to Train will require all VMs to be hard-rebooted to be migrated as a final step because Live Migration fails with:

Oct 17 10:28:43 h2.1.openstack.r0cket.net libvirtd[1545]: Unable to read from monitor: Connection reset by peer
Oct 17 10:28:43 h2.1.openstack.r0cket.net libvirtd[1545]: internal error: qemu unexpectedly closed the monitor: 2019-10-17T10:28:42.981201Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
                                                          2019-10-17T10:28:42.981250Z qemu-system-x86_64: Failed to load PCIDevice:config
                                                          2019-10-17T10:28:42.981263Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
                                                          2019-10-17T10:28:42.981272Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:05.0/virtio-balloon'
                                                          2019-10-17T10:28:42.981391Z qemu-system-x86_64: warning: TSC frequency mismatch between VM (2532609 kHz) and host (2532608 kHz), and TSC scaling unavailable
                                                          2019-10-17T10:28:42.983157Z qemu-system-x86_64: warning: TSC frequency mismatch between VM (2532609 kHz) and host (2532608 kHz), and TSC scaling unavailable
                                                          2019-10-17T10:28:42.983672Z qemu-system-x86_64: load of migration failed: Invalid argument

--- ---

Identified as:
Dr. David Alan Gilbert (dgilbert-h) wrote 1 hour ago: #5
Daniel: That's a different problem; 'Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0'; so should probably be a separate bug.

I'd bet on this being the one fixed by 2bbadb08ce272d65e1f78621002008b07d1e0f03

--- ---

And that fix is only in qemu 4.1, so this is an open bug for Ubuntu and the Cloud Archive.
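
To tell this signature apart from other migration failures in a libvirtd or qemu log, something like the following sketch could help. The regular expressions are written against the log lines quoted above only, and the helper name `summarize` is hypothetical.

```python
import re

# Patterns derived from the log lines quoted in this report.
BAD_CONFIG = re.compile(
    r"get_pci_config_device: Bad config data: "
    r"i=0x(?P<offset>[0-9a-f]+) read: (?P<read>[0-9a-f]+) "
    r"device: (?P<device>[0-9a-f]+) cmask: (?P<cmask>[0-9a-f]+) "
    r"wmask: (?P<wmask>[0-9a-f]+) w1cmask:\s*(?P<w1cmask>[0-9a-f]+)"
)
FAILED_DEVICE = re.compile(
    r"error while loading state for instance \S+ of device '(?P<dev>[^']+)'"
)


def summarize(log_text):
    """Return the parsed mismatch fields plus the failing device, or None."""
    bad = BAD_CONFIG.search(log_text)
    dev = FAILED_DEVICE.search(log_text)
    if not bad or not dev:
        return None
    fields = {name: int(value, 16) for name, value in bad.groupdict().items()}
    fields["device_path"] = dev.group("dev")
    return fields


SAMPLE = (
    "qemu-system-x86_64: get_pci_config_device: Bad config data: "
    "i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0\n"
    "qemu-system-x86_64: Failed to load PCIDevice:config\n"
    "qemu-system-x86_64: Failed to load virtio-balloon:virtio\n"
    "qemu-system-x86_64: error while loading state for instance 0x0 "
    "of device '0000:00:05.0/virtio-balloon'\n"
)
print(summarize(SAMPLE))
```

A mismatch at offset 0x10 on a device path ending in virtio-balloon is the combination discussed in this bug.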

tags: added: server-next
Christian Ehrhardt  (paelzer) wrote :

With a migration from Bionic to Eoan with a balloon device I can confirm this.

Guest config:
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>

root@testkvm-bionic-from:~# virsh migrate --unsafe --live testguest qemu+ssh://10.192.69.27/system
error: internal error: qemu unexpectedly closed the monitor: 2019-10-21T13:44:16.155100Z qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
2019-10-21T13:44:18.530641Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: e1 device: 1 cmask: ff wmask: c0 w1cmask:0
2019-10-21T13:44:18.530657Z qemu-system-x86_64: Failed to load PCIDevice:config
2019-10-21T13:44:18.530660Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
2019-10-21T13:44:18.530663Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:06.0/virtio-balloon'
2019-10-21T13:44:18.530839Z qemu-system-x86_64: load of migration failed: Invalid argument
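
To find guests that carry the device from the guest config above, the domain XML (for example as printed by `virsh dumpxml`) can be scanned for a virtio memballoon element. A minimal sketch follows. The surrounding `<domain>`/`<devices>` wrapper in the example is assumed, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_virtio_balloon(domain_xml: str) -> bool:
    """True if the domain XML contains a memballoon device with model 'virtio'."""
    root = ET.fromstring(domain_xml.strip())
    return any(mb.get("model") == "virtio" for mb in root.iter("memballoon"))


# Wrapper elements are assumed; the memballoon part is the config quoted above.
EXAMPLE = """
<domain type='kvm'>
  <devices>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
</domain>
"""
print(has_virtio_balloon(EXAMPLE))  # True
```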

Changed in qemu (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Christian Ehrhardt  (paelzer)
Christian Ehrhardt  (paelzer) wrote :

FYI: The fix in my PPA worked.
I can start uploading as soon as Focal is open.

Changed in qemu (Ubuntu Eoan):
status: New → Triaged
Changed in qemu (Ubuntu Focal):
status: Confirmed → Triaged
Changed in qemu (Ubuntu Eoan):
assignee: nobody → Christian Ehrhardt  (paelzer)
importance: Undecided → High
Christian Ehrhardt  (paelzer) wrote :

Now that Focal is open I have opened a proper Focal MP replacing the old one, and also an Eoan SRU MP right away.
=> https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/374770
=> https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/374771

description: updated
description: updated
Christian Ehrhardt  (paelzer) wrote :

FYI: uploaded to 20.04 Focal, considering SRUs (Eoan) after this completes

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.0+dfsg-0ubuntu10

---------------
qemu (1:4.0+dfsg-0ubuntu10) focal; urgency=medium

  * d/p/ubuntu/lp-1848556-curl-Handle-success-in-multi_check_completion.patch:
    fix a potential hang when qemu or qemu-img where accessing http backed
    disks via libcurl (LP: #1848556)
  * d/p/u/lp-1848497-virtio-balloon-fix-QEMU-4.0-config-size-migration-in.patch:
    fix migration issue from qemu <4.0 when using virtio-balloon (LP: #1848497)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Oct 2019 14:51:45 +0200

Changed in qemu (Ubuntu Focal):
status: Triaged → Fix Released
Christian Ehrhardt  (paelzer) wrote :

Focal is complete, the MPs are reviewed, the SRU templates are ready and the pre-tests are done.
Uploading to E-unapproved for the SRU Team's consideration.

Christian Ehrhardt  (paelzer) wrote :

Tonight this was first accepted and then immediately rejected, as it was superseded by a security fix.

=> Rebased and uploaded 1:4.0+dfsg-0ubuntu9.2 to eoan-unapproved again.

Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Christian, or anyone else affected,

Accepted qemu into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:4.0+dfsg-0ubuntu9.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Eoan):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-eoan
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:4.0+dfsg-0ubuntu9.2)

All autopkgtests for the newly accepted qemu (1:4.0+dfsg-0ubuntu9.2) for eoan have finished running.
The following regressions have been reported in tests triggered by the package:

ganeti/2.16.0-5ubuntu1 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/eoan/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Christian Ehrhardt  (paelzer) wrote :

$ virsh migrate --unsafe --live f-testmigrate qemu+ssh://10.253.194.110/system
(no messages)

With the update from proposed it is migrating just fine from Disco to Eoan now.

Setting verified

tags: added: verification-done verification-done-eoan
removed: verification-needed verification-needed-eoan
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.0+dfsg-0ubuntu9.2

---------------
qemu (1:4.0+dfsg-0ubuntu9.2) eoan; urgency=medium

  * d/p/ubuntu/lp-1848556-curl-Handle-success-in-multi_check_completion.patch:
    fix a potential hang when qemu or qemu-img where accessing http backed
    disks via libcurl (LP: #1848556)
  * d/p/u/lp-1848497-virtio-balloon-fix-QEMU-4.0-config-size-migration-in.patch:
    fix migration issue from qemu <4.0 when using virtio-balloon (LP: #1848497)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Oct 2019 14:51:45 +0200

Changed in qemu (Ubuntu Eoan):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Sam Morrison (sorrison) wrote :

I'm seeing what I think is this issue for the stein cloud archive packages.

Source host (openstack rocky release):
qemu-system-x86:
  Installed: 1:2.11+dfsg-1ubuntu7.26

Destination host (openstack stein release):
qemu-system-x86:
  Installed: 1:3.1+dfsg-2ubuntu3.7~cloud0

Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2020-05-24T22:47:19.677896Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
2020-05-24T22:47:19.677922Z qemu-system-x86_64: Failed to load PCIDevice:config
2020-05-24T22:47:19.677926Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
2020-05-24T22:47:19.677929Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:05.0/virtio-balloon'
2020-05-24T22:47:19.678086Z qemu-system-x86_64: load of migration failed: Invalid argument: libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-05-24T22:47:19.677896Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0

Bit stuck on this so any pointers in the right direction would be great

Christian Ehrhardt  (paelzer) wrote :

Hi Sam,
Stein would atm be on "qemu - 1:3.1+dfsg-2ubuntu3.7~cloud0" and that matches your versions as reported.

3.1 didn't have the particular bug that was discussed and fixed here as it was only broken by a later commit [1] in qemu 4.0.

Therefore I'd ask you to file a new bug report for it.
Nevertheless, I agree that the signature matches, which is odd.

When you open the new bug please also (like this bug) open it against qemu as well as "Ubuntu Cloud Archive" as I'm interested to hear if the UCA Team has seen similar.

[1]: https://git.qemu.org/?p=qemu.git;a=commit;h=2bbadb08ce272d65e1f78621002008b07d1e0f03

Christian Ehrhardt  (paelzer) wrote :

BTW 2.11 -> 3.1, without the cloud archive in mind, matches a migration from Bionic to Disco.
I have checked my test logs (from a while ago, since Disco itself is EOL).
But at least on January 13th and 16th of this year the migrations 2.11 -> 3.1 were still OK.

In my log that was between
B: qemu: 1:2.11+dfsg-1ubuntu7.21 libvirt: 4.0.0-1ubuntu8.14
D: qemu: 1:3.1+dfsg-2ubuntu3.7 libvirt: 5.0.0-1ubuntu2.6

  7.2.0 (11:51:19): Test live migration (extra option '') of a bionic guest testkvm-bionic-from/testkvm-disco-from
    7.2.1 (11:51:19): live migration (extra option '') testkvm-bionic-from -> testkvm-disco-from => Pass
    7.2.2 (11:51:26): Check if guest kvmguest-bionic-normal on testkvm-disco-from is alive => Pass

@Sam - please feel free to copy&paste that info into the new bug you are gonna be creating.

/me is now stopping to bump this resolved case here

Sam Morrison (sorrison) wrote :

Thanks for your pointers, please see https://bugs.launchpad.net/cloud-archive/+bug/1882416

Changed in cloud-archive:
status: New → Fix Released
Dan Streetman (ddstreet) wrote :

> please see https://bugs.launchpad.net/cloud-archive/+bug/1882416

ok i marked that bug as a dup of this one, and we'll prepare the patches for Stein using this bug

Dan Streetman (ddstreet) wrote :

The virtio_balloon_config size was changed in the Disco version of qemu because of bug 1836154, which backported the size change in patch:
ubuntu/lp-1836154-include-update-Linux-headers-to-4.21-rc1-5.0-rc1.patch

which was introduced in qemu version 1:3.1+dfsg-2ubuntu3.4:
http://launchpadlibrarian.net/438501259/qemu_1%3A3.1+dfsg-2ubuntu3.3_1%3A3.1+dfsg-2ubuntu3.4.diff.gz

so the fix from this bug also needs to be applied to the qemu in Stein

Dan Streetman (ddstreet) wrote :

@vtapia is preparing the patches for qemu in Stein for this bug and bug 1847361

Victor Tapia (vtapia) wrote :

Attached backported fix to bug 1847361. Fixes live migrations from 1:2.11+dfsg-1ubuntu7.32 (Queens/Rocky) and 1:3.1+dfsg-2ubuntu3.3 or earlier (Stein) to latest Stein. I also tested the migration from the patched Stein to Train and it works as expected.

Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Christian, or anyone else affected,

Accepted qemu into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
Trent Lloyd (lathiat) wrote :

I have verified the package for this specific virtio-balloon issue discussed in this bug only.

Migrating from 3.1+dfsg-2ubuntu3.2~cloud0
- To the latest released version (3.1+dfsg-2ubuntu3.7~cloud0) fails due to balloon setup

2020-10-26T07:40:30.157066Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
2020-10-26T07:40:30.157431Z qemu-system-x86_64: Failed to load PCIDevice:config
2020-10-26T07:40:30.157443Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
2020-10-26T07:40:30.157448Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-balloon'
2020-10-26T07:40:30.159527Z qemu-system-x86_64: load of migration failed: Invalid argument
2020-10-26 07:40:30.223+0000: shutting down, reason=failed

- To the proposed version (3.1+dfsg-2ubuntu3.7~cloud1): works as expected

Marking as verification completed.

tags: added: verification-stein-done
removed: verification-stein-needed
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package qemu - 1:3.1+dfsg-2ubuntu3.7~cloud1
---------------

 qemu (1:3.1+dfsg-2ubuntu3.7~cloud1) bionic-stein; urgency=medium
 .
   * d/p/ubuntu/lp-1848497-virtio-balloon-fix-QEMU-4.0-config-size-migration-in.patch:
     fix migration issue from qemu <4.0 when using virtio-balloon (LP: #1848497)
   * allow qemu to load old modules post upgrade (LP: #1847361)
     - d/p/ubuntu/lp-1847361-modules-load-upgrade.patch: to fallback module
       load to a versioned path
     - d/qemu-block-extra.*.in, d/qemu-system-gui.*.in: save shared objects on
       upgrade
     - d/rules: generate maintainer scripts matching package version on build
     - d/rules: enable --enable-module-upgrades where --enable-modules is set
