migration from qemu 2.5 to qemu 2.11 fails for pc-i440fx-wily machines

Bug #1829868 reported by Vladyslav Drok on 2019-05-21
This bug affects 2 people
Affects          Importance  Assigned to
qemu (Ubuntu)    (status tracked in Eoan)
  Bionic         Undecided   Unassigned
  Cosmic         Undecided   Unassigned
  Disco          Undecided   Unassigned
  Eoan           Medium      Unassigned

Bug Description

[Impact]

 * the machine type for wily (which we keep for migration from an early
   Xenial qemu) is broken in qemu >= 2.11

 * fix by correcting the definition of that type

[Test Case]

- Xenial / Bionic system
  $ lxc launch ubuntu-daily:x x-wily --profile default --profile kvm
  $ lxc launch ubuntu-daily:b b-wily --profile default --profile kvm
- set hostid to be different (as we have containers)
  $ vim /etc/libvirt/libvirtd.conf
  $ systemctl restart libvirtd
- exchange ssh keys
  $ ssh-keygen
  $ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
  (host)
  $ lxc file pull --recursive x-wily/root/.ssh .
  $ lxc file push --recursive .ssh b-wily/root/
  $ lxc exec b-wily -- chown -R root:root /root/.ssh
- use uvtool to create the same guest on both systems (FS layout of images)
  $ time uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily \
    arch=amd64 label=daily release=eoan
  $ uvt-kvm create --password ubuntu wilymigrate arch=amd64 release=eoan label=daily
- edit to set wily machine type (on source = Xenial)
  $ virsh edit wilymigrate
- on target remove the former definition
  $ virsh undefine wilymigrate
- Migrate
  $ virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.250/system
  error: internal error: qemu unexpectedly closed the monitor: 2019-05-22T13:04:19.108689Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control
  2019-05-22T13:04:19.151216Z qemu-system-x86_64: Configuration section missing
  2019-05-22T13:04:19.151336Z qemu-system-x86_64: load of migration failed: Invalid argument

With the fix on the target the issue is gone.

[Regression Potential]

The only purpose of keeping these old types is to allow people to "migrate off" the older releases as long as those are still supported. Nevertheless, people are strongly encouraged to move on to a current machine type sooner or later [1]. Given your great analysis you know that already, but others might come by this bug.
This is important to consider when deciding whether and what to change.

No one should ever start a "new" guest of a wily type on Bionic or later.
So I'm not too concerned about the delta we introduce for people who have done that. The time it took for this to be found confirms that even migrations from Xenial with a wily type are rare. We cannot really distinguish between wily types coming from e.g. Xenial (or Trusty-Mitaka) hosts and wily type migrations coming from other Bionic systems (only a problem after a fix to this bug, and only if they are on a different patch level).

So, as mentioned, the one real reason for the wily type to still exist is to migrate off of older systems, and that use case is broken. I therefore consider the concerns above a known regression, but one that is less important than the main use case.

[Other Info]

 * n/a

---

In qemu 2.11 the pc-i440fx-wily machine type is defined as follows by the Ubuntu patch:

 +static void pc_wily_machine_options(MachineClass *m)
 +{
 +    pc_i440fx_2_4_machine_options(m);
 +    pc_i440fx_machine_options(m);
 +    m->desc = "Ubuntu 15.04 PC (i440FX + PIIX, 1996)",
 +    m->default_display = "std";
 +}
 +
 +DEFINE_I440FX_MACHINE(wily, "pc-i440fx-wily", pc_compat_2_3,
 +                      pc_wily_machine_options);

In qemu 2.5, pc_compat_2_3 contained the following skip flags: https://github.com/qemu/qemu/blob/v2.5.1.1/hw/i386/pc_piix.c#L304-L313 (skip configuration, skip section footers, and optional global state)

In qemu 2.11 those skips moved to pc_i440fx_2_3_machine_options:
https://github.com/qemu/qemu/blob/v2.11.2/hw/i386/pc_piix.c#L314-L320
https://github.com/qemu/qemu/blob/v2.11.2/hw/i386/pc_piix.c#L524-L529
https://github.com/qemu/qemu/blob/v2.11.2/include/hw/i386/pc.h#L573-L574
https://github.com/qemu/qemu/blob/v2.11.2/include/hw/compat.h#L193-L205

It happened in commits:
https://github.com/qemu/qemu/commit/71dd4c1a5672cafe9fb89abc83fe2a962f39ec42
https://github.com/qemu/qemu/commit/15c38503253bb9ba9b8efd17662069f69ca2b997
https://github.com/qemu/qemu/commit/5272298c48eb3a01c41a7822e6303d0a0a05f004

But pc_wily_machine_options still calls pc_i440fx_2_4_machine_options instead of pc_i440fx_2_3_machine_options, so the destination host does not skip the migration configuration section and the migration fails:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name guest=instance-00054361,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-22-instance-00054361/master-key.aes -machine pc-i440fx-wily,accel=kvm,usb=off,dump-guest-core=off -cpu Broadwell -m 32768 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid 660fed6d-bb56-4e15-b866-007419be4cf3 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=15.1.5,serial=7074a01b-b759-4e91-978a-fde846e2ec9e,uuid=660fed6d-bb56-4e15-b866-007419be4cf3,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-22-instance-00054361/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=xx,keyid=masterKey0,iv=xx,format=base64 -drive 'file=rbd:ephemeral-vms-ssd/660fed6d-bb56-4e15-b866-007419be4cf3_disk:id=nova:auth_supported=cephx\;none:mon_host=10.154.29.44\:6789\;10.154.29.60\:6789\;10.154.29.76\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap,throttling.iops-total=1000' -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -object secret,id=ide0-0-0-secret0,data=xx,keyid=masterKey0,iv=xx,format=base64 -drive 'file=rbd:ephemeral-vms-ssd/660fed6d-bb56-4e15-b866-007419be4cf3_disk.config:id=nova:auth_supported=cephx\;none:mon_host=10.154.29.44\:6789\;10.154.29.60\:6789\;10.154.29.76\:6789,file.password-secret=ide0-0-0-secret0,format=raw,if=none,id=drive-ide0-0-0,readonly=on,cache=writeback,discard=unmap,throttling.iops-total=1000' -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev 
tap,fd=55,id=hostnet0,vhost=on,vhostfd=58 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=02:60:27:5a:aa:ec,bus=pci.0,addr=0x3 -add-fd set=2,fd=60 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 10.170.4.69:13 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
2019-05-14 11:53:57.583+0000: Domain id=22 is tainted: shell-scripts
2019-05-14T11:53:57.608860Z qemu-system-x86_64: -chardev pty,id=charserial1: char device redirected to /dev/pts/14 (label charserial1)
2019-05-14T11:53:57.978684Z qemu-system-x86_64: Configuration section missing
2019-05-14T11:53:57.978786Z qemu-system-x86_64: load of migration failed: Invalid argument
2019-05-14 11:53:57.998+0000: shutting down, reason=failed
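
The failure in the log above can be illustrated with a small stand-alone model. This is a toy sketch, not QEMU code: the struct names, the single send_configuration flag and all toy_* helpers are hypothetical. The only thing taken from the analysis is the assumption that a type with the 2.3-era skips omits the configuration section, while a type built on the 2.4 options sends it and therefore expects to receive it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: none of these types or functions exist in QEMU. */
struct toy_machine {
    const char *name;
    bool send_configuration; /* false models the 2.3-era "skip configuration" */
};

struct toy_stream {
    bool has_configuration;
};

/* Source side: emit the configuration section only if the type asks for it. */
static struct toy_stream toy_save(const struct toy_machine *src)
{
    struct toy_stream s = { .has_configuration = src->send_configuration };
    return s;
}

/* Destination side: a type that sends the section also expects to load it. */
static int toy_load(const struct toy_machine *dst, const struct toy_stream *s)
{
    assert(dst && s);
    if (dst->send_configuration && !s->has_configuration) {
        fprintf(stderr, "%s: Configuration section missing\n", dst->name);
        return -1;
    }
    return 0;
}

/* Save on src, load on dst; returns 0 on success and -1 on failure. */
static int toy_migrate(const struct toy_machine *src,
                       const struct toy_machine *dst)
{
    struct toy_stream s = toy_save(src);
    return toy_load(dst, &s);
}

/* The same machine type name with three different definitions. */
static const struct toy_machine wily_on_2_5 = {
    "pc-i440fx-wily (qemu 2.5, 2.3 skips)", false
};
static const struct toy_machine wily_on_2_11 = {
    "pc-i440fx-wily (qemu 2.11, 2.4 options)", true
};
static const struct toy_machine wily_fixed = {
    "pc-i440fx-wily (2.3 skips restored)", false
};
```

In this model toy_migrate(&wily_on_2_5, &wily_on_2_11) fails the same way the destination does above, while toy_migrate(&wily_on_2_5, &wily_fixed) succeeds.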

Vladyslav Drok (vdrok) on 2019-05-21
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed
Vladyslav Drok (vdrok) on 2019-05-21
description: updated

Thanks Vladyslav for the bug report and the great analysis done already.

/self rant break
Oh how I hate machine types. It is sad that we still need to add them at all for some special cases and for encapsulating changes. But while most of them are sort of "ok", the wily type clearly tops my personal hate list.
/end rant

---

The only purpose of keeping these old types is to allow people to "migrate off" the older releases as long as those are still supported. Nevertheless, people are strongly encouraged to move on to a current machine type sooner or later [1]. Given your great analysis you know that already, but others might come by this bug.
This is important to consider when deciding whether and what to change.

No one should ever start a "new" guest of a wily type on Bionic or later.
So I'm not too concerned about the delta we introduce for people who have done that. The time it took for this to be found confirms that even migrations from Xenial with a wily type are rare. We cannot really distinguish between wily types coming from e.g. Xenial (or Trusty-Mitaka) hosts and wily type migrations coming from other Bionic systems (only a problem after a fix to this bug, and only if they are on a different patch level).

So, as mentioned, the one real reason for the wily type to still exist is to migrate off of older systems, and that use case is broken. I therefore consider the concerns above a known regression, but one that is less important than the main use case.

---

I'd want to avoid touching pc_compat_2_3 itself (that would affect other types), but just switching wily from pc_i440fx_2_4_machine_options to pc_i440fx_2_3_machine_options seems wrong as well, as this would make it lose the other 2.4 attributes it had. (Have I mentioned that I hate the wily type most for its broken definition?)

Maybe we should define a HW_COMPAT_WILY that fuses HW_COMPAT_2_4 and HW_COMPAT_2_3 so that it actually matches the past.

Let me prep a change with that and we can give it a try.
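
As a rough illustration of the "fused" compat list idea, here is a stand-alone sketch. It is not the patch that was uploaded and it does not use QEMU's headers: struct toy_prop, the TOY_* macros and toy_wily_lookup are hypothetical, the "toy-device" entry is a placeholder for whatever the 2.4 list really carries, and the three "migration" property names are an assumption about how the skips named in the description are spelled (check the HW_COMPAT_2_3 definition linked above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a compat property entry; not QEMU's own type. */
struct toy_prop {
    const char *driver;
    const char *property;
    const char *value;
};

/* Placeholder for the contents of the 2.4 compat list (hypothetical entry). */
#define TOY_COMPAT_2_4 \
    { "toy-device", "toy-2-5-feature", "off" },

/* The three migration skips named in the description (names assumed). */
#define TOY_COMPAT_2_3_MIGRATION \
    { "migration", "send-configuration", "off" }, \
    { "migration", "send-section-footer", "off" }, \
    { "migration", "store-global-state", "off" },

/* The fused list: everything from 2.4 plus the 2.3 migration skips. */
#define TOY_COMPAT_WILY \
    TOY_COMPAT_2_4 \
    TOY_COMPAT_2_3_MIGRATION

static const struct toy_prop toy_wily_props[] = { TOY_COMPAT_WILY };

/* Return the value set for driver.property, or NULL if there is no entry. */
static const char *toy_wily_lookup(const char *driver, const char *property)
{
    size_t n = sizeof(toy_wily_props) / sizeof(toy_wily_props[0]);

    assert(driver && property);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_wily_props[i].driver, driver) == 0 &&
            strcmp(toy_wily_props[i].property, property) == 0) {
            return toy_wily_props[i].value;
        }
    }
    return NULL;
}
```

Composing the list from macros keeps the wily-specific delta in one place, so the shared 2.3 and 2.4 definitions used by other machine types stay untouched. Which 2.3 entries beyond the migration skips belong in the real list is exactly the open question of this comment.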

[1]: https://wiki.ubuntu.com/QemuKVMMigration

Changed in qemu (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Medium

Ok, the first build (yet untested) is complete in PPA
https://launchpad.net/~paelzer/+archive/ubuntu/bug-1829868-broken-wily-type/+packages

Let me set up a testbed to check on that ...

TODO add one wily migration to the regression test suite (could work via cross + setmt?)

Collecting steps for the SRU template that will follow later.

- Xenial / Bionic system
  $ lxc launch ubuntu-daily:x x-wily --profile default --profile kvm
  $ lxc launch ubuntu-daily:b b-wily --profile default --profile kvm
- set hostid to be different (as we have containers)
  $ vim /etc/libvirt/libvirtd.conf
  $ systemctl restart libvirtd
- exchange ssh keys
  $ ssh-keygen
  $ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
  (host)
  $ lxc file pull --recursive x-wily/root/.ssh .
  $ lxc file push --recursive .ssh b-wily/root/
  $ lxc exec b-wily -- chown -R root:root /root/.ssh
- use uvtool to create the same guest on both systems (FS layout of images)
  $ time uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily \
    arch=amd64 label=daily release=eoan
  $ uvt-kvm create --password ubuntu wilymigrate arch=amd64 release=eoan label=daily
- edit to set wily machine type (on source = Xenial)
  $ virsh edit wilymigrate
- on target remove the former definition
  $ virsh undefine wilymigrate
- Migrate
  $ virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.250/system
  error: internal error: qemu unexpectedly closed the monitor: 2019-05-22T13:04:19.108689Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control
  2019-05-22T13:04:19.151216Z qemu-system-x86_64: Configuration section missing
  2019-05-22T13:04:19.151336Z qemu-system-x86_64: load of migration failed: Invalid argument

(Useful to do extra checks e.g. if the guest is alive)

Ok, the above reproduced the bug with qemu 1:2.11+dfsg-1ubuntu7.13, let's try the PPA.
Trying 1:2.11+dfsg-1ubuntu7.14~ppa2 ...

- Migrate
  root@x-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.250/system
  (worked)
  And back:
  root@b-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.237/system
  (worked)
  Check if it is alive and still up (trivial)
  root@x-wily:~# uvt-kvm ssh --insecure wilymigrate "uptime"
   13:07:44 up 9 min, 0 users, load average: 0.00, 0.03, 0.04

Ok, the PPA would work in the direct test case for this reported issue.

@Vladyslav - would you mind testing the PPA yourself as well?

Once you report it as good too, I'll shove it into some bigger regression tests that I have.
TODO: add an explicit wily migration to the latest LTS and to the latest -dev release to the tests.

Once all of these look good we can fix it in Eoan (the current development release) and then open up SRUs for Bionic/Cosmic/Disco.

Changed in qemu (Ubuntu):
status: Triaged → Incomplete

Incomplete waiting for user confirmation (of the PPA)

Vladyslav Drok (vdrok) wrote :

Thank you for the fix, Christian. Our version of the Ubuntu machine types definition patch differs from the one in the PPA; I'll try to sync it first and then check.

Vladyslav Drok (vdrok) wrote :

The changes you propose have worked for me, but I also asked the customer to validate this, just in case.

Vladyslav Drok (vdrok) wrote :

The customer also confirmed the fix worked for them, thank you!

Changed in qemu (Ubuntu):
status: Incomplete → Triaged
Changed in qemu (Ubuntu Disco):
status: New → Triaged
Changed in qemu (Ubuntu Cosmic):
status: New → Triaged
Changed in qemu (Ubuntu Bionic):
status: New → Triaged

Tested migrations between X <-> B/C/D/E all working now.
Tested migration of a wily type guest between B<->C/D as an extra check.

With the fix all is good now.
Paths like X->C->D->B->X worked.
The only degradation is between updated and unupdated systems, but that was expected and acceptable in comparison to the fix.

Uploaded to Eoan and (as it will need a while there) also already to the B/C/D -unapproved queues.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:3.1+dfsg-2ubuntu5

---------------
qemu (1:3.1+dfsg-2ubuntu5) eoan; urgency=medium

  * d/p/ubuntu/define-ubuntu-machine-types.patch: fix wily machine type being
    broken since 2.11 due to 2.3/2.4 version mismatch in its definition to
    fix migrations from old machines (LP: #1829868).
  * d/p/ubuntu/lp-1830704-s390x-cpumodel-ignore-csske-for-expansion.patch
    toleration for future machines (LP: #1830704).

 -- Christian Ehrhardt <email address hidden> Tue, 28 May 2019 11:30:42 +0200

Changed in qemu (Ubuntu Eoan):
status: Triaged → Fix Released
Łukasz Zemczak (sil2100) wrote :

Can we get the SRU information on this bug? Would be great to get a test case and regression potential analysis here.

Added the SRU template. I had it all in the comments and forgot that I hadn't copied it to the description yet - sorry.
Thanks for the ping, sil2100.

description: updated

Hello Vladyslav, or anyone else affected,

Accepted qemu into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:3.1+dfsg-2ubuntu3.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Disco):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-disco
Changed in qemu (Ubuntu Cosmic):
status: Triaged → Fix Committed
tags: added: verification-needed-cosmic
Timo Aaltonen (tjaalton) wrote :

Hello Vladyslav, or anyone else affected,

Accepted qemu into cosmic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.12+dfsg-3ubuntu8.9 in a few hours, and then in the -proposed repository.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-cosmic to verification-done-cosmic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-cosmic. In either case, without details of your testing we will not be able to proceed.

Changed in qemu (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed-bionic
Timo Aaltonen (tjaalton) wrote :

Hello Vladyslav, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-1ubuntu7.15 in a few hours, and then in the -proposed repository.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Trying Xenial -> test-subject and back.

root@x-wily:~# virsh dumpxml wilymigrate | grep '<type'
    <type arch='x86_64' machine='pc-i440fx-wily'>hvm</type>

Cosmic:
root@x-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.15/system
root@c-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.237/system
(worked without issues, checking the guest on the target via login)

Bionic:
root@x-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.250/system
root@b-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.237/system
(worked without issues, checking the guest on the target via login)

Disco:
root@x-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.97/system
root@d-wily:~# virsh migrate --unsafe --live wilymigrate qemu+ssh://10.253.194.237/system
(worked without issues, checking the guest on the target via login)

Poor little guest was punted around all the time, go to sleep ...
$ virsh shutdown wilymigrate

Setting verified.

tags: added: verification-done verification-done-bionic verification-done-cosmic verification-done-disco
removed: verification-needed verification-needed-bionic verification-needed-cosmic verification-needed-disco
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.15

---------------
qemu (1:2.11+dfsg-1ubuntu7.15) bionic; urgency=medium

  * d/p/ubuntu/define-ubuntu-machine-types.patch: fix wily machine type being
    broken since 2.11 due to 2.3/2.4 version mismatch in its definition to
    fix migrations from old machines (LP: #1829868).
  * d/p/ubuntu/lp-1830704-s390x-cpumodel-ignore-csske-for-expansion.patch
    toleration for future machines (LP: #1830704).

 -- Christian Ehrhardt <email address hidden> Wed, 22 May 2019 13:14:15 +0200

Changed in qemu (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.12+dfsg-3ubuntu8.9

---------------
qemu (1:2.12+dfsg-3ubuntu8.9) cosmic; urgency=medium

  * d/p/ubuntu/define-ubuntu-machine-types.patch: fix wily machine type being
    broken since 2.11 due to 2.3/2.4 version mismatch in its definition to
    fix migrations from old machines (LP: #1829868).
  * d/p/ubuntu/lp-1830704-s390x-cpumodel-ignore-csske-for-expansion.patch
    toleration for future machines (LP: #1830704).

 -- Christian Ehrhardt <email address hidden> Tue, 28 May 2019 10:49:09 +0200

Changed in qemu (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:3.1+dfsg-2ubuntu3.2

---------------
qemu (1:3.1+dfsg-2ubuntu3.2) disco; urgency=medium

  * d/p/ubuntu/define-ubuntu-machine-types.patch: fix wily machine type being
    broken since 2.11 due to 2.3/2.4 version mismatch in its definition to
    fix migrations from old machines (LP: #1829868).
  * d/p/ubuntu/lp-1830704-s390x-cpumodel-ignore-csske-for-expansion.patch
    toleration for future machines (LP: #1830704).
  * d/control-in, d/control: add versioned dependencies to libseccomp 2.4 as
    any rebuild against 2.4 as it is in proposed right now will otherwise
    crash (LP: #1830859).

 -- Christian Ehrhardt <email address hidden> Tue, 28 May 2019 10:52:47 +0200

Changed in qemu (Ubuntu Disco):
status: Fix Committed → Fix Released