KVM live migration fails

Bug #1783140 reported by bugproxy
This bug affects 1 person
Affects                           Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Released  High        Canonical Kernel Team
linux (Ubuntu)                    Invalid       High        Canonical Kernel Team
qemu (Ubuntu)                     Fix Released  High        Canonical Server
  Xenial                          Fix Released  Undecided   Unassigned

Bug Description

[Impact]

 * Backport a fix from the 2.6.2 stable branch to qemu 2.5 in Xenial

 * Newer guests might use virtio attributes that are clobbered on
   migration with the old qemu code.

[Test Case]

 * Set up two Xenial hosts on ppc64el

 * Create a guest that has a rather new kernel (>=4.14); I'd recommend
   Bionic

 * Migrate that guest from Host1 to Host2 (see the command sketch below)
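
   A minimal command sketch of the test above (host names and the guest
   name are placeholders):

     host1$ virsh migrate --persistent --live <guest> qemu+ssh://host2/system
     host2$ virsh list   # confirm the guest arrived and is still running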

[Regression Potential]

 * The modification could affect virtio handling in other cases in an
   unexpected way, but mostly related to migrations. So the expected
   regression would be failures to migrate properly.
   I verified plenty of migrations in regression testing, and we had
   this very code in the Yakkety release, as we picked up the 2.6.1 stable
   release back then. Due to that it is actually pretty well tested and
   should not really regress anything out in the wild.

[Other Info]

 * So far this only triggers due to the confused endian marshalling on
   ppc64el, but in theory a different case could trigger it on x86 just as
   much.

---

Environment:
2 POWER8 with Ubuntu 16.04.4 LTS as KVM hypervisor.
1 KVM guest with Ubuntu 18.04 LTS. The virtual disk for the guest is a qcow2 file on an NFS share, accessible from both hypervisors, so live migration is possible and works for all other guests (SLES, RHEL, Ubuntu 16.04).
Live migration of the Ubuntu 18.04 guest fails on ppc, while the same test on an x86_64 environment succeeds.

root@pkvm2:~# virsh migrate --persistent --live p8lnxtst4 qemu+ssh://pkvm1/system
error: internal error: early end of file from monitor, possible problem: 2018-07-23T11:12:25.586385Z qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x38aa inconsistent with Host index 0xa980: delta 0x8f2a
2018-07-23T11:12:25.586434Z qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:01.0/virtio-net'
2018-07-23T11:12:25.587246Z qemu-system-ppc64: load of migration failed: Operation not permitted

root@pkvm2:~# uname -a
Linux pkvm2 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:51:21 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-169882 severity-high targetmilestone-inin1604
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → power (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Foundations Team (canonical-foundations)
importance: Undecided → High
tags: added: triage-g
affects: power (Ubuntu) → qemu (Ubuntu)
Steve Langasek (vorlon)
Changed in qemu (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Server Team (canonical-server)
Changed in ubuntu-power-systems:
assignee: Canonical Foundations Team (canonical-foundations) → nobody
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Server Team (canonical-server)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This issue is either in the 16.04 qemu (not able to handle the new guests) or in 18.04's virtio drivers, so I'm adding a linux task for the kernel first.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Of the listed guests I assume Ubuntu 18.04 has the newest guest kernel (read: virtio drivers).
Several questions to ensure this is right:

1. Could you report the kernel version of each of your test guests (I assume 4.4 and 4.15 for Ubuntu, but the exact versions would be great to know)?

2. It would be great to know whether one of the newer qemu versions that already exist fixes the virtio handling of this case. So once you have recreated your case, could you upgrade through the Ubuntu Cloud Archive provided versions and let us know if some of them work (and if so, which ones exactly)?
That might help a lot to track down a potentially existing virtio fix.
Just upgrade one by one to the newer versions from [1] on the source and target host (see the command sketch at the end of this comment).
Then shut down and re-start your guest on the migration source and migrate again.

3. #2 above changes qemu; the other element we might change is the guest kernel - if you use the 4.15-based HWE kernel in a 16.04 guest, would that then fail just as the 18.04 guest does?

In general, I wonder what about this would be Power-specific, since it should after all use the same virtio drivers and host code.
I have seen other similar reports, which always ended up as a host/guest mismatch of virtio handling - but never an arch-specific one yet.

[1]: https://wiki.ubuntu.com/OpenStack/CloudArchive
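
For reference, enabling a UCA pocket on a Xenial host looks roughly like this (a sketch; 'ocata' is an example, repeat per pocket to be tested):

  sudo add-apt-repository cloud-archive:ocata   # on source and target host
  sudo apt update
  sudo apt full-upgrade     # pulls the newer qemu/libvirt from the UCA
  # then shut down and re-start the guest on the source host and migrate again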

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also, is this something that was triggered once, or is it reproducible every time you run it?
If it only happened once, it might be very hard to track down.

I checked, and as part of the qemu verification I do run & migrate older guests on newer hypervisors.
I do not yet do the reverse in a lot of tests, so even if this issue ends up non-reproducible I might add that to our regular test set to be executed more often.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-07-30 06:53 EDT-------
Here is the information I can give immediately:

(a) Ubuntu 16.04.4 KVM hypervisor:
kernel 4.4.0-130-generic
# virsh version
Compiled against library: libvirt 1.3.1
Using library: libvirt 1.3.1
Using API: QEMU 1.3.1
Running hypervisor: QEMU 2.5.0

(b) Ubuntu 18.04 VM:
kernel 4.15.0-23-generic

I will now do an upgrade (to 18.04.1?). But from what I can read about 18.04.1, it does not include a new HWE kernel. Anyway, I will test it.

Other KVM guests (Ubuntu 16.04.x, RHEL, SLES):
I have to check. Most of them are customer systems, so I don't have a login for all of them, but I think I can get access .. it just needs some time.

Regarding the platform question:
18.04 guest on 16.04.4 Hypervisor on x86_64: live migration works
18.04 guest on 16.04.4 Hypervisor on ppc64le: live migration fails

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-30 07:10 EDT-------
Regarding reproducibility:

Yes, it is absolutely reproducible.

In the meantime I updated the 18.04 VM to 18.04.1.
Live migration still fails.

root@pkvm2:~# virsh migrate --persistent --live p8lnxtst4 qemu+ssh://pkvm1/system
error: internal error: early end of file from monitor, possible problem: 2018-07-30T11:06:47.622840Z qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x8402 inconsistent with Host index 0x19f: delta 0x8263
2018-07-30T11:06:47.622897Z qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:01.0/virtio-net'
2018-07-30T11:06:47.623487Z qemu-system-ppc64: load of migration failed: Operation not permitted

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, reading that your guests are customer systems, I assume you'd need to set up a test system somewhere to confirm whether different host qemu/libvirt/kernel versions would fix it.

The list of interesting checks:
16.04 Host as-is + 18.04 Guest + qemu from UCA Ocata
16.04 Host as-is + 18.04 Guest + qemu from UCA Pike
16.04 Host as-is + 18.04 Guest + qemu from UCA Queens
16.04 Host as-is + 16.04 Guest running the HWE 4.15 Kernel + Qemu as-is in 16.04

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-30 08:24 EDT-------
# add-apt-repository cloud-archive:ocata
cloud-archive for Ocata only supported on xenial

# add-apt-repository cloud-archive:pike
cloud-archive for Pike only supported on xenial

# add-apt-repository cloud-archive:queens
cloud-archive for Queens only supported on xenial

In other words: I cannot use those UCAs on 18.04.

I know migration was successful with 16.04, but I do not know the kernel used.
So I will now do the test with 16.04 again, with different kernels if possible.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I did not suggest installing the Cloud Archive code in the 18.04 guest.
Your host is 16.04 (Xenial), and that is where the different UCAs should be tried, to bring newer qemu/libvirt code to your host and thereby check whether these newer versions already contain the fix we are looking for, instead of debugging from scratch.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-30 09:35 EDT-------
The host (KVM hypervisor) is 16.04.
So you suggest installing a newer UCA qemu on the hypervisors.
That is something I have to decline today. Those hosts are running more VMs, so to update the hypervisors I need a service window agreed with my customers. And I have to stay on a supported mainstream level, no experimental stuff.
Until then I can do some tests with some guest VMs.
Or someone else who has a test environment could also experiment with the hypervisors. As I mentioned: the problem is reproducible!

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-30 10:24 EDT-------

So now I did another test with a 16.04 guest. The problem gets worse, but maybe it helps in catching the bug.

I did a new installation of a VM with Ubuntu 16.04.5 LTS, kernel 4.4.0-131-generic #157-Ubuntu SMP.
Live migration succeeded.
Then I installed linux-generic-hwe-16.04.
The system booted with kernel 4.15.0-29-generic #31~16.04.1-Ubuntu SMP.
And live migration failed:
# virsh migrate --persistent --live p8lnxtst1 qemu+ssh://pkvm1/system
error: internal error: early end of file from monitor, possible problem: 2018-07-30T14:13:34.381447Z qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x302 inconsistent with Host index 0x16c: delta 0x196
2018-07-30T14:13:34.381496Z qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:01.0/virtio-net'
2018-07-30T14:13:34.381806Z qemu-system-ppc64: load of migration failed: Operation not permitted

It is still very reproducible!

This means the new HWE kernel introduced the problem! Or it is just not compatible with the 4.4.0-130-generic kernel of the KVM hypervisor.
BTW, there is no entry in the /var/log/libvirt/qemu log files regarding the migration attempts. Any other log or trace files I could look for or activate?
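
For reference, more verbose libvirt logging can typically be enabled on the hosts like this (a sketch based on libvirt's standard configuration knobs, not taken from this report):

  # in /etc/libvirt/libvirtd.conf on both hypervisors:
  log_level = 1
  log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
  # then restart the daemon (service name as on 16.04):
  sudo systemctl restart libvirt-bin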

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks, that makes sense.
So it really seems to be the newer virtio drivers in the guest that trigger it - which is why the move to the HWE kernel triggers it as well.

For test systems, I tried to grab a P8, but it failed to install three times in a row. Not sure what is broken atm, so at least for a while I'll rely on you doing the tests.
I totally understand that it is hard (and not recommended by me) to do all the tests with the customer VMs.

If you can still do the tests but have only a short time, just go as far as you can, which means straight to Queens, skipping the others in between. From there we can still decide whether it is helpful to check the interim versions.

I'll continue trying to get a P8 up with this - maybe I can try one of the new P9 systems as well tomorrow.
Just to check - is the Power system you have a P8/P9/P? ?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I got one P8 machine working, but no second one that I could reach from there to test the migration.
My usual tricks around that with LXD containers didn't work atm, so I rely on people with more P8 HW.

I asked a few people to ping the P8 devs to take a look as well.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-01 04:48 EDT-------
My systems are P8 (8247-22L).

No problem to do some more tests with KVM guests (working Mon, Wed and Fri).

We also plan to update the KVM hosts to 18.04.1, but have no fixed date for that.
If there are incompatibilities between kernel 4.4 and 4.15, might I risk that I can no longer migrate 16.04 guests? Did anyone test this case?
The other downside of the upgrade would be that I could no longer help with tests on 16.04 hypervisors.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: [Bug 1783140] Comment bridged from LTC Bugzilla

> If there are incompatibilities between kernel 4.4 and 4.15, might I
> risk that I can no longer migrate 16.04 guests? Did anyone test this
> case?
>

This way around (old guest / new host) I cover migration tests before any
qemu/libvirt upload, testing x86/s390/ppc64el (and a tiny bit of arm64).

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-08-01 05:11 EDT-------
Finally I could have a look at all the other VMs that migrate successfully:
RHEL 7U1, kernel 3.10.0-229.20.1.ael7b.ppc64le
RHEL 7U3, kernel 3.10.0-514.6.1.el7.ppc64le
RHEL 7U3, kernel 3.10.0-862.3.2.el7.ppc64
SLES 12SP1, kernel 3.12.49-11-default
SLES 12SP2, kernel 4.4.21-69-default
Ubuntu 16.04.1 LTS, kernel 4.4.0-31-generic

So Ubuntu 18.04 was the first with a kernel really newer than that of the hypervisor (the SLES 12SP2 kernel is only slightly newer, but still a 4.4).

I can also do some tests with RHEL 7U5, SLES 12SP3, SLES 15 and Ubuntu 16.10.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-01 05:30 EDT-------
I just had a new VM with SLES 15, kernel 4.12.14-23-default.
Migration succeeded!
I can also do a test with Ubuntu 16.10 after lunch.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-01 07:47 EDT-------
16.10 test:
I had seen 16.10 in our Foreman, so I thought I could do that test quickly.
But it looks like the 16.10 mirrors are already down because 16.10 is out of service :-(
That means those tests (16.10, 17.04, 17.10) would take more time; I would have to download the ISO files and do manual installations from them.

Our plan for the hypervisors is to upgrade them to 18.04.1 on September 5.
Until then I could do some guest tests if they help finding the problem. And in case a fix becomes available, I could verify it.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Yes, 16.10, 17.04, 17.10 releases are end of life, but they should still be accessible via http://old-releases.ubuntu.com/ubuntu (16.10, 17.04) or regular archive mirrors (17.10, but it will at some point move to old-releases).

So you could deploy 16.04 and dist-upgrade via old-releases.ubuntu.com to yakkety and zesty, and via a regular mirror to artful. Obviously for testing / bisecting purposes only; a sketch follows below.
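
A rough sketch of pointing apt at old-releases for such a dist-upgrade (release names are examples; note that ppc64el installs normally use ports.ubuntu.com, which would need a matching substitution):

  sudo sed -i 's|archive.ubuntu.com/ubuntu|old-releases.ubuntu.com/ubuntu|g' /etc/apt/sources.list
  sudo sed -i 's/xenial/yakkety/g' /etc/apt/sources.list
  sudo apt-get update && sudo apt-get dist-upgrade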

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-06 07:08 EDT-------
Today I had my final test with an Ubuntu 17.10 guest installed from
http://old-releases.ubuntu.com/releases/17.10/ubuntu-17.10-server-ppc64el.iso
After installation it had kernel 4.13.0-16-generic and live migration was successful.
After an update+upgrade it had kernel 4.13.0-46-generic and live migration was still successful.
So the live migration problem on ppc was introduced between kernel 4.13.0-46-generic (Ubuntu 17.10) and kernel 4.15.0-23-generic (Ubuntu 18.04).
Let me know if there is anything else I can do to help solve this issue.

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: New → Incomplete
Changed in qemu (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in ubuntu-power-systems:
status: Incomplete → Triaged
Changed in ubuntu-power-systems:
assignee: Canonical Server Team (canonical-server) → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
status: New → Triaged
tags: added: kernel-da-key
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-21 07:22 EDT-------
@Canonical, can this LP be closed? I don't see any additional activities here.
Many thanks in advance.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

The last comments synced to Launchpad indicated that migration of 18.04 guests using a 16.04 hypervisor is still not working.

And it would be nice, if it did.

So to me it looks like the bug is still present, even if you want to close this ticket on your side.

PS: the last comment we had synced prior to today's was from xxdold on the 6th of August, just in case any information was lost in between.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-21 09:46 EDT-------
@xnox, I will leave this ticket open until a final solution is available.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-22 04:36 EDT-------
Yes, the problem is not solved.
Most of the actions have been tests on my side. So I could finally test with SLES 15 and Ubuntu 17.10 guests, both migrating successfully. But 18.04 does not migrate. This means there was a change between kernel 4.13.0-46 and 4.15.0-23 which introduced the problem on ppc but not on x86 (nobody ever tested z).

If someone wants to provide a fix, I can test it in our environment, but only before August 5.
On August 5 we will upgrade our hypervisors to 18.04, and then we can hopefully migrate all guests again ... at least until the next newer guest that makes problems.

So I would appreciate it if someone could catch that problem before we run into it again.
Maybe someone also wants to check whether the problem exists on z as well.
And maybe it's also worth thinking about extending the test suites to cover such cases with guests newer than the hypervisors (if you think it's a valid scenario in the field).

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-22 04:41 EDT-------
sorry "August 5" means September 5 ;-)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reproduced:
2018-09-06T08:25:22.816912Z qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x8101 inconsistent with Host index 0xfd: delta 0x8004
2018-09-06T08:25:22.816963Z qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@800000020000000:01.0/virtio-net'
2018-09-06T08:25:22.817478Z qemu-system-ppc64: load of migration failed: Operation not permitted

I need to break this out of the test automation and then set up different kernels in the Bionic guest to find which mismatch it is.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I can also confirm that the same case driven on x86 is not affected.

I'd be glad to pull in PPC developers in case they know of any virtio bugfix that was done in qemu and/or the kernel that might be related.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

4.15.0-33-generic failed;
4.4.0-134-generic we know works.
I tested 4.13.0-46-generic and it worked (as expected after the positive 17.10 report in comment #23).

Checking different builds from there.
Bionic builds (so even though 4.13.0-32.35 is "before" the tested -46, it is the Bionic build of it):
4.13.0-32.35 - the last 4.13 - working
4.14.0-16.19 - the last 4.14 - failing
4.15.0-9.10 - early 4.15 - failing

Note that the good as well as the bad cases used the same guest (Bionic),
just installing different kernels in that guest, rebooting into these kernels, and then triggering the migration.

4.14.0-11.13 - early 4.14 - failing

So it seems 4.13 -> 4.14 is the breaking point; I'll check if I can get a git bisect of this range working ...

Changed in ubuntu-power-systems:
status: Triaged → In Progress
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Expected closest good/bad as git builds:
Last 4.13 git tag: Ubuntu-4.13.0-32.35 - working
First 4.14 git tag: Ubuntu-4.14.0-11.13 - failing

$ git bisect start
$ git bisect good Ubuntu-4.13.0-32.35
$ git bisect bad Ubuntu-4.14.0-11.13
Build is slow, but it seems I can bisect from here ...

Argh - shortly after, this pulled me out of the Ubuntu tree to check the baseline 4.13.0, and that got me a non-working kernel.
After discussion with the kernel team, I'm switching to bisecting mainline trees.
Let's see what surprises this has in store for me ...
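
A sketch of the per-step cycle for such a mainline bisect (the deb-pkg build target is one common option; install and test steps abbreviated):

  $ git bisect start
  $ git bisect good v4.13
  $ git bisect bad v4.14
  $ # at each step: build the kernel at the commit git checked out, e.g.
  $ make -j$(nproc) bindeb-pkg
  $ # install it in the Bionic guest, reboot into it, try the migration,
  $ # then mark the result and let git pick the next commit:
  $ git bisect good    # if the migration succeeded
  $ git bisect bad     # if the migration failed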

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Note to myself:
Combine the mainline kernel: git://kernel.ubuntu.com/ubuntu/linux.git
and the Ubuntu kernel: git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git
in one repo.
Get git://kernel.ubuntu.com/ubuntu/kteam-tools.git
Set up the needed chroots (amd64 + the arch you need):
sudo ./make_chroot bionic amd64 http://archive.ubuntu.com/ubuntu

(Then many cleanups depending on build env and needs :-/)
Call like:
../kteam-tools/mainline-build/mainline-build-one v4.13 bionic
../kteam-tools/mainline-build/mainline-build-one v4.14 bionic

That works with quite some extra steps, but since the changes are not in the Ubuntu delta I ended up switching to building upstream kernels (with the Ubuntu kernel config).
(... needs more time ...)

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-09-07 08:38 EDT-------
I have some more bad surprises:
On Monday I upgraded my x86 KVM hypervisors from 16.04.5 to 18.04.1.
No problems at all.

On Wednesday I upgraded the ppc KVM hypervisors from 16.04.5 to 18.04.1.

Problem 1:
In the middle of the upgrade process I could not live migrate the guests from the 16.04 hypervisor to the 18.04 hypervisor.
None of the 13 guests!

root@pkvm2:~# virsh migrate --persistent --live pkut04 qemu+ssh://pkvm1/system
error: internal error: process exited while connecting to monitor: 2018-09-05T11:07:58.260851Z qemu-system-ppc64: warning: CPU(s) not present in any NUMA nodes: CPU 1 [core-id: 1]
2018-09-05T11:07:58.260859Z qemu-system-ppc64: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-09-05T11:07:58.262038Z qemu-system-ppc64: This machine version does not support CPU hotplug

So I had to shut down all the guests to do the upgrade of the second hypervisor!

Problem 2:
When the second hypervisor was on 18.04.1 I could not start most of the guests.
Only 4 of 13 guests started.
(a) some qcow2 disks had been marked as shareable;
that worked on Ubuntu 16.04, but not on 18.04
(b) vcpu definition:
on Ubuntu 16.04, <vcpu placement='static' current='8'>160</vcpu> worked;
on Ubuntu 18.04 this does not work on ppc ("This machine version does not support CPU hotplug").
I had to change it to <vcpu placement='static'>8</vcpu>

I could resolve 2a and 2b. But it is frustrating to get such additional adventure games in the maintenance window.
You think "just start the guests, then I can go home", and then the guests do not start.
And it is even more frustrating when you did just the same task 2 days ago on x86 without any problems.

Maybe Problem 1 has the same cause as Problem 2. In other words: with the changed domain XML, maybe a live migration from an Ubuntu 16.04 hypervisor to an 18.04 hypervisor would work.
But I cannot verify this assumption. Now both my hypervisors are finally on 18.04.1 and live migration between them works for all guests.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

2a) is an upstream decision to have better disk integrity guarantees.
There is a release notes entry since 17.10 (which is also referenced from the 18.04 release notes):
=> https://wiki.ubuntu.com/ArtfulAardvark/ReleaseNotes#qemu_2.10

2b) is, I think, actually an upstream change by the PPC devs.
I'll at least put such a case into my own testing to find it earlier next time and be able to work around it if needed.

@JFH - if you think 2b is a big issue, you might file a bug and back-sync it to the IBM PPC devs for their opinion on it.

I'll stick to the initially reported issue here and try to track it down over the next few days as my ppc64 machine time permits (the bisect is slow anyway).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

TL;DR:
- a KVM guest with the kernel change as identified above
- works on Bionic host (kernel 4.15 / qemu 2.11 / libvirt 4.0)
- migrating on a Xenial host (kernel 4.4 / qemu 2.5 / libvirt 1.3.1) fails
  VQ 0 size 0x100 Guest index 0x8101 inconsistent with Host index 0x81: delta 0x8080
  error while loading state for instance 0x0 of device 'pci@800000020000000:01.0/virtio-net'
- not fixed in latest 4.19 kernel
- only failing on ppc64el (not x86) - maybe high/low word related
- qemu bisecting found a high/low word related virtio issue and a fix in the 2.6 stable series that resolves it

Note: the generated names are odd (the hashes are ok); most 4.13 entries here are actually 4.14 in development.

GOOD v4.13 Mon Sep 10 10:03:38
BAD v4.14 Mon Sep 10 10:51:31
Step-1: 15d8ffc9 #1 Mon Sep 10 12:36:30 bad
Step-2: bafb0762 #2 Mon Sep 10 13:04:52 good
Step-3: b63f6044 #3 Mon Sep 10 13:24:27 bad
Step-4: e08af95d #4 Mon Sep 10 13:44:11 bad
Step-5: 2a493216 #5 Mon Sep 10 14:25:50 bad
Step-6: a248878d #6 Mon Sep 10 14:50:47 bad
Step-7: 160e22aa #7 Mon Sep 10 15:09:03 good
Step-8: 727f8914 #8 Mon Sep 10 18:30:06 good
Step-9: 4a3c67a6 #9 Mon Sep 10 20:37:37 bad
Step-10: 04584957 #10 Tue Sep 11 04:35:41 bad
Step-11: f7ce9103 #11 Tue Sep 11 05:30:50 bad
Step-12: 192f68cf #12 Tue Sep 11 05:49:50 good
Step-13: 3f93522f #13 Tue Sep 11 06:13:01 bad
Step-14: 4941d472 #14 Tue Sep 11 06:40:05 good

Offending change identified as:
commit 3f93522ffab2d46a36b57adf324a54e674fc9536
Author: Jason Wang <email address hidden>
Date: Wed Jul 19 16:54:49 2017 +0800

    virtio-net: switch off offloads on demand if possible on XDP set

    Current XDP implementation wants guest offloads feature to be disabled
    on device. This is inconvenient and means guest can't benefit from
    offloads if XDP is not used. This patch tries to address this
    limitation by disabling the offloads on demand through control guest
    offloads. Guest offloads will be disabled and enabled on demand on XDP
    set.

    Signed-off-by: Jason Wang <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

To check if any commit in the latest kernel fixed the issue:
4.19-rc3 as of today (11da3a7f): bad
=> Not fixed yet by a guest kernel commit.
=> Also, I don't see how we could fix that on the kernel side, despite the issue being introduced there.
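
A quick way to check whether a given kernel tag already contains the offending commit (a sketch using plain git; the hash is the one identified above):

  $ git merge-base --is-ancestor \
      3f93522ffab2d46a36b57adf324a54e674fc9536 v4.19-rc3 && echo present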

Since we had the report that a Bionic host would be OK, I bumped the test env up one component at a time
(in order):
Libvirt 1.3.1 -> 4.0: still bad
kernel 4.4 -> 4.15: still bad
qemu 2.5 -> 2.11: working

So it seems we are actually looking for a qemu fix for a kernel-introduced issue.
Via UCA we can access some rather easily.
qemu 2.5 (X) bad
qemu 2.6.1 (Y) good
qemu 2.8 (Z) good
qemu 2.10 (A) good
qemu 2.11 (B) good

So a qemu bisect for 2.5->2.6 it shall be :-/
Back then this was still based on full Debian versions, so no bisecting directly in the packaging repo on these old versions.
Using checkinstall and the configure line of the qemu Yakkety version (resetting the machine type to the upstream type and linking spapr-rtas.bin [qemu-slof] and others to the expected place).
ln -s /usr/share/sl...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

PPA prepared at: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3412
MP: https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/354695

Test of the case on the PPA is successful.
But I'll need a regression check on that before going on.

Changed in qemu (Ubuntu):
status: Incomplete → Triaged
description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Prep complete:
- Regression tests from the PPA
- Tests of the bug being fixed on the PPA
- MP review
- SRU Template

Uploading for consideration by the SRU Team

Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello bugproxy, or anyone else affected,

Accepted qemu into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.32 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu (Ubuntu Xenial):
status: New → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in qemu (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-10-05 03:46 EDT-------
Hello Robie,

Because I migrated both my KVM hypervisors to 18.04, I cannot test it anymore.
But Christian Ehrhardt could reproduce the problem. Hopefully he still has the appropriate test environment.

Regards, Andreas (bugproxy)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Tested as-is (to confirm we hit the bug):

1.0.0 (12:53:43): MIGRATE: in-release migrations
  1.1.0 (12:53:43): Clean testbeds
    1.1.1 (12:53:43): stop containers
    1.1.2 (12:53:43): orig: restore containers from snapshot: xenial
    1.1.3 (12:53:43): Restore testkvm-xenial-from
    1.1.4 (12:53:44): Restore testkvm-xenial-to
    1.1.5 (12:53:45): Restore testkvm-xenial-tononshared
    1.1.6 (12:53:45): wait until containers are fully started
  1.2.0 (12:53:56): unshare non shared container
    1.2.1 (12:54:00): Version info after restore
    1.2.2 (12:54:00): Version at testkvm-xenial-from: - qemu: 1:2.5+dfsg-5ubuntu10.31 libvirt: 1.3.1-1ubuntu10.24
    1.2.3 (12:54:00): Bios versions at testkvm-xenial-from: - ipxe: 1.0.0+git-20150424.a25a16d-1ubuntu1.2 slof: 20151103+dfsg-1ubuntu1.1 efi: not-installed
    1.2.4 (12:54:01): Version at testkvm-xenial-to: - qemu: 1:2.5+dfsg-5ubuntu10.31 libvirt: 1.3.1-1ubuntu10.24
    1.2.5 (12:54:01): Bios versions at testkvm-xenial-to: - ipxe: 1.0.0+git-20150424.a25a16d-1ubuntu1.2 slof: 20151103+dfsg-1ubuntu1.1 efi: not-installed
    1.2.6 (12:54:01): Version at testkvm-xenial-tononshared: - qemu: 1:2.5+dfsg-5ubuntu10.31 libvirt: 1.3.1-1ubuntu10.24
    1.2.7 (12:54:01): Bios versions at testkvm-xenial-tononshared: - ipxe: 1.0.0+git-20150424.a25a16d-1ubuntu1.2 slof: 20151103+dfsg-1ubuntu1.1 efi: not-installed
    1.2.8 (12:54:12): Ensure old migration guests of any release are removed
    1.2.9 (12:54:12): Remove all test guests of release trusty
    1.2.10 (12:54:26): Remove all test guests of release xenial
    1.2.11 (12:54:40): Remove all test guests of release bionic
    1.2.12 (12:54:54): Remove all test guests of release cosmic
    1.2.13 (12:55:07): Prep xenial guest creation on testkvm-xenial-from
    1.2.14 (12:55:07): spawn migration guests
    1.2.15 (13:00:59): Test machine type uniqueness within xenial => Pass
    1.2.16 (13:00:59): Check for expected machine type to be set => Pass

2.0.0 (13:01:00): Test migrations within xenial - round 1/5
  2.1.0 (13:01:00): Test live migration (extra option '') of a xenial guest testkvm-xenial-from/testkvm-xenial-to
    2.1.1 (13:01:00): live migration (extra option '') testkvm-xenial-from -> testkvm-xenial-to => Failed detail=live migration failed

---

Then running the same after upgrading to proposed (actually all of proposed, so I hope nothing else in there breaks us now - we have tested that in advance and it was good).

1.0.0 (13:37:26): MIGRATE: in-release migrations
  1.1.0 (13:37:26): Clean testbeds
    1.1.1 (13:37:26): stop containers
    1.1.2 (13:37:26): orig: restore containers from snapshot: xenial
    1.1.3 (13:37:26): Restore testkvm-xenial-from
    1.1.4 (13:37:27): Restore testkvm-xenial-to
    1.1.5 (13:37:27): Restore testkvm-xenial-noupd
    1.1.6 (13:37:28): Restore testkvm-xenial-tononshared
    1.1.7 (13:37:28): wait until containers are fully started
  1.2.0 (13:37:54): unshare non shared container
    1.2.1 (13:37:58): Version info after restore
    1.2.2 (13:37:58): Version at testkvm-xenial-from: - qemu: 1:2.5+dfsg-5ubuntu10.32 libvirt: 1.3.1-1ubuntu10.24
    1.2.3 (13:37:58): Bios versions at testkvm-xenial-from: - ipxe: 1.0....


tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Revision history for this message
Robie Basak (racb) wrote :

I see the following autopkgtest failures:

Regression in autopkgtest for open-iscsi (amd64): test log
Regression in autopkgtest for ubuntu-image (amd64): test log

However, looking at the history, the particular failing tests seem to be flaky and unrelated to this SRU, and all the others continue to pass.

Revision history for this message
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.32

---------------
qemu (1:2.5+dfsg-5ubuntu10.32) xenial; urgency=medium

  * fix migration of new guests on ppc64el (LP: #1783140)
    Fixed by backporting two patches from the 2.6.x stable branch
    - d/p/ubuntu/lp-1783140-virtio-set-low-features-early-on-load.patch
    - d/p/ubuntu/lp-1783140-Revert-virtio-net-unbreak-self-announcement.patch

 -- Christian Ehrhardt <email address hidden> Tue, 11 Sep 2018 15:00:19 +0200

Changed in qemu (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yes, both are known to be flaky - thanks!

Manoj Iyer (manjo)
Changed in linux (Ubuntu):
status: Triaged → Invalid
Changed in ubuntu-power-systems:
status: In Progress → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-10-17 04:03 EDT-------
IBM Bugzilla status -> closed, fix released by Canonical

Brad Figg (brad-figg)
tags: added: cscc