Comment 32 for bug 1783140

Christian Ehrhardt  (paelzer) wrote :

TL;DR:
- a KVM guest with the kernel change as identified above
- works on Bionic host (kernel 4.15 / qemu 2.11 / libvirt 4.0)
- migrating on a Xenial host (kernel 4.4 / qemu 2.5 / libvirt 1.3.1) fails
  VQ 0 size 0x100 Guest index 0x8101 inconsistent with Host index 0x81: delta 0x8080
  error while loading state for instance 0x0 of device 'pci@800000020000000:01.0/virtio-net'
- not fixed in latest 4.19 kernel
- only failing on ppc64el (not x86) - maybe high/low word related
- qemu bisecting found a high/low word related virtio issue and fix in the 2.6 stable series
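
For reference, the numbers in that error message can be reproduced with a bit of arithmetic. This is only a sketch of the check as I understand it, not qemu's code: it assumes the ring indices are free-running 16-bit counters (so the difference wraps modulo 2^16) and that the load fails when the delta exceeds the queue size. The function names are made up for illustration.

```python
# Sketch of the consistency check behind
# "Guest index ... inconsistent with Host index ...: delta ...".
# Assumption: indices are 16-bit counters, so the subtraction wraps at 2**16.

def avail_delta(guest_idx: int, host_idx: int) -> int:
    """Entries the guest has made available that the host has not consumed."""
    return (guest_idx - host_idx) & 0xFFFF


def is_consistent(guest_idx: int, host_idx: int, vq_size: int) -> bool:
    """Assumed rule: the delta can never exceed the ring size."""
    return avail_delta(guest_idx, host_idx) <= vq_size


# Values taken from the error message above.
vq_size = 0x100
guest_idx = 0x8101
host_idx = 0x81

delta = avail_delta(guest_idx, host_idx)
print(f"delta {delta:#x}, consistent: {is_consistent(guest_idx, host_idx, vq_size)}")
```

With a queue of 0x100 entries, a delta of 0x8080 cannot be a valid number of pending buffers, which is why the incoming state is rejected.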

Note: generated names are odd (hashes are ok), most 4.13 here are actually 4.14 in development.

GOOD v4.13 Mon Sep 10 10:03:38
BAD v4.14 Mon Sep 10 10:51:31
Step-1: 15d8ffc9 #1 Mon Sep 10 12:36:30 bad
Step-2: bafb0762 #2 Mon Sep 10 13:04:52 good
Step-3: b63f6044 #3 Mon Sep 10 13:24:27 bad
Step-4: e08af95d #4 Mon Sep 10 13:44:11 bad
Step-5: 2a493216 #5 Mon Sep 10 14:25:50 bad
Step-6: a248878d #6 Mon Sep 10 14:50:47 bad
Step-7: 160e22aa #7 Mon Sep 10 15:09:03 good
Step-8: 727f8914 #8 Mon Sep 10 18:30:06 good
Step-9: 4a3c67a6 #9 Mon Sep 10 20:37:37 bad
Step-10: 04584957 #10 Tue Sep 11 04:35:41 bad
Step-11: f7ce9103 #11 Tue Sep 11 05:30:50 bad
Step-12: 192f68cf #12 Tue Sep 11 05:49:50 good
Step-13: 3f93522f #13 Tue Sep 11 06:13:01 bad
Step-14: 4941d472 #14 Tue Sep 11 06:40:05 good
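
To read the culprit off a log like this: once a bisect has converged, the first bad commit is the last one that was marked bad. A small sketch that does this for the steps above; the regular expression and the helper names are my own assumptions about this particular log format, not part of any tool.

```python
import re

# Bisect steps copied from the log above.
BISECT_LOG = """\
Step-1: 15d8ffc9 #1 Mon Sep 10 12:36:30 bad
Step-2: bafb0762 #2 Mon Sep 10 13:04:52 good
Step-3: b63f6044 #3 Mon Sep 10 13:24:27 bad
Step-4: e08af95d #4 Mon Sep 10 13:44:11 bad
Step-5: 2a493216 #5 Mon Sep 10 14:25:50 bad
Step-6: a248878d #6 Mon Sep 10 14:50:47 bad
Step-7: 160e22aa #7 Mon Sep 10 15:09:03 good
Step-8: 727f8914 #8 Mon Sep 10 18:30:06 good
Step-9: 4a3c67a6 #9 Mon Sep 10 20:37:37 bad
Step-10: 04584957 #10 Tue Sep 11 04:35:41 bad
Step-11: f7ce9103 #11 Tue Sep 11 05:30:50 bad
Step-12: 192f68cf #12 Tue Sep 11 05:49:50 good
Step-13: 3f93522f #13 Tue Sep 11 06:13:01 bad
Step-14: 4941d472 #14 Tue Sep 11 06:40:05 good
"""

# Assumed line format: "Step-N: <short hash> #N <timestamp> <good|bad>"
STEP_RE = re.compile(r"^Step-(\d+): ([0-9a-f]+) #\d+ .* (good|bad)$")


def parse_steps(log: str):
    """Return (step number, short hash, verdict) for every step line."""
    steps = []
    for line in log.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append((int(match.group(1)), match.group(2), match.group(3)))
    return steps


def first_bad(steps):
    """In a converged bisect the last commit marked bad is the first bad one."""
    bad = [commit for _, commit, verdict in steps if verdict == "bad"]
    return bad[-1] if bad else None


steps = parse_steps(BISECT_LOG)
print(f"{len(steps)} steps, first bad commit: {first_bad(steps)}")
```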

Offending change identified as:
commit 3f93522ffab2d46a36b57adf324a54e674fc9536
Author: Jason Wang <email address hidden>
Date: Wed Jul 19 16:54:49 2017 +0800

    virtio-net: switch off offloads on demand if possible on XDP set

    Current XDP implementation wants guest offloads feature to be disabled
    on device. This is inconvenient and means guest can't benefit from
    offloads if XDP is not used. This patch tries to address this
    limitation by disabling the offloads on demand through control guest
    offloads. Guest offloads will be disabled and enabled on demand on XDP
    set.

    Signed-off-by: Jason Wang <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

To check if any commit in the latest kernel fixed the issue:
4.19-rc3 as of today (11da3a7f): bad
=> Not fixed yet as a guest kernel commit.
=> Also I don't see how we could fix that any further on the kernel side, despite the issue being introduced there

Since we had the report that a Bionic Host would be ok I bumped the test env up one by one.
(in order)
Libvirt 1.3.1 -> 4.0: still bad
kernel 4.4 -> 4.15: still bad
qemu 2.5 -> 2.11: working

So we are actually looking for a qemu fix for a kernel introduced issue it seems.
Via UCA we can access some rather easily.
qemu 2.5 (X) bad
qemu 2.6.1 (Y) good
qemu 2.8 (Z) good
qemu 2.10 (A) good
qemu 2.11 (B) good

So a qemu bisect for 2.5->2.6 it shall be :-/
Back then this was still based on full debian versions so no bisect directly in the packaging repo on these old versions.
So I built with checkinstall and the configure line of the Yakkety qemu package (machine type reset to the upstream type, and spapr-rtas.bin [qemu-slof] and others linked to the expected place):
ln -s /usr/share/slof/* /usr/share/qemu/slof.bin
ln -s /usr/share/seabios/* /usr/share/qemu/
ln -s /usr/lib/ipxe/qemu/* /usr/share/qemu/

But I realized that 2.6.0 was affected as well.
Maybe the fix was part of 2.6.1?
I checked the last 2.6.0 publish we had back in Yakkety and it failed as well.

So after all the bisect might be much smaller between 2.6.0 and its upstream stable branch.
Verified start points with builds from git.
qemu-2.6.0-bisect-start: old behavior (bad)
qemu-2.6.2-bisect-start: new behavior (good)

Eventually came down to:
git bisect start
# new: [529d45e151d82a772cd9b9af64bb25f88fba6567] Update version for 2.6.2 release
git bisect new 529d45e151d82a772cd9b9af64bb25f88fba6567
# old: [bfc766d38e1fae5767d43845c15c79ac8fa6d6af] Update version for v2.6.0 release
git bisect old bfc766d38e1fae5767d43845c15c79ac8fa6d6af
# new: [ec211e742683d4bc187839b01a4b0056617681a1] atapi: fix halted DMA reset
git bisect new ec211e742683d4bc187839b01a4b0056617681a1
# old: [71798fda8b6ef8df47c7640ba0bc24d7060ad307] vmsvga: shadow fifo registers
git bisect old 71798fda8b6ef8df47c7640ba0bc24d7060ad307
# old: [909d87d347a7a5e08c32cbdb67bb2927fcefbf34] virtio: set low features early on load
git bisect old 909d87d347a7a5e08c32cbdb67bb2927fcefbf34
# new: [28eae0af65dcae887d3cd32212c702ee708c84be] Fix some typos found by codespell
git bisect new 28eae0af65dcae887d3cd32212c702ee708c84be
# new: [704ab2fce49fa404a61c6dac85003bcc1e3d0192] blockdev: Fix regression with the default naming of throttling groups
git bisect new 704ab2fce49fa404a61c6dac85003bcc1e3d0192
# new: [025c4e39f479eb498ee63b634d961a4cf357773e] s390x/ipl: fix reboots for migration from different bios
git bisect new 025c4e39f479eb498ee63b634d961a4cf357773e
# new: [82c85167791f0057752c2084f8480bf19401f314] Revert "virtio-net: unbreak self announcement and guest offloads after migration"
git bisect new 82c85167791f0057752c2084f8480bf19401f314
# first new commit: [82c85167791f0057752c2084f8480bf19401f314] Revert "virtio-net: unbreak self announcement and guest offloads after migration"

And the fixing qemu change being:
82c85167791f0057752c2084f8480bf19401f314 is the first new commit
commit 82c85167791f0057752c2084f8480bf19401f314
Author: Michael S. Tsirkin <email address hidden>
Date: Mon Jul 4 14:47:37 2016 +0300

    Revert "virtio-net: unbreak self announcement and guest offloads after migration"

    This reverts commit 1f8828ef573c83365b4a87a776daf8bcef1caa21.

    Cc: <email address hidden>
    Reported-by: Robin Geuze <email address hidden>
    Tested-by: Robin Geuze <email address hidden>
    Signed-off-by: Michael S. Tsirkin <email address hidden>
    (cherry picked from commit 6c6668232e71b7cf7ff39fa1a7abf660c40f9cea)
    Signed-off-by: Michael Roth <email address hidden>

Its backport needs to be bundled with another fix to actually work (the commit before).

I'll try to backport and prep a PPA with those fixes for Xenial.