live migration intermittently fails in CI with "VQ 0 size 0x80 Guest index 0x12c inconsistent with Host index 0x134: delta 0xfff8"

Bug #1761798 reported by Matt Riedemann
This bug affects 6 people
Affects                     Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)    Expired   Medium      Unassigned
QEMU                        Expired   Undecided   Unassigned

Bug Description

Seen here:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz

2018-04-05T21:48:38.205752Z qemu-system-x86_64: -chardev pty,id=charserial0,logfile=/dev/fdset/1,logappend=on: char device redirected to /dev/pts/0 (label charserial0)
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
2018-04-05T21:48:43.153268Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0x12c inconsistent with Host index 0x134: delta 0xfff8
2018-04-05T21:48:43.153288Z qemu-system-x86_64: Failed to load virtio-blk:virtio
2018-04-05T21:48:43.153292Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-blk'
2018-04-05T21:48:43.153347Z qemu-system-x86_64: load of migration failed: Operation not permitted
2018-04-05 21:48:43.198+0000: shutting down, reason=crashed

And in the n-cpu logs on the other host:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/screen-n-cpu.txt.gz#_Apr_05_21_48_43_257541

There is a related Red Hat bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1450524

The CI jobs that are failing are at present using the Pike UCA:

ii libvirt-bin 3.6.0-1ubuntu6.2~cloud0

ii qemu-system-x86 1:2.10+dfsg-0ubuntu3.5~cloud0

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe when we move to use the Queens UCA we'll have better luck with this:

https://review.openstack.org/#/c/554314/

That has these package versions:

ii libvirt-bin 4.0.0-1ubuntu4~cloud0

ii qemu-system-x86 1:2.11+dfsg-1ubuntu2~cloud0

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

The QEMU log for the offending instance that crashed:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz

----------------------------------------------------------------------------
2018-04-05 21:48:38.136+0000: starting up libvirt version: 3.6.0, package: 1ubuntu6.2~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Wed, 07 Feb 2018 20:05:24 +0000), qemu version: 2.10.1(Debian 1:2.10+dfsg-0ubuntu3.5~cloud0), hostname: ubuntu-xenial-ovh-bhs1-0003361707
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-x86_64 -name guest=instance-00000002,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-3-instance-00000002/master-key.aes -machine pc-i440fx-artful,accel=tcg,usb=off,dump-guest-core=off -m 64 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid e70df34f-395e-4482-9c24-101f12cf635d -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=17.0.0,serial=97e1e00c-8afb-4356-b55e-92e997c5a1a7,uuid=e70df34f-395e-4482-9c24-101f12cf635d,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-3-instance-00000002/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/opt/stack/data/nova/instances/e70df34f-395e-4482-9c24-101f12cf635d/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:15:ea:59,bus=pci.0,addr=0x3 -add-fd set=1,fd=32 -chardev pty,id=charserial0,logfile=/dev/fdset/1,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
2018-04-05T21:48:38.205752Z qemu-system-x86_64: -chardev pty,id=charserial0,logfile=/dev/fdset/1,logappend=on: char device redirected to /dev/pts/0 (label charserial0)
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
2018-04-05T21:48:43.153268Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0x12c inconsistent with Host index 0x134: delta 0xfff8
2018-04-05T21:48:43.153288Z qemu-system-x86_64: Failed to load virtio-blk:virtio
2018-04-05T21:48:43.153292Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-blk'
2018-04-05T21:48:43.153347Z qemu-system-x86_64: load of migration failed: Operation not permitted
2018-04-05 21:48:43.198+0000: shutting down, reason=crashed
----------------------------------------------------------------------------

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :
Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

I'm not sure that's the same as bz 1450524 - in this case it's virtio-blk, where I *think* that bz was tracked down to a virtio-balloon and a virtio-net issue.

Revision history for this message
melanie witt (melwitt) wrote :
Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

Ran into a similar issue: when snapshotting an instance, it won't start afterwards.
The root cause is an inconsistency between the state of the virtio-scsi device and the saved VM RAM state in /var/lib/libvirt/qemu/save/
Removing the instance-0000****.save file from this folder allows starting the VM again, but may cause data corruption.
Specifics of my case: virtio-scsi driver with discard enabled, Ceph backend. Two Windows VMs impacted so far.

Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

I also think this is a QEMU bug; I will do some tests and try to reproduce.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Source node
https://zuul.opendev.org/t/openstack/build/d50877ae15db4022b82f4bb1d1d52cea/log/logs/subnode-2/libvirt/qemu/instance-0000001a.txt

2020-11-20 14:25:24.887+0000: starting up libvirt version: 6.0.0, package: 0ubuntu8.4~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Tue, 15 Sep 2020 20:36:28 +0000), qemu version: 4.2.1Debian 1:4.2-3ubuntu6.7~cloud0, kernel: 4.15.0-124-generic, hostname: ubuntu-bionic-ovh-bhs1-0021872195

LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
HOME=/var/lib/libvirt/qemu/domain-19-instance-0000001a \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-19-instance-0000001a/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-19-instance-0000001a/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-19-instance-0000001a/.config \
QEMU_AUDIO_DRV=none \
/usr/bin/qemu-system-x86_64 \
-name guest=instance-0000001a,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-19-instance-0000001a/master-key.aes \
-machine pc-i440fx-4.2,accel=tcg,usb=off,dump-guest-core=off \
-cpu qemu64,hypervisor=on,lahf-lm=on \
-m 128 \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 2c468d92-4b19-426a-8c25-16b4624c21a4 \
-smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=22.1.0,serial=2c468d92-4b19-426a-8c25-16b4624c21a4,uuid=2c468d92-4b19-426a-8c25-16b4624c21a4,family=Virtual Machine' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=35,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-blockdev '{"driver":"file","filename":"/opt/stack/data/nova/instances/_base/61bd5e531ab4c82456aa5300ede7266b3610be79","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-blockdev '{"driver":"file","filename":"/opt/stack/data/nova/instances/2c468d92-4b19-426a-8c25-16b4624c21a4/disk","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-2-format"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on \
-netdev tap,fd=37,id=hostnet0 \
-device virtio-net-pci,host_mtu=1400,netdev=hostnet0,id=net0,mac=fa:16:3e:43:11:f4,bus=pci.0,addr=0x3 \
-add-fd set=2,fd=39 \
-chardev pty,id=charserial0,logfile=/dev/fdset/2,logappend=on \
-device isa-serial,chardev=charserial0,id=serial0 \
-vnc 127.0.0.1:0 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-incoming defer \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x6 \
-san...

Revision history for this message
sean mooney (sean-k-mooney) wrote :

In the new case we have the same qemu/libvirt version on both nodes:

4.2.1Debian 1:4.2-3ubuntu6.7~cloud0

and in this case it's failing for the virtio-blk device, not the memory balloon.

before the migration starts we see

-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/2 (label charserial0)
virtio: bogus descriptor or out of resources
2020-11-20 14:25:39.911+0000: initiating migration

on the destination
https://zuul.opendev.org/t/openstack/build/d50877ae15db4022b82f4bb1d1d52cea/log/logs/subnode-2/libvirt/qemu/instance-0000001a.txt

then on the source we see

char device redirected to /dev/pts/1 (label charserial0)
2020-11-20T14:25:40.720757Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0xb8 inconsistent with Host index 0xe0: delta 0xffd8
2020-11-20T14:25:40.720785Z qemu-system-x86_64: Failed to load virtio-blk:virtio
2020-11-20T14:25:40.720790Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-blk'
2020-11-20T14:25:40.720824Z qemu-system-x86_64: load of migration failed: Operation not permitted

when it fails
https://zuul.opendev.org/t/openstack/build/d50877ae15db4022b82f4bb1d1d52cea/log/logs/libvirt/qemu/instance-0000001a.txt

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

[Copy/pasting my comment from here: https://bugs.launchpad.net/nova/+bug/1737625/comments/4]

I just talked to Dave Gilbert from upstream QEMU. Overall, as I implied in comment #2, this gnarly issue requires specialized debugging, digging deep into the bowels of QEMU, 'virtio-blk' and 'virtio'.

That said, Dave notes that we get this "guest index inconsistent" error when the migrated RAM is inconsistent with the migrated 'virtio' device state. And a common case is where a 'virtio' device does an operation after the vCPU is stopped and after RAM has been transmitted.

Dave makes some guesswork of a potential scenario where this can occur:

  - Guest is running
  - ... live migration starts
  - ... a "block read" request gets submitted
  - ... live migration stops the vCPUs, finishes transmitting RAM
  - ... the "block read" completes, 'virtio-blk' updates pointers
  - ... live migration "serializes" the 'virtio-blk' state

So the "guest index inconsistent" state would only happen if you got unlucky with the timing of that read.

Another possibility, Dave points out, is that the guest has screwed up the device state somehow; the migration code in 'virtio' checks the state a lot. We have ruled this possibility out because the guest is just a garden-variety CirrOS instance idling; nothing special about it.
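
For reference, the "delta" in the error is computed with 16-bit wrap-around arithmetic: 0x12c - 0x134 = -8 = 0xfff8 as an unsigned 16-bit value, i.e. the device's saved index is ahead of the index found in the migrated guest RAM, and the resulting count of outstanding heads is far larger than the queue size (0x80), so the load is rejected. The same arithmetic explains the deltas reported later in this bug (0xb8 - 0xe0 = 0xffd8, 0x62 - 0xa1 = 0xffc1). Below is a minimal, self-contained sketch of that kind of load-time consistency check; it is illustrative only, not QEMU's actual code, and names such as load_check_vq and struct vq_state are made up for the example.

----------------------------------------------------------------------------
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of a virtqueue's migrated indices (not QEMU's types). */
struct vq_state {
    uint16_t vring_num;       /* queue size, e.g. 0x80                     */
    uint16_t guest_avail_idx; /* avail index read from migrated guest RAM  */
    uint16_t host_last_avail; /* device's index from the serialized state  */
};

/* Sketch of the check behind "Guest index ... inconsistent with Host
 * index ...": the number of outstanding heads must fit in the ring. */
static int load_check_vq(int i, const struct vq_state *vq)
{
    /* 16-bit wrap-around arithmetic, as on the real virtio ring. */
    uint16_t nheads = (uint16_t)(vq->guest_avail_idx - vq->host_last_avail);

    if (nheads > vq->vring_num) {
        fprintf(stderr,
                "VQ %d size 0x%x Guest index 0x%x inconsistent with "
                "Host index 0x%x: delta 0x%x\n",
                i, (unsigned)vq->vring_num, (unsigned)vq->guest_avail_idx,
                (unsigned)vq->host_last_avail, (unsigned)nheads);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Values from the original report: 0x12c - 0x134 wraps to 0xfff8,
     * far larger than the queue size 0x80, so the load fails. */
    struct vq_state vq = { 0x80, 0x12c, 0x134 };
    return load_check_vq(0, &vq) ? 1 : 0;
}
----------------------------------------------------------------------------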

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Hang on, I've just noticed this:

 char device redirected to /dev/pts/2 (label charserial0)
virtio: bogus descriptor or out of resources
2020-11-20 14:25:39.911+0000: initiating migration

I've never seen that 'bogus descriptor or out of resources' one before; the fact that it appears before the migration starts suggests that something had already gone wrong before the migration started.
Is that warning present in all these failures?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I see two recent hits we have still logs for.

The one described in comment #10

And another but there the error message is a bit different:

on the migration source host:
: https://6f0be18d925d64906a23-689ad0b9b6f06bc0c51bfb99bf86ea04.ssl.cf5.rackcdn.com/698706/4/check/nova-grenade-multinode/ee2dbea/logs/libvirt/qemu/instance-0000001b.txt

char device redirected to /dev/pts/0 (label charserial0)
virtio: zero sized buffers are not allowed
2020-11-23 22:20:54.297+0000: initiating migration

on the migration destination
https://6f0be18d925d64906a23-689ad0b9b6f06bc0c51bfb99bf86ea04.ssl.cf5.rackcdn.com/698706/4/check/nova-grenade-multinode/ee2dbea/logs/subnode-2/libvirt/qemu/instance-0000001b.txt

char device redirected to /dev/pts/0 (label charserial0)
2020-11-23T22:20:55.129189Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0x62 inconsistent with Host index 0xa1: delta 0xffc1
2020-11-23T22:20:55.129230Z qemu-system-x86_64: Failed to load virtio-blk:virtio
2020-11-23T22:20:55.129241Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-blk'
2020-11-23T22:20:55.129259Z qemu-system-x86_64: load of migration failed: Operation not permitted

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

OK, but that still says in both cases here we've got a virtio error telling us that the queues are broken before migration even starts; so we should try and figure out why that's happening first.

Revision history for this message
Thomas Huth (th-huth) wrote :

Is this still an issue with the latest release of QEMU (v6.0)?

Changed in qemu:
status: New → Incomplete
Revision history for this message
ignazio (cassano) wrote :

I got the same issue on CentOS 7 Stein.

Revision history for this message
ignazio (cassano) wrote :

Hello, some news... I wonder if it can help:
I am testing with some virtual machines again.
If I follow these steps it works (but I lose the network connection):

1) Detach the network interface from the instance
2) Attach the network interface to the instance
3) Migrate the instance
4) Log into the instance using the console and restart networking

whereas if I restart networking before the live migration, it does not work.
So, when someone mentioned

########################
we get this "guest index inconsistent" error when the migrated RAM is inconsistent with the migrated 'virtio' device state. And a common case is where a 'virtio' device does an operation after the vCPU is stopped and after RAM has been transmitted.
############################
could the network traffic be the problem?
Ignazio

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Be careful, it might not be the same bug.

Yes, it *shouldn't* be a problem, but if the virtio code in qemu is broken then it will keep accepting incoming packets even when the guest is stopped in the final part of the migration. You then get the contents of RAM taken before the reception of the packet, but the virtio state in the migration stream taken after the reception of the packet, and the two are inconsistent.

But the case the other reporter mentioned is on a virtio-blk device; the same thing can happen if the storage device stalls or is slow during the migration - i.e. a block read takes ages to complete and happens to complete after the point at which it should have stopped for migration.
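
To make the timing hazard described here concrete, below is a tiny toy model (not QEMU code; all names are invented for illustration) of how work completing after guest RAM has been transmitted, but before the device state is serialized, leaves the migrated RAM and the migrated device state disagreeing about the ring indices.

----------------------------------------------------------------------------
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of the ordering hazard: guest RAM (which holds the ring) is
 * migrated first, the device's internal counters are serialized later. */
struct guest_ram    { uint16_t ring_used_idx; };  /* lives in guest RAM   */
struct device_state { uint16_t used_idx; };       /* lives in QEMU state  */

int main(void)
{
    struct guest_ram    ram = { .ring_used_idx = 0x2a };
    struct device_state dev = { .used_idx      = 0x2a };

    /* 1. Final pass of RAM is transmitted; the destination's copy of the
     *    ring is now fixed at 0x2a. */
    struct guest_ram migrated_ram;
    memcpy(&migrated_ram, &ram, sizeof(ram));

    /* 2. A packet arrives (or a block read completes) while the source
     *    device is still active: it writes into *live* RAM and bumps its
     *    internal counter, but the destination never sees that RAM write. */
    ram.ring_used_idx++;
    dev.used_idx++;

    /* 3. The device state is serialized after step 2, so the destination
     *    receives old RAM paired with newer device state; virtio's
     *    load-time consistency checks then reject the stream. */
    printf("destination sees ring idx 0x%x but device idx 0x%x\n",
           (unsigned)migrated_ram.ring_used_idx, (unsigned)dev.used_idx);
    return 0;
}
----------------------------------------------------------------------------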

Revision history for this message
Thomas Huth (th-huth) wrote :

Is this still happening with the latest release?

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for QEMU because there has been no activity for 60 days.]

Changed in qemu:
status: Incomplete → Expired