live migration failed with "Failed to load virtio-blk:virtio" (libvirt/qemu)

Bug #1737625 reported by Matt Riedemann
This bug affects 5 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This CI job failed on a Queens patch:

http://logs.openstack.org/07/516707/23/check/legacy-tempest-dsvm-multinode-live-migration/d9089b2/job-output.txt.gz#_2017-12-11_20_29_09_364369

This is the corresponding QEMU guest log:

http://logs.openstack.org/07/516707/23/check/legacy-tempest-dsvm-multinode-live-migration/d9089b2/logs/libvirt/qemu/instance-0000000a.txt.gz

2017-12-11T20:28:17.493781Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0x11f inconsistent with Host index 0x121: delta 0xfffe
2017-12-11T20:28:17.493803Z qemu-system-x86_64: Failed to load virtio-blk:virtio
2017-12-11T20:28:17.493807Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-blk'
2017-12-11T20:28:17.493874Z qemu-system-x86_64: load of migration failed: Operation not permitted
2017-12-11 20:28:17.494+0000: shutting down, reason=failed
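
A note on reading that first line: the "delta" appears to be just the 16-bit (wrapping) difference between the two indices, and the load is refused because that difference exceeds the queue size. A minimal sketch re-deriving the logged values (variable names are mine, purely illustrative):

    # Re-derive "delta 0xfffe" from the guest log above (illustrative only).
    ring_size = 0x80        # "VQ 0 size 0x80"
    guest_index = 0x11f     # avail index as found in the migrated guest RAM
    host_index = 0x121      # the device's own index from the migrated virtio state

    # Both indices are free-running 16-bit counters, so the difference wraps modulo 2**16.
    delta = (guest_index - host_index) & 0xFFFF
    assert delta == 0xFFFE  # matches "delta 0xfffe"

    # The incoming migration is rejected because the implied number of
    # outstanding requests is larger than the ring itself.
    assert delta > ring_size  # -> "Failed to load virtio-blk:virtio"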

This is the failure in the nova-compute logs:

http://logs.openstack.org/07/516707/23/check/legacy-tempest-dsvm-multinode-live-migration/d9089b2/logs/subnode-2/screen-n-cpu.txt.gz?level=TRACE#_Dec_11_20_28_17_768795

Dec 11 20:28:17.768795 ubuntu-xenial-ovh-bhs1-0001369824 nova-compute[2304]: ERROR nova.virt.libvirt.driver [None req-8bd84a16-a341-441f-afbb-b35c3decb59a service nova] [instance: 58d015a2-4793-497c-a973-99b9dc881f04] Live Migration failure: operation failed: job: unexpectedly failed: libvirtError: operation failed: job: unexpectedly failed
Dec 11 20:28:17.834511 ubuntu-xenial-ovh-bhs1-0001369824 nova-compute[2304]: ERROR nova.virt.libvirt.driver [None req-8bd84a16-a341-441f-afbb-b35c3decb59a service nova] [instance: 58d015a2-4793-497c-a973-99b9dc881f04] Migration operation has aborted

Matt Riedemann (mriedem)
tags: added: live-migration
removed: live-
Changed in nova:
status: New → Confirmed
Kashyap Chamarthy (kashyapc) wrote :

From another occurrence (reported by Lee Yarwood on IRC, as part of stable/pike), here is the full error from the source QEMU log:

http://logs.openstack.org/07/531407/2/check/legacy-tempest-dsvm-multinode-live-migration/7d66bcf/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz
-------------------------------------------------------------------------------------------------
2018-01-09 00:57:31.473+0000: starting up libvirt version: 2.5.0, package: 3ubuntu5.6~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Wed, 08 Nov 2017 13:56:28 +0000), qemu version: 2.8.0(Debian 1:2.8+dfsg-3ubuntu2.9~cloud1), hostname: ubuntu-xenial-ovh-gra1-0001744892
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-x86_64 -name guest=instance-00000002,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-4-instance-00000002/master-key.aes -machine pc-i440fx-zesty,accel=tcg,usb=off,dump-guest-core=off -m 64 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid defdc790-3c2a-48d3-8903-3ab7e196c9b7 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=16.0.5,serial=17f2e748-07d1-4394-95f5-3d7b99426b23,uuid=defdc790-3c2a-48d3-8903-3ab7e196c9b7,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-4-instance-00000002/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/opt/stack/data/nova/instances/defdc790-3c2a-48d3-8903-3ab7e196c9b7/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:fd:72:44,bus=pci.0,addr=0x3 -add-fd set=1,fd=32 -chardev pty,id=charserial0,logfile=/dev/fdset/1,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
char device redirected to /dev/pts/0 (label charserial0)
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
2018-01-09T00:57:36.515658Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0x12b inconsistent with Host index 0x133: delta 0xfff8
2018-01-09T00:57:36.515716Z qemu-system-x86_64: Failed to load virtio-blk:virtio
2018-01-09T00:57:36.515724Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-blk'
2018-01-09T00:57:36.516158Z qemu-system-x86_64: load of migration failed: Operation not permitted
2018-01-09 00:57:36.612+0000: shutting down, reason=failed
-------------------------------------------------------------------------------------------------
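
Note the same arithmetic signature as in the original report: 0x12b - 0x133 wraps to 0xfff8 as a 16-bit value, again far larger than the ring size 0x80, which is why the destination refuses to load the virtio-blk state.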

Kashyap Chamarthy (kashyapc) wrote :

Not sure if we have a reliable reproducer here.

But Stefan Hajnoczi from the QEMU team, who has fixed similar issues in this area before, says: "This issue probably requires debugging. This kind of problem can occur if the vring memory is not up-to-date (e.g. requests were processed after migration paused the guest)."
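
For orientation (my reading, not Stefan's words): the error string looks like it comes from the sanity check QEMU performs while loading virtio device state on the destination (virtio_load() in hw/virtio/virtio.c, if I'm reading the right place). It compares the avail index found in the migrated vring memory against the device's own record of how far it had consumed that ring, and rejects the stream when the gap is larger than the ring, which is exactly what a stale vring copy would produce. A rough paraphrase, not QEMU code:

    # Rough model of the destination-side check (paraphrased; names are illustrative).
    def check_virtqueue(avail_idx_from_guest_ram, last_avail_idx_from_device_state, ring_size):
        nheads = (avail_idx_from_guest_ram - last_avail_idx_from_device_state) & 0xFFFF
        if nheads > ring_size:
            raise RuntimeError(
                "Guest index 0x%x inconsistent with Host index 0x%x: delta 0x%x"
                % (avail_idx_from_guest_ram, last_avail_idx_from_device_state, nheads))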

A similar bug (but with a different error message, "Virtqueue size exceeded") in this area:

    https://bugzilla.redhat.com/show_bug.cgi?id=1378788 -- Race condition during virtio-blk
    dataplane stop triggers "Virtqueue size exceeded"

Kashyap Chamarthy (kashyapc) wrote :

Fast-forward more than two and a half years: another hit[1] upstream (in Zuul):
---------------------------------------------------------------
2020-11-20 14:25:11.589+0000: starting up libvirt version: 6.0.0, package: 0ubuntu8.4~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Tue, 15 Sep 2020 20:36:28 +0000), qemu version: 4.2.1Debian 1:4.2-3ubuntu6.7~cloud0, kernel: 4.15.0-124-generic, hostname: ubuntu-bionic-ovh-bhs1-0021872194
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
HOME=/var/lib/libvirt/qemu/domain-10-instance-0000001a \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-10-instance-0000001a/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-10-instance-0000001a/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-10-instance-0000001a/.config \
QEMU_AUDIO_DRV=none \
/usr/bin/qemu-system-x86_64 \
-name guest=instance-0000001a,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-10-instance-0000001a/master-key.aes \
-machine pc-i440fx-4.2,accel=tcg,usb=off,dump-guest-core=off \
-cpu qemu64 \
-m 128 \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 2c468d92-4b19-426a-8c25-16b4624c21a4 \
-smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=22.1.0,serial=2c468d92-4b19-426a-8c25-16b4624c21a4,uuid=2c468d92-4b19-426a-8c25-16b4624c21a4,family=Virtual Machine' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=32,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-blockdev '{"driver":"file","filename":"/opt/stack/data/nova/instances/_base/61bd5e531ab4c82456aa5300ede7266b3610be79","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-blockdev '{"driver":"file","filename":"/opt/stack/data/nova/instances/2c468d92-4b19-426a-8c25-16b4624c21a4/disk","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-2-format"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on \
-netdev tap,fd=34,id=hostnet0 \
-device virtio-net-pci,host_mtu=1400,netdev=hostnet0,id=net0,mac=fa:16:3e:43:11:f4,bus=pci.0,addr=0x3 \
-add-fd set=2,fd=36 \
-chardev pty,id=charserial0,logfile=/dev/fdset/2,logappend=on \
-device isa-serial,chardev=charserial0,id=serial0 \
-vnc 0.0.0.0:0 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x6 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg tim...


Kashyap Chamarthy (kashyapc) wrote :

I just talked to Dave Gilbert from upstream QEMU. Overall, as I implied in comment #2, this gnarly issue requires specialized debugging, digging deep into the bowels of QEMU, 'virtio-blk', and 'virtio'.

That said, Dave notes that we get this "guest index inconsistent" error when the migrated RAM is inconsistent with the migrated 'virtio' device state. And a common case is where a 'virtio' device does an operation after the vCPU is stopped and after RAM has been transmitted.

Dave offers some guesswork about a potential scenario where this can occur:

  - Guest is running
  - ... live migration starts
  - ... a "block read" request gets submitted
  - ... live migration stops the vCPUs, finishes transmitting RAM
  - ... the "block read" completes, 'virtio-blk' updates pointers
  - ... live migration "serializes" the 'virtio-blk' state

So the "guest index inconsistent" state would only happen if you got unlucky with the timing of that read.
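
To make that timing concrete, here is a toy model of the mismatch Dave describes. It is guesswork layered on guesswork and makes no claim to follow QEMU's actual code paths; it only shows how "device state newer than the transferred RAM" ends up as the logged signature:

    # Toy timeline (illustrative only).
    guest_ram = {"avail_idx": 0x11F}          # ring bookkeeping as it sits in guest RAM
    device_state = {"last_avail_idx": 0x11F}  # virtio-blk's internal index

    # 1. vCPUs are stopped and RAM is transmitted while a block read is in flight.
    ram_on_destination = dict(guest_ram)

    # 2. The block read completes afterwards and virtio-blk updates its pointers;
    #    the RAM copy already sent to the destination never sees that update.
    device_state["last_avail_idx"] += 2

    # 3. The virtio-blk state (with the newer index) is serialized and sent.
    # 4. On the destination, the two disagree by the familiar wrapped delta:
    delta = (ram_on_destination["avail_idx"] - device_state["last_avail_idx"]) & 0xFFFF
    print(hex(delta))  # 0xfffe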

Another possibility, Dave points out, is that the guest has screwed up the device state somehow; the migration code in 'virtio' checks the state a lot. We have ruled this possibility out because the guest is just a garden-variety CirrOS instance idling; nothing special about it.
