libvirt driver stuck deleting online snapshot

Bug #1627134 reported by Evgeny Antyshev
Affects                            Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)           Fix Released  High        Matthew Booth
OpenStack Compute (nova) / Newton  Fix Released  High        Lee Yarwood

Bug Description

There is a problem in nova code in nova/virt/libvirt/driver.py:
            dev = guest.get_block_device(rebase_disk)
            if guest.is_active():
                result = dev.rebase(rebase_base, relative=relative)
                if result == 0:
                    LOG.debug('blockRebase started successfully',
                              instance=instance)

                while dev.wait_for_job(abort_on_error=True):
                    LOG.debug('waiting for blockRebase job completion',
                              instance=instance)
                    time.sleep(0.5)

It expects the libvirt block job to remain for some period in a 'cur == end' state with end != 0 (the wait_for_job logic). In fact, at least with libvirt 1.3.3.2 and libvirt-python 1.2.17, we are not guaranteed to catch the job in such a state before it disappears and the libvirt call returns an empty result, which get_job_info() represents as BlockDeviceJobInfo(job=0, bandwidth=0, cur=0, end=0).
Such a result doesn't match the wait_for_job finish criteria (effective since I45ac06eae0b1949f746dae305469718649bfcf23 was merged).
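
To make the failure mode concrete, here is a minimal sketch of the loop described above. This is a simplified illustration rather than the actual nova code: the BlockDeviceJobInfo fields follow the namedtuple named in this report, and the finish check is an assumption based on the "cur == end, with end != 0" description.

    import collections

    # Simplified stand-in for nova's BlockDeviceJobInfo namedtuple.
    BlockDeviceJobInfo = collections.namedtuple(
        'BlockDeviceJobInfo', ['job', 'bandwidth', 'cur', 'end'])

    def get_job_info(domain, disk):
        # virDomainGetBlockJobInfo returns {} once the job has been
        # reaped, which here collapses into an all-zero result.
        status = domain.blockJobInfo(disk, 0)
        return BlockDeviceJobInfo(job=status.get('type', 0),
                                  bandwidth=status.get('bandwidth', 0),
                                  cur=status.get('cur', 0),
                                  end=status.get('end', 0))

    def wait_for_job(domain, disk):
        """Return True while the job is considered still running."""
        status = get_job_info(domain, disk)
        # Assumed finish criteria, per the description above: the job is
        # only treated as done when cur == end with a non-zero end.  The
        # all-zero result produced after the job disappears never matches,
        # so the caller's "while wait_for_job(...)" loop spins forever.
        job_ended = status.end != 0 and status.cur == status.end
        return not job_ended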

This bug started to occur in our third-party CI:
http://openstack-3rd-party-storage-ci-logs.virtuozzo.com/28/314928/13/check/dsvm-tempest-kvm/5aae7aa

n-cpu.log:
2016-08-17 15:47:04.856 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] volume_snapshot_delete: delete_info: {u'type': u'qcow2', u'merge_target_file': None, u'file_to_merge': None, u'volume_id': u'3e64cef0-03e3-407e-b6c5-fac873a7c98a'} _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2054
2016-08-17 15:47:04.864 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] found device at vda _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2098
2016-08-17 15:47:04.864 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] disk: vda, base: None, bw: 0, relative: False _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2171
2016-08-17 15:47:04.868 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] blockRebase started successfully _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2178
2016-08-17 15:47:04.889 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:05.396 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:05.951 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:06.456 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:06.968 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182
2016-08-17 15:47:07.594 42835 DEBUG nova.virt.libvirt.driver [req-81ae5279-0750-4745-839f-6d92f9ab3dc9 nova service] [instance: 018e566a-916b-4b76-9971-b4d4c12ea0b6] waiting for blockRebase job completion _volume_snapshot_delete /opt/stack/new/nova/nova/virt/libvirt/driver.py:2182

BTW, I didn't find any tests checking this in the gate.

Tags: libvirt
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: New → In Progress
assignee: nobody → Evgeny Antyshev (eantyshev)
tags: added: libvirt
Changed in nova:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

I think this probably only hits certain volume backends; someone else was reporting the same thing with NFS, and I think Virtuozzo uses a RemoteFS-style driver too.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I think the libvirt method virDomainGetBlockJobInfo shouldn't return an empty dict. Please check my patch, https://review.openstack.org/#/c/365756/, which tries to clean up this code a little bit. Maybe it would help somehow?

Revision history for this message
Evgeny Antyshev (eantyshev) wrote :

That code won't work: blockJobInfo() returns {} after the rebase job is complete, and the proposed code will fail any matching in wait_for_job() when {} is passed.

I've checked this on my third-party CI (for the Virtuozzo Storage volume backend). I also found that this code isn't exercised in the upstream gate, only by unit tests.
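
The behaviour is easy to see directly from libvirt-python. A hypothetical probe (the domain name and disk are examples only): while a block job is running, blockJobInfo() returns a populated dict; once the job has finished and been reaped by libvirt it returns {}, which the pre-fix wait_for_job() never treats as "complete".

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # example domain

    info = dom.blockJobInfo('vda', 0)
    if info:
        # e.g. {'type': 1, 'bandwidth': 0, 'cur': 613941248, 'end': 1073741824}
        print('block job in progress: %(cur)d / %(end)d' % info)
    else:
        # {} -- no job record; the driver should treat this as finished
        print('no block job reported')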

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Currently the Virtuozzo CI doesn't enable libvirt debug log filters in its libvirtd log:
http://openstack-3rd-party-storage-ci-logs.virtuozzo.com/28/314928/13/check/dsvm-tempest-kvm/5aae7aa/logs/libvirt/libvirtd.txt.gz

Can you also please add libvirt debug logs with log filters enabled, so libvirt folks could take a look:

(1) In /etc/libvirt/libvirtd.conf:

  log_filters="1:libvirt.c 1:qemu 1:conf 1:security 3:object 3:event 3:json 3:file 1:util 1:cpu"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"

(2) Restart libvirtd

(3) Repeat the test

(4) Attach the libvirtd.log file here.

Revision history for this message
Evgeny Antyshev (eantyshev) wrote :

Here's a libvirtd log covering the period from the rebase job's creation until its termination:

http://paste.openstack.org/show/583277/

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Looking at the log...

Here we can see that the libvirt blockRebase() (QEMU 'block-stream') job has started:

-----
Sep 20 15:10:39 localhost libvirtd: 85852: info : qemuMonitorSend:1007 : QEMU_MONITOR_SEND_MSG: mon=0x7f64f8010bc0 msg={"execute":"block-stream","arguments":{"device":"drive-virtio-disk1","base":"/opt/stack/data/nova/mnt/530180c6e38955fc032f76c2c0b8b443/volume-0a1fd4cf-643b-4077-943d-69a6647ccf9b.8ffec8b3-fba4-4b41-969f-640dc7ea792a","backing-file":"volume-0a1fd4cf-643b-4077-943d-69a6647ccf9b.8ffec8b3-fba4-4b41-969f-640dc7ea792a"},"id":"libvirt-109"}#015#012 fd=-1
-----

libvirt queries QEMU for block job status:

-----
Sep 20 15:10:39 localhost libvirtd: 85849: info : qemuMonitorIOWrite:531 : QEMU_MONITOR_IO_WRITE: mon=0x7f64f8010bc0 buf={"execute":"query-block-jobs","id":"libvirt-110"}#015#012 len=51 ret=51 errno=11
-----

QEMU responds with the current job status (and you can see that len != offset, so it is still copying):

-----
Sep 20 15:10:39 localhost libvirtd: 85849: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_REPLY: mon=0x7f64f8010bc0 reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk1", "busy": false, "len": 1073741824, "offset": 613941248, "paused": false, "speed": 0, "ready": false, "type": "stream"}], "id": "libvirt-110"}
-----

Shortly after that, we can see that len == offset, meaning the copy job (type "stream") is finished, and QEMU has emitted the BLOCK_JOB_COMPLETED event:

-----
Sep 20 15:10:39 localhost libvirtd: 85849: info : qemuMonitorEmitEvent:1272 : mon=0x7f64f8010bc0 event=BLOCK_JOB_COMPLETED data={"device":"drive-virtio-disk1","len":1073741824,"offset":1073741824,"speed":0,"type":"stream"}
-----

Now libvirt queries QEMU for the job status once again (line 148 of the paste):

-----
Sep 20 15:10:39 localhost libvirtd: 85854: info : qemuMonitorSend:1007 : QEMU_MONITOR_SEND_MSG: mon=0x7f64f8010bc0 msg={"execute":"query-block-jobs","id":"libvirt-111"}#015#012 fd=-1
-----

The response (notice the ID 'libvirt-111' matches the query above) contains an empty job list:

-----
Sep 20 15:10:39 localhost libvirtd: 85849: info : qemuMonitorIOProcess:426 : QEMU_MONITOR_IO_PROCESS: mon=0x7f64f8010bc0 buf={"return": [], "id": "libvirt-111"}#015#012 len=37
-----

The above response from QEMU means that no jobs are active, so with current libvirt versions libvirt will _not_ report any job status.

Upstream libvirt says that a future libvirt might track that a job has finished and provide state even when QEMU doesn't report any. (I'll file an upstream libvirt bug to track this.)
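
For reference, a small illustration of how the QMP replies quoted above map onto job state. The dictionaries mirror the log excerpts; this is an interpretation aid, not QEMU or libvirt code.

    running_reply = {
        "return": [{"device": "drive-virtio-disk1", "busy": False,
                    "len": 1073741824, "offset": 613941248,
                    "ready": False, "type": "stream"}],
        "id": "libvirt-110",
    }
    finished_reply = {"return": [], "id": "libvirt-111"}

    def describe(reply):
        jobs = reply["return"]
        if not jobs:
            # Empty list: QEMU no longer tracks the job, so
            # virDomainGetBlockJobInfo has nothing to report either.
            return "no active block jobs (job completed and reaped)"
        job = jobs[0]
        if job["offset"] == job["len"]:
            return "job finished, completion event pending"
        return "job in progress: %d of %d bytes" % (job["offset"], job["len"])

    print(describe(running_reply))   # job in progress: 613941248 of 1073741824 bytes
    print(describe(finished_reply))  # no active block jobs (job completed and reaped)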

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/378746

Changed in nova:
assignee: Evgeny Antyshev (eantyshev) → Matthew Booth (mbooth-9)
Changed in nova:
assignee: Matthew Booth (mbooth-9) → Evgeny Antyshev (eantyshev)
Revision history for this message
Erlon R. Cruz (sombrafam) wrote :

I did some tests using NFS snapshots around 30 days ago and this issue was not happening. I ran the same tests again and indeed the snapshots are getting stuck. Do you know what introduced the problem?

Changed in nova:
assignee: Evgeny Antyshev (eantyshev) → Matthew Booth (mbooth-9)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/378746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0f4bd241665c287e49f2d30ca79be96298217b7e
Submitter: Jenkins
Branch: master

commit 0f4bd241665c287e49f2d30ca79be96298217b7e
Author: Matthew Booth <email address hidden>
Date: Wed Sep 28 16:44:41 2016 +0100

    libvirt: Fix BlockDevice.wait_for_job when qemu reports no job

    We were misinterpreting the return value of blockJobInfo. Most
    immediately we were expecting it to return an integer, which has never
    been the case. blockJobInfo also raises an exception on error. Note
    that the implementation of abort_on_error has always expected an
    integer return value, and exceptions have never been handled, which
    means that abort_on_error has always been a no-op, and exceptions have
    never been swallowed. As this is also the most intuitive behaviour, we
    make it explicit by removing abort_on_error. Any exception raised by
    blockJobInfo will continue to propagate unhandled.

    We were obfuscating the return value indicating that the job did not
    exist, {}, by populating a BlockDeviceJobInfo with fake values. We
    de-obfuscate this by returning None instead, which is unambiguous.

    wait_for_job() was misnamed, as it does not wait. This is renamed to
    is_job_complete() to be less confusing. Note that the logic is
    reversed.

    After discussion with Eric Blake of the libvirt team (see review
    comments: https://review.openstack.org/#/c/375652/), we are now
    confident asserting that if no job exists then it has completed
    (although we are still not sure that it succeeded). Consequently we
    remove the wait_for_job_clean parameter, and always assume that no job
    means it has completed. Previously this was implicit because no job
    meant a defaulted BlockDeviceJobInfo.job value of 0.

    Co-authored-by: Sławek Kapłoński <email address hidden>
    Closes-Bug: #1627134
    Change-Id: I2d0daa32b1d37fa60412ad7a374ee38cebdeb579
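
For context, here is a minimal sketch of the behaviour the commit message describes: get_job_info() returning None when libvirt reports no job, and is_job_complete() treating that as completion. This is a simplified, free-standing illustration under those assumptions, not the merged nova code.

    import collections

    BlockDeviceJobInfo = collections.namedtuple(
        'BlockDeviceJobInfo', ['job', 'bandwidth', 'cur', 'end'])

    def get_job_info(domain, disk):
        """Return a BlockDeviceJobInfo, or None if no job exists.

        blockJobInfo() raises libvirt.libvirtError on failure; as the
        commit message notes, such exceptions are left to propagate.
        """
        status = domain.blockJobInfo(disk, 0)
        if not status:
            return None  # {} from libvirt: the job record is gone
        return BlockDeviceJobInfo(job=status['type'],
                                  bandwidth=status['bandwidth'],
                                  cur=status['cur'],
                                  end=status['end'])

    def is_job_complete(domain, disk):
        """True if the job has completed (not necessarily succeeded)."""
        status = get_job_info(domain, disk)
        if status is None:
            # No job exists, so per the discussion above it has completed.
            return True
        # Assumed completion check: cur has reached a non-zero end, so a
        # just-created job still reporting end == 0 is not counted as done.
        return status.end != 0 and status.cur == status.end

With this shape the caller's loop becomes "while not is_job_complete(...): time.sleep(0.5)", which exits as soon as the job record disappears; this is the reversed logic the commit message mentions.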

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Evgeny Antyshev (<email address hidden>) on branch: master
Review: https://review.openstack.org/375652

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/387292

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b1

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/387292
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a021a72ddaea0e1eb6e3f4b934870fd5f2a05485
Submitter: Jenkins
Branch: stable/newton

commit a021a72ddaea0e1eb6e3f4b934870fd5f2a05485
Author: Matthew Booth <email address hidden>
Date: Wed Sep 28 16:44:41 2016 +0100

    libvirt: Fix BlockDevice.wait_for_job when qemu reports no job

    We were misinterpreting the return value of blockJobInfo. Most
    immediately we were expecting it to return an integer, which has never
    been the case. blockJobInfo also raises an exception on error. Note
    that the implementation of abort_on_error has always expected an
    integer return value, and exceptions have never been handled, which
    means that abort_on_error has always been a no-op, and exceptions have
    never been swallowed. As this is also the most intuitive behaviour, we
    make it explicit by removing abort_on_error. Any exception raised by
    blockJobInfo will continue to propagate unhandled.

    We were obfuscating the return value indicating that the job did not
    exist, {}, by populating a BlockDeviceJobInfo with fake values. We
    de-obfuscate this by returning None instead, which is unambiguous.

    wait_for_job() was misnamed, as it does not wait. This is renamed to
    is_job_complete() to be less confusing. Note that the logic is
    reversed.

    After discussion with Eric Blake of the libvirt team (see review
    comments: https://review.openstack.org/#/c/375652/), we are now
    confident asserting that if no job exists then it has completed
    (although we are still not sure that it succeeded). Consequently we
    remove the wait_for_job_clean parameter, and always assume that no job
    means it has completed. Previously this was implicit because no job
    meant a defaulted BlockDeviceJobInfo.job value of 0.

    Co-authored-by: Sławek Kapłoński <email address hidden>
    Closes-Bug: #1627134
    Change-Id: I2d0daa32b1d37fa60412ad7a374ee38cebdeb579
    (cherry picked from commit 0f4bd241665c287e49f2d30ca79be96298217b7e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/433772

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.4

This issue was fixed in the openstack/nova 14.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/433772
