The log attached in comment #15 is the failed case. And, as I guessed
in comment #12, if you look at the log file from comment #15 the
'block-job-cancel' command is issued too soon, before QEMU reports the
block device job status as "ready" true.
So, with request id "libvirt-14", when libvirt queries QEMU for the
block device job status:
Instead of waiting for the block device job status to turn to "ready":
true, Nova goes ahead and aborts the job (with 'block-job-cancel'), with
id "libvirt-15":
2016-09-01 11:11:43.058+0000: 29193: info : qemuMonitorIOWrite:528 :
QEMU_MONITOR_IO_WRITE: mon=0x7f98c4000e00
buf={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-15"}
So, this (aborting the job before it's ready) is causing the snapshot to
be corrupted. If Nova waits a little longer, then the snapshot is
successful (as it can be seen from my log analysis in comment #13, which
is the good case).
The log attached in comment #15 is the failed case. And, as I guessed
in comment #12, if you look at the log file from comment #15 the
'block-job-cancel' command is issued too soon, before QEMU reports the
block device job status as "ready" true.
So, with request id "libvirt-14", when libvirt queries QEMU for the
block device job status:
2016-09-01 11:11:43.053+0000: 29193: info : qemuMonitorIOWr ite:528 : MONITOR_ IO_WRITE: mon=0x7f98c4000e00 "execute" :"query- block-jobs" ,"id":" libvirt- 14"} len=50 ret=50
QEMU_
buf={
errno=11
It reports that it is not ready yet (see -- "ready": false):
2016-09-01 11:11:43.057+0000: 29193: info : qemuMonitorIOPr ocess:423 : MONITOR_ IO_PROCESS: mon=0x7f98c4000e00 buf={"return": virtio- disk0", "busy": true,
QEMU_
[{"io-status": "ok", "device": "drive-
"len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false,
"type": "mirror"}], "id": "libvirt-14"} len=187
Instead of waiting for the block device job status to turn to "ready": job-cancel' ), with
true, Nova goes ahead and aborts the job (with 'block-
id "libvirt-15":
2016-09-01 11:11:43.058+0000: 29193: info : qemuMonitorIOWr ite:528 : MONITOR_ IO_WRITE: mon=0x7f98c4000e00 "execute" :"block- job-cancel" ,"arguments" :{"device" :"drive- virtio- disk0"} ,"id":" libvirt- 15"}
QEMU_
buf={
So, this (aborting the job before it's ready) is causing the snapshot to
be corrupted. If Nova waits a little longer, then the snapshot is
successful (as it can be seen from my log analysis in comment #13, which
is the good case).