I've added some more debugging to the live snapshot code in this change:
https://review.openstack.org/#/c/103066/
When it failed in this test run:
http://logs.openstack.org/66/103066/7/check/check-tempest-dsvm-postgres-full/02e97b1
I see:
2014-06-30 11:55:55.398+0000: 18078: debug : virDomainGetBlockJobInfo:19415 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, info=0x7f1315ffa390, flags=0
2014-06-30 11:55:55.415 WARNING nova.virt.libvirt.driver [req-3e682ad2-5af5-47d2-a72d-9bac23e8c2bc ListImageFiltersTestJSON-1350287421 ListImageFiltersTestJSON-1927810029] blockJobInfo snapshot cur=0 end=25165824
2014-06-30 11:55:56.074+0000: 18071: debug : virDomainGetBlockJobInfo:19415 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, info=0x7f13256ea390, flags=0
2014-06-30 11:55:56.094 WARNING nova.virt.libvirt.driver [req-3e682ad2-5af5-47d2-a72d-9bac23e8c2bc ListImageFiltersTestJSON-1350287421 ListImageFiltersTestJSON-1927810029] blockJobInfo snapshot cur=25165824 end=25165824
This shows that as far as virDomainGetBlockJobInfo is concerned, the job has completed copying the data in very little time, which seems reasonable considering it is a cirros image.
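
For reference, the polling that produces those blockJobInfo lines looks roughly like this with the libvirt Python bindings (a minimal sketch, not Nova's actual code; the connection URI is an assumption, while the domain name and disk path are copied from the logs above):

import time

import libvirt

# Values copied from the logs above, for illustration only.
DOMAIN_NAME = 'instance-00000020'
DISK_PATH = ('/opt/stack/data/nova/instances/'
             'b1e3c5de-af31-4a4d-94dd-ef382936583b/disk')

conn = libvirt.open('qemu:///system')  # assumed URI (devstack default)
dom = conn.lookupByName(DOMAIN_NAME)

while True:
    # Maps to virDomainGetBlockJobInfo; returns a dict containing
    # 'cur' and 'end' while a block job is active on this disk.
    info = dom.blockJobInfo(DISK_PATH, 0)
    if not info or info['cur'] == info['end']:
        break  # copy phase finished (cur == end)
    time.sleep(0.5)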
We then go into a virDomainBlockJobAbort call to finish the snapshot operation:
2014-06-30 11:55:56.127+0000: 18070: debug : virDomainBlockJobAbort:19364 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, flags=0
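
In the Python bindings that is simply the following (sketch, reusing the dom handle from the previous snippet; flags=0 matches the trace above):

# Maps to virDomainBlockJobAbort. With flags=0 this synchronously
# ends the job, so the calling thread blocks until QEMU responds;
# that is exactly where things get stuck here.
dom.blockJobAbort(DISK_PATH, 0)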
This should take a fraction of a second, but after 3 minutes it still isn't done. Tempest gets fed up waiting and so issues a call to destroy the guest:
2014-06-30 11:59:10.341+0000: 18090: debug : virDomainDestroy:2173 : dom=0x7f12f0002910, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b)
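
That destroy is roughly equivalent to the following (sketch; it necessarily comes over a separate connection/thread, since the first one is still blocked inside blockJobAbort, and indeed the dom pointer in the trace above differs from the earlier calls):

# Maps to virDomainDestroy.
conn2 = libvirt.open('qemu:///system')  # assumed URI, as before
conn2.lookupByName(DOMAIN_NAME).destroy()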
Shortly thereafter QEMU is dead and the virDomainBlockJobAbort call returns, obviously with an error:
2014-06-30 11:59:21.279 17542 TRACE nova.compute.manager [instance: b1e3c5de-af31-4a4d-94dd-ef382936583b] libvirtError: Unable to read from monitor: Connection reset by peer
So, based on this debug info I think that Nova is doing the right thing, and this is probably a bug in QEMU (or possibly, but unlikely, a bug in libvirt). My inclination is that QEMU is basically hanging in the block job abort call, due to some fairly infrequently hit race condition.