Comment 24 for bug 1334398

Daniel Berrange (berrange) wrote :

I've added some more debugging to the live snapshot code in this change:

  https://review.openstack.org/#/c/103066/

When it failed in this test run:

  http://logs.openstack.org/66/103066/7/check/check-tempest-dsvm-postgres-full/02e97b1

I see

2014-06-30 11:55:55.398+0000: 18078: debug : virDomainGetBlockJobInfo:19415 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, info=0x7f1315ffa390, flags=0
2014-06-30 11:55:55.415 WARNING nova.virt.libvirt.driver [req-3e682ad2-5af5-47d2-a72d-9bac23e8c2bc ListImageFiltersTestJSON-1350287421 ListImageFiltersTestJSON-1927810029] blockJobInfo snapshot cur=0 end=25165824

2014-06-30 11:55:56.074+0000: 18071: debug : virDomainGetBlockJobInfo:19415 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, info=0x7f13256ea390, flags=0
2014-06-30 11:55:56.094 WARNING nova.virt.libvirt.driver [req-3e682ad2-5af5-47d2-a72d-9bac23e8c2bc ListImageFiltersTestJSON-1350287421 ListImageFiltersTestJSON-1927810029] blockJobInfo snapshot cur=25165824 end=25165824

This shows that, as far as virDomainGetBlockJobInfo is concerned, the job finished copying the data almost immediately, which seems reasonable given it is a cirros image.
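
For reference, the polling involved here looks roughly like this (a minimal libvirt-python sketch, not Nova's actual code; the function name and poll interval are placeholders):

  import time

  def wait_for_block_job(dom, disk_path, poll_interval=0.5):
      """Poll virDomainGetBlockJobInfo until the job reports cur == end.

      'dom' is a libvirt.virDomain handle for the running instance.
      """
      while True:
          # blockJobInfo() wraps virDomainGetBlockJobInfo; the python
          # bindings return a dict with 'cur' and 'end' byte offsets, or an
          # empty result if no job exists on that disk.
          info = dom.blockJobInfo(disk_path, 0)
          if not info or info['cur'] == info['end']:
              # Corresponds to the logged "cur=25165824 end=25165824" above.
              return
          time.sleep(poll_interval)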

We then go into a virDomainBlockJobAbort call to finish the snapshot operation:

  2014-06-30 11:55:56.127+0000: 18070: debug : virDomainBlockJobAbort:19364 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, flags=0

This should take a fraction of a second, but after 3 minutes it still isn't done. Tempest gets fed up waiting and issues a call to destroy the guest:
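
For context, the abort itself is just a synchronous call on the driver thread, roughly (sketch only, continuing the one above):

  def finish_live_snapshot(dom, disk_path):
      # blockJobAbort() wraps virDomainBlockJobAbort; flags=0 matches the
      # call in the debug log above. The call is synchronous, so if QEMU
      # never completes the underlying monitor command, the calling thread
      # simply blocks here.
      dom.blockJobAbort(disk_path, 0)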

  2014-06-30 11:59:10.341+0000: 18090: debug : virDomainDestroy:2173 : dom=0x7f12f0002910, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b)

Shortly thereafter QEMU is dead and the virDomainBlockJobAbort call finally returns, unsurprisingly with an error:

2014-06-30 11:59:21.279 17542 TRACE nova.compute.manager [instance: b1e3c5de-af31-4a4d-94dd-ef382936583b] libvirtError: Unable to read from monitor: Connection reset by peer
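
In other words (continuing the sketch above; this is not Nova's actual error handling), the blocked call only returns once the domain is gone, and then only by raising:

  import logging

  import libvirt

  LOG = logging.getLogger(__name__)

  def abort_block_job(dom, disk_path):
      try:
          dom.blockJobAbort(disk_path, 0)
      except libvirt.libvirtError:
          # With the guest destroyed underneath the blocked call, the pending
          # monitor command fails with "Unable to read from monitor:
          # Connection reset by peer" and the exception propagates up to
          # nova.compute.manager, producing the TRACE above.
          LOG.exception('block job abort failed for %s', disk_path)
          raise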

So, based on this debug info, I think Nova is doing the right thing and this is probably a bug in QEMU (or possibly, though less likely, a bug in libvirt). My inclination is that QEMU is hanging in the block job abort call due to some rarely hit race condition.