swap volume intermittently fails with "libvirtError: block copy still active: domain has active block job"

Bug #1630600 reported by Matt Riedemann
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Lee Yarwood
Milestone: (none)

Bug Description

We now have a tempest test that exercises the nova swap volume API, and it intermittently fails while waiting for the first attached volume to be detached and go from 'in-use' to 'available' status. That happens because the swap volume operation fails in the compute service:

http://logs.openstack.org/73/374373/2/check/gate-tempest-dsvm-full-ubuntu-xenial/149fe3e/logs/screen-n-cpu.txt.gz?level=TRACE#_2016-10-04_15_54_52_078

2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [req-4e0ea92b-ab6d-47aa-88ba-046d1b880907 tempest-TestVolumeSwap-1737812056 tempest-TestVolumeSwap-1737812056] [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] Failed to swap volume 690b7f29-5c8c-4fb6-8ae4-308a2fb1959c for 7b86913d-78e4-4b68-8df2-63f1a4c6656a
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] Traceback (most recent call last):
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/opt/stack/new/nova/nova/compute/manager.py", line 4870, in _swap_volume
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] resize_to)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1230, in swap_volume
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] self._swap_volume(guest, disk_dev, conf.source_path, resize_to)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1200, in _swap_volume
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] self._host.write_instance_config(xml)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 860, in write_instance_config
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] domain = self.get_connection().defineXML(xml)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] result = proxy_call(self._autowrap, f, *args, **kwargs)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] rv = execute(f, *args, **kwargs)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] six.reraise(c, e, tb)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] rv = meth(*args, **kwargs)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 3650, in defineXML
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f] libvirtError: block copy still active: domain has active block job
2016-10-04 15:54:52.078 15500 ERROR nova.compute.manager [instance: e0d9f1ff-f5ab-4afe-9eff-0a455006ec3f]

It looks like we're not waiting long enough for the block job to complete; we think the problem is the abort_job call in this case:

https://github.com/openstack/nova/blob/0f4bd241665c287e49f2d30ca79be96298217b7e/nova/virt/libvirt/driver.py#L1188

Because the tempest test uses volumes of the same size, resize_to will be 0 and thus falsy, so we never get into the second block that does a wait loop.
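
For context, here is a paraphrased sketch of the swap flow described above (not the exact nova source; names are simplified). With resize_to == 0 the final wait loop is skipped, so the persistent domain XML gets re-defined while the block job may still be pivoting:

    import time

    def swap_volume_sketch(dev, new_path, resize_to):
        """Paraphrased sketch of the libvirt driver swap flow (not nova code)."""
        # Start the block copy onto the new volume, reusing the existing file.
        dev.rebase(new_path, copy=True, reuse_ext=True)

        # Wait until the copy has converged (source and destination in sync).
        while dev.wait_for_job():
            time.sleep(0.5)

        # Pivot to the new volume. libvirt treats blockJobAbort with the
        # pivot flag as asynchronous, so the block job may still be active here.
        dev.abort_job(pivot=True)

        if resize_to:
            # Only entered when the new volume is larger: this loop waits for
            # the pivot to finish before resizing the disk.
            while dev.wait_for_job(wait_for_job_clean=True):
                time.sleep(0.5)
            dev.resize(resize_to)

        # With resize_to == 0 we fall straight through to re-defining the
        # persistent domain XML, which libvirt rejects with
        # "block copy still active: domain has active block job".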

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
assignee: nobody → Matthew Booth (mbooth-9)
status: New → In Progress
importance: Undecided → Medium
tags: added: newton-backport-potential
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

After some discussion with libvirt / QEMU folks, it turns out there is something libvirt could do so that the pivot operation (after a blockRebase() has completed) can succeed reliably:

    https://bugzilla.redhat.com/show_bug.cgi?id=1382165 -- virDomainGetBlockJobInfo:
    Adjust job reporting based on QEMU stats & the "ready" field of `query-block-jobs`
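
For reference, the job state libvirt already exposes can be polled from Python via virDomainGetBlockJobInfo. A minimal sketch (domain and disk names are hypothetical); note that cur == end only means the copy has converged, and QEMU's "ready" flag from query-block-jobs is not surfaced here:

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # hypothetical domain name

    # blockJobInfo() wraps virDomainGetBlockJobInfo; for an active job it
    # returns a dict with 'type', 'bandwidth', 'cur' and 'end'.
    info = dom.blockJobInfo('vda', 0)
    if info and info.get('cur') == info.get('end'):
        # The copy has converged, but the job stays active until pivoted.
        print('block copy converged, job still active until pivot completes')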

Also, FWIW, I posted an analysis of the libvirt / QEMU traffic, based on the libvirt debug logs from the failure this bug is reporting:

   http://lists.openstack.org/pipermail/openstack-dev/2016-October/105158.html

Changed in nova:
assignee: Matthew Booth (mbooth-9) → Diana Clarke (diana-clarke)
Changed in nova:
assignee: Diana Clarke (diana-clarke) → Lee Yarwood (lyarwood)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/382449
Reason: Looks abandoned and not needed anymore.
