ceph: volume detach fails with "libvirtError: operation failed: disk vdb not found"

Bug #1642689 reported by Matt Riedemann
Affects                   Status         Importance  Assigned to      Milestone
OpenStack Compute (nova)  Fix Released   Medium      int32bit
Newton                    Fix Committed  Medium      Matt Riedemann

Bug Description

Seeing this failure in this job:

tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_snapshot_create_with_volume_in_use[compute,id-b467b54c-07a4-446d-a1cf-651dedcc3ff1]

While detaching a volume on cleanup, the test times out waiting for the volume status to go from 'in-use' back to 'available':

2016-11-17 11:58:59.110301 | Captured traceback:
2016-11-17 11:58:59.110310 | ~~~~~~~~~~~~~~~~~~~
2016-11-17 11:58:59.110322 | Traceback (most recent call last):
2016-11-17 11:58:59.110342 | File "tempest/common/waiters.py", line 189, in wait_for_volume_status
2016-11-17 11:58:59.110356 | raise lib_exc.TimeoutException(message)
2016-11-17 11:58:59.110373 | tempest.lib.exceptions.TimeoutException: Request timed out
2016-11-17 11:58:59.110405 | Details: Volume db12eda4-4ce6-4f00-a4e0-9df115f230e5 failed to reach available status (current in-use) within the required time (196 s).
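
For context, the waiter that times out here simply polls the volume status until it reaches 'available' or the deadline passes. A minimal sketch of that kind of polling loop, with `client.show_volume()` as a hypothetical stand-in for the tempest volumes client:

    import time

    def wait_for_volume_status(client, volume_id, expected='available',
                               timeout=196, interval=1):
        # Minimal sketch of a volume status waiter (hypothetical client API).
        deadline = time.time() + timeout
        while True:
            status = client.show_volume(volume_id)['volume']['status']
            if status == expected:
                return
            if time.time() > deadline:
                raise TimeoutError(
                    'Volume %s failed to reach %s status (current %s) '
                    'within the required time (%s s).'
                    % (volume_id, expected, status, timeout))
            time.sleep(interval)

The volume never leaves 'in-use' because the detach fails on the compute side, as the logs below show.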

The volume detach request is here:

2016-11-17 11:58:59.031058 | 2016-11-17 11:38:55,018 8316 INFO [tempest.lib.common.rest_client] Request (VolumesV1SnapshotTestJSON:_run_cleanups): 202 DELETE http://127.0.0.1:8774/v2.1/servers/584a65b5-07fa-4994-a2d5-1676d0e13a8c/os-volume_attachments/db12eda4-4ce6-4f00-a4e0-9df115f230e5 0.277s
2016-11-17 11:58:59.031103 | 2016-11-17 11:38:55,018 8316 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2016-11-17 11:58:59.031113 | Body: None
2016-11-17 11:58:59.031235 | Response - Headers: {'content-location': 'http://127.0.0.1:8774/v2.1/servers/584a65b5-07fa-4994-a2d5-1676d0e13a8c/os-volume_attachments/db12eda4-4ce6-4f00-a4e0-9df115f230e5', 'content-type': 'application/json', 'x-openstack-nova-api-version': '2.1', 'date': 'Thu, 17 Nov 2016 11:38:55 GMT', 'content-length': '0', 'status': '202', 'connection': 'close', 'x-compute-request-id': 'req-9f0541d3-6eec-4793-8852-7bd01708932e', 'openstack-api-version': 'compute 2.1', 'vary': 'X-OpenStack-Nova-API-Version'}
2016-11-17 11:58:59.031248 | Body:

Following the req-9f0541d3-6eec-4793-8852-7bd01708932e request ID into the compute logs, we see this detach failure:

http://logs.openstack.org/00/398800/1/gate/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/a387fb0/logs/screen-n-cpu.txt.gz?level=TRACE#_2016-11-17_11_39_00_649

2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [req-9f0541d3-6eec-4793-8852-7bd01708932e tempest-VolumesV1SnapshotTestJSON-1819335716 tempest-VolumesV1SnapshotTestJSON-1819335716] [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] Failed to detach volume db12eda4-4ce6-4f00-a4e0-9df115f230e5 from /dev/vdb
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] Traceback (most recent call last):
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/opt/stack/new/nova/nova/compute/manager.py", line 4757, in _driver_detach_volume
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] encryption=encryption)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1307, in detach_volume
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] wait_for_detach()
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 385, in func
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] return evt.wait()
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] return hubs.get_hub().switch()
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] return self.greenlet.switch()
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 136, in _run_loop
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] result = func(*self.args, **self.kw)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 356, in _func
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] result = f(*args, **kwargs)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 380, in _do_wait_and_retry_detach
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] self.detach_device(config, persistent=False, live=live)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 401, in detach_device
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] self._domain.detachDeviceFlags(device_xml, flags=flags)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] result = proxy_call(self._autowrap, f, *args, **kwargs)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] rv = execute(f, *args, **kwargs)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] six.reraise(c, e, tb)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] rv = meth(*args, **kwargs)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1190, in detachDeviceFlags
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] if ret == -1: raise libvirtError ('virDomainDetachDeviceFlags() failed', dom=self)
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c] libvirtError: operation failed: disk vdb not found
2016-11-17 11:39:00.649 2249 ERROR nova.compute.manager [instance: 584a65b5-07fa-4994-a2d5-1676d0e13a8c]

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Failed%20to%20detach%20volume%5C%22%20AND%20message%3A%5C%22_do_wait_and_retry_detach%5C%22%20AND%20message%3A%5C%22libvirtError%3A%20operation%20failed%3A%20disk%5C%22%20AND%20message%3A%5C%22not%20found%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

21 hits in the last 7 days, in both the check and gate queues, all failures, about 90% of them in ceph jobs.

It looks like the wait-for-detach code should handle the 'not found' error from libvirt and consider the job complete.
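
The error in the trace is classifiable: libvirt raises VIR_ERR_OPERATION_FAILED with a 'not found' message when the disk is already gone, which on a detach retry means the detach actually succeeded. A hedged sketch of such a check (not the actual patch; `_is_disk_not_found()` is a hypothetical helper):

    import libvirt

    def _is_disk_not_found(ex):
        # Sketch: a 'disk ... not found' failure on a detach retry means
        # the device is already gone, i.e. the detach has completed.
        return (isinstance(ex, libvirt.libvirtError)
                and ex.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED
                and 'not found' in str(ex))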

int32bit (int32bit)
Changed in nova:
assignee: nobody → int32bit (int32bit)
Revision history for this message
Matt Riedemann (mriedem) wrote :

@int32bit, are you working on a fix for this? If not, please remove yourself as assignee so someone else can try to fix this.

Revision history for this message
int32bit (int32bit) wrote :

Yes, I'm working on it, but if someone has already fixed it, please feel free to re-assign it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/401375

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: int32bit (int32bit) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → int32bit (int32bit)
Changed in nova:
assignee: int32bit (int32bit) → melanie witt (melwitt)
melanie witt (melwitt)
Changed in nova:
assignee: melanie witt (melwitt) → int32bit (int32bit)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/404508

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/401375
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d079f37f7ccb8da9e74416cedce4f194b66ee1d0
Submitter: Jenkins
Branch: master

commit d079f37f7ccb8da9e74416cedce4f194b66ee1d0
Author: int32bit <email address hidden>
Date: Wed Nov 23 18:21:16 2016 +0800

    Fix wait for detach code to handle 'disk not found error'

    Currently, the code does an initial detach from the persistent and
    transient domains in one call, then follows up with a call of the
    retryable function, which first checks the domain xml before retrying
    the transient domain detach. If the transient domain detach (which is
    asynchronous) completes after we checked the domain xml, trying to
    detach it again will raise a libvirtError exception, which our loop
    code in `_do_wait_and_retry_detach` doesn't catch, so the detach
    fails with an error. We should handle the 'disk not found' error
    from libvirt and consider the job complete.

    Closes-Bug: #1642689

    Change-Id: I131aaf28d2f5d5d964d4045e3d7d62207079cfb0
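
To illustrate the race the commit message describes, here is a rough sketch of one iteration of the retry step (hypothetical names; the real code lives in nova/virt/libvirt/guest.py):

    import libvirt
    from oslo_service import loopingcall

    def _retry_detach_step(domain, device_xml, dev_alias):
        # Sketch only: check the live (transient) domain config, then
        # retry the detach. The detach is asynchronous, so the device can
        # vanish between the XML check and the detach call below.
        if dev_alias not in domain.XMLDesc(0):
            # Device already gone from the live config: detach finished.
            raise loopingcall.LoopingCallDone()
        try:
            domain.detachDeviceFlags(device_xml,
                                     flags=libvirt.VIR_DOMAIN_AFFECT_LIVE)
        except libvirt.libvirtError as ex:
            if (ex.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED
                    and 'not found' in str(ex)):
                # The async detach completed after our XML check; treat
                # the job as done instead of letting the error bubble up
                # (the pre-fix behavior seen in the traceback above).
                raise loopingcall.LoopingCallDone()
            raise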

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/404508
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=93599acb968e15b3bd977dafc294e1b48d50c8f0
Submitter: Jenkins
Branch: stable/newton

commit 93599acb968e15b3bd977dafc294e1b48d50c8f0
Author: int32bit <email address hidden>
Date: Wed Nov 23 18:21:16 2016 +0800

    Fix wait for detach code to handle 'disk not found error'

    Currently, the code does an initial detach from the persistent and
    transient domains in one call, then follows up with a call of the
    retryable function, which first checks the domain xml before retrying
    the transient domain detach. If the transient domain detach (which is
    asynchronous) completes after we checked the domain xml, trying to
    detach it again will raise a libvirtError exception, which our loop
    code in `_do_wait_and_retry_detach` doesn't catch, so the detach
    fails with an error. We should handle the 'disk not found' error
    from libvirt and consider the job complete.

    Closes-Bug: #1642689

    Change-Id: I131aaf28d2f5d5d964d4045e3d7d62207079cfb0
    (cherry picked from commit d079f37f7ccb8da9e74416cedce4f194b66ee1d0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b2

This issue was fixed in the openstack/nova 15.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.3

This issue was fixed in the openstack/nova 14.0.3 release.
