Volume hangs in IN-USE state during detaching

Bug #1586065 reported by Pavel
Affects: Mirantis OpenStack (status tracked in 10.0.x)
  Series  | Status  | Importance | Assigned to | Milestone
  10.0.x  | Invalid | Medium     | Pavel       |
  9.x     | Invalid | Medium     | Pavel       |

Bug Description

Detailed bug description:
Problem discovered on MOS 9.0 ISO #310.

The problem was hit during execution of the Rally scenario CinderVolumes.create_and_attach_volume: one volume hangs in the IN-USE state while being detached from a VM.

Excerpts from the nova-compute and Rally logs are shown below; they show the error raised while the volume is being detached.

Attempting claim:
2016-05-20 13:33:36.083 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Attempting claim: memory 64 MB, disk 0 GB, vcpus 1 CPU
2016-05-20 13:33:36.084 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Total memory: 31979 MB, used: 6656.00 MB
2016-05-20 13:33:36.085 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] memory limit: 31979.00 MB, free: 25323.00 MB
2016-05-20 13:33:36.085 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Total disk: 2791 GB, used: 60.00 GB
2016-05-20 13:33:36.086 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] disk limit: 2791.00 GB, free: 2731.00 GB
2016-05-20 13:33:36.086 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Total vcpu: 12 VCPU, used: 3.00 VCPU
2016-05-20 13:33:36.086 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] vcpu limit: 96.00 VCPU, free: 93.00 VCPU
2016-05-20 13:33:36.102 28838 INFO nova.compute.claims [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Claim successful

Creating image, spawned successfully:
2016-05-20 13:33:39.103 28838 INFO nova.virt.libvirt.driver [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Creating image
2016-05-20 13:34:15.033 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Auditing locally available compute resources for node node-29.domain.tld
2016-05-20 13:34:16.953 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Total usable vcpus: 12, total allocated vcpus: 4
2016-05-20 13:34:16.954 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Final resource view: name=node-29.domain.tld phys_ram=31979MB used_ram=6720MB phys_disk=2791GB used_disk=60GB total_vcpus=12 used_vcpus=4 pci_stats=[]
2016-05-20 13:34:18.547 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Compute_service record updated for node-29.domain.tld:node-29.domain.tld
2016-05-20 13:35:04.322 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Auditing locally available compute resources for node node-29.domain.tld
2016-05-20 13:35:04.683 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Total usable vcpus: 12, total allocated vcpus: 4
2016-05-20 13:35:04.684 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Final resource view: name=node-29.domain.tld phys_ram=31979MB used_ram=6720MB phys_disk=2791GB used_disk=60GB total_vcpus=12 used_vcpus=4 pci_stats=[]
2016-05-20 13:35:04.728 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Compute_service record updated for node-29.domain.tld:node-29.domain.tld
2016-05-20 13:35:33.562 28838 INFO nova.compute.manager [req-b571b420-6dc3-4aa8-9de0-5533e04b7451 - - - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] VM Resumed (Lifecycle Event)
2016-05-20 13:35:33.569 28838 INFO nova.virt.libvirt.driver [-] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Instance spawned successfully.

Time spent building the instance:
2016-05-20 13:35:33.570 28838 INFO nova.compute.manager [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Took 114.47 seconds to spawn the instance on the hypervisor.
2016-05-20 13:35:33.681 28838 INFO nova.compute.manager [req-8d3945e5-f004-4670-a4be-50538f67a934 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Took 117.60 seconds to build instance.

Starting VM:
2016-05-20 13:35:33.763 28838 INFO nova.compute.manager [req-b571b420-6dc3-4aa8-9de0-5533e04b7451 - - - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] VM Started (Lifecycle Event)

Attaching volume:
2016-05-20 13:36:04.229 28838 INFO nova.compute.manager [req-340255d5-1d83-4ebb-acdc-d09718be32db 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Attaching volume 85c29a5c-a719-43d1-9a62-57fa5037bb01 to /dev/vdb
2016-05-20 13:36:06.317 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Auditing locally available compute resources for node node-29.domain.tld
2016-05-20 13:36:06.715 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Total usable vcpus: 12, total allocated vcpus: 4
2016-05-20 13:36:06.716 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Final resource view: name=node-29.domain.tld phys_ram=31979MB used_ram=6720MB phys_disk=2791GB used_disk=60GB total_vcpus=12 used_vcpus=4 pci_stats=[]
2016-05-20 13:36:06.749 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Compute_service record updated for node-29.domain.tld:node-29.domain.tld

Trying to detach volume:
2016-05-20 13:36:26.174 28838 INFO nova.compute.manager [req-8010f3b6-0072-4cdb-a198-7c95f1044681 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Detach volume 85c29a5c-a719-43d1-9a62-57fa5037bb01 from mountpoint /dev/vdb
2016-05-20 13:37:06.319 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Auditing locally available compute resources for node node-29.domain.tld
2016-05-20 13:37:06.669 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Total usable vcpus: 12, total allocated vcpus: 4
2016-05-20 13:37:06.670 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Final resource view: name=node-29.domain.tld phys_ram=31979MB used_ram=6720MB phys_disk=2791GB used_disk=60GB total_vcpus=12 used_vcpus=4 pci_stats=[]
2016-05-20 13:37:06.705 28838 INFO nova.compute.resource_tracker [req-f6c6a127-b56f-49ad-ad42-136c1298639f - - - - -] Compute_service record updated for node-29.domain.tld:node-29.domain.tld

And then the error:
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall._func' failed
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall Traceback (most recent call last):
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 136, in _run_loop
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 377, in _func
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall return self._sleep_time
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall self.force_reraise()
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall six.reraise(self.type_, self.value, self.tb)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 356, in _func
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall result = f(*args, **kwargs)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 342, in _do_wait_and_retry_detach
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall reason=reason)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain.)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [req-8010f3b6-0072-4cdb-a198-7c95f1044681 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Failed to detach volume 85c29a5c-a719-43d1-9a62-57fa5037bb01 from /dev/vdb
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Traceback (most recent call last):
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4775, in _driver_detach_volume
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] encryption=encryption)
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1469, in detach_volume
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] wait_for_detach()
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 385, in func
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] return evt.wait()
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] return hubs.get_hub().switch()
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] return self.greenlet.switch()
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 136, in _run_loop
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] result = func(*self.args, **self.kw)
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 377, in _func
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] return self._sleep_time
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] self.force_reraise()
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] six.reraise(self.type_, self.value, self.tb)
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 356, in _func
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] result = f(*args, **kwargs)
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 342, in _do_wait_and_retry_detach
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] reason=reason)
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain.)
2016-05-20 13:38:07.215 28838 ERROR nova.compute.manager [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327]

Rally test results for this part of the run.

The penultimate response:
RESP BODY: {"volume": {"status": "in-use", "user_id": "51adf8b5919b437c87c375f25c6b31f8", "attachments": [{"server_id": "92e8cec9-8e8f-4089-ac44-5b661c3fc327", "attachment_id": "498a41b6-0b8d-466d-bead-35a67391501c", "attached_at": "2016-05-20T11:36:25.000000", "host_name": null, "volume_id": "85c29a5c-a719-43d1-9a62-57fa5037bb01", "device": "/dev/vdb", "id": "85c29a5c-a719-43d1-9a62-57fa5037bb01"}], "links": [{"href": "http://172.16.45.102:8776/v2/03b73860b54545eaa01b6c9af607597f/volumes/85c29a5c-a719-43d1-9a62-57fa5037bb01", "rel": "self"}, {"href": "http://172.16.45.102:8776/03b73860b54545eaa01b6c9af607597f/volumes/85c29a5c-a719-43d1-9a62-57fa5037bb01", "rel": "bookmark"}], "availability_zone": "nova", "bootable": "false", "encrypted": false, "created_at": "2016-05-20T11:35:35.000000", "description": null, "os-vol-tenant-attr:tenant_id": "03b73860b54545eaa01b6c9af607597f", "updated_at": "2016-05-20T11:38:07.000000", "volume_type": null, "name": "s_rally_d404c8e8_SJMMPWIQ", "replication_status": "disabled", "consistencygroup_id": null, "source_volid": null, "snapshot_id": null, "multiattach": false, "metadata": {"readonly": "False", "attached_mode": "rw"}, "id": "85c29a5c-a719-43d1-9a62-57fa5037bb01", "size": 10}}

And the next response:
REQ: curl -i http://172.16.45.102:8776/v2/93a273ff545c455db6dacb03b13ec042/volumes/d03c6503-2b12-4228-b24d-1e3eb3bf0bb2 -X GET-H 'X-Auth-Project-Id: c_rally_d404c8e8_0DRk15vc'-H 'User-Agent: python-cinderclient'-H 'Accept: application/json'-H 'X-Auth-Token: {SHA1}130f54b6d2f3ad96a145e5756ed7ed75e8081e17'
 http_log_req /opt/stack/.venv/lib/python2.7/site-packages/cinderclient/client.py:236
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner [-] Rally tired waiting for Volume s_rally_d404c8e8_SJMMPWIQ:85c29a5c-a719-43d1-9a62-57fa5037bb01 to become ('AVAILABLE') current status IN-USE
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner Traceback (most recent call last):
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/task/runner.py", line 66, in _run_scenario_once
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner deprecated_output = getattr(scenario_inst, method_name)(**kwargs)
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/common/logging.py", line 266, in wrapper
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner result = f(*args, **kwargs)
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/plugins/openstack/scenarios/cinder/volumes.py", line 316, in create_and_attach_volume
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner self._detach_volume(server, volume)
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/task/atomic.py", line 84, in func_atomic_actions
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner f = func(self, *args, **kwargs)
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/plugins/openstack/scenarios/nova/utils.py", line 699, in _detach_volume
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner check_interval=CONF.benchmark.nova_detach_volume_poll_interval
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/common/logging.py", line 236, in wrapper
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner return f(*args, **kwargs)
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/task/utils.py", line 147, in wait_for
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner check_interval=check_interval)
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner File "/opt/stack/.venv/lib/python2.7/site-packages/rally/task/utils.py", line 246, in wait_for_status
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner resource_status=get_status(resource))
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner TimeoutException: Rally tired waiting for Volume s_rally_d404c8e8_SJMMPWIQ:85c29a5c-a719-43d1-9a62-57fa5037bb01 to become ('AVAILABLE') current status IN-USE
2016-05-20 11:39:46.433 21370 ERROR rally.task.runner
2016-05-20 11:39:46.438 21370 INFO rally.task.runner [-] Task d404c8e8-bd17-45de-9997-f0255c5a70c5 | ITER: 1 END: Error TimeoutException: Rally tired waiting for Volume s_rally_d404c8e8_SJMMPWIQ:85c29a5c-a719-43d1-9a62-57fa5037bb01 to become ('AVAILABLE') current status IN-USE

Back to the nova-compute log.
Terminating the instance:
2016-05-20 13:52:57.383 28838 INFO nova.compute.manager [req-ecc75281-c974-4c8c-a4ca-d9a3a4f56936 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Terminating instance
2016-05-20 13:53:07.598 28838 INFO nova.virt.libvirt.driver [-] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Instance destroyed successfully.
2016-05-20 13:53:07.598 28838 INFO nova.virt.libvirt.driver [-] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Instance destroyed successfully.
2016-05-20 13:53:27.259 28838 INFO nova.compute.manager [-] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] VM Stopped (Lifecycle Event)

Deleting instance:
2016-05-20 13:54:21.226 28838 INFO nova.virt.libvirt.driver [req-ecc75281-c974-4c8c-a4ca-d9a3a4f56936 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Deleting instance files /var/lib/nova/instances/92e8cec9-8e8f-4089-ac44-5b661c3fc327_del
2016-05-20 13:54:21.228 28838 INFO nova.virt.libvirt.driver [req-ecc75281-c974-4c8c-a4ca-d9a3a4f56936 51adf8b5919b437c87c375f25c6b31f8 03b73860b54545eaa01b6c9af607597f - - -] [instance: 92e8cec9-8e8f-4089-ac44-5b661c3fc327] Deletion of /var/lib/nova/instances/92e8cec9-8e8f-4089-ac44-5b661c3fc327_del complete

Steps to reproduce:
 Run the Rally scenario CinderVolumes.create_and_attach_volume while the environment is under load (a sketch of what the scenario does is shown after this list).
Expected results:
 The test passes: the volume is attached and then detached successfully.
Actual result:
 The test fails: the volume is attached but never detached.
Reproducibility:
 Iterations: 6, Failures: 6
Workaround:
 n/a
Impact:
 unknown
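
For reference, a minimal sketch in Python of what the scenario does at this step. The attach_volume, detach_volume and get_volume_status helpers are hypothetical stand-ins for the novaclient/cinderclient calls Rally actually makes; this is not Rally's code, only the attach/poll/detach/poll flow that times out in the log above.

import time

class TimeoutException(Exception):
    pass

def wait_for_volume_status(get_volume_status, volume_id,
                           target, timeout=180, interval=2):
    # Poll the volume until it reaches the target status or time out.
    # In the failed run, the status stays "in-use" and the timeout fires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_volume_status(volume_id).lower() == target:
            return
        time.sleep(interval)
    raise TimeoutException("volume %s did not become %s" % (volume_id, target))

def create_and_attach_volume_flow(attach_volume, detach_volume,
                                  get_volume_status, server_id, volume_id):
    attach_volume(server_id, volume_id)       # nova attaches the volume as /dev/vdb
    wait_for_volume_status(get_volume_status, volume_id, "in-use")
    detach_volume(server_id, volume_id)       # this is the step that hangs here
    wait_for_volume_status(get_volume_status, volume_id, "available")
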
Description of the environment:
 * 50 baremetal nodes:
    - CPU: 12 x 2.10 GHz
    - Disks: 2 drives (SSD - 80 GB, HDD - 931.5 GB), 1006.0 GB total
    - Memory: 2 x 16.0 GB, 32.0 GB total
    - NUMA topology: 1 NUMA node
 * Node roles:
   - 3 controllers (1 is offline because of disk problems)
   - 44 computes
   - 3 Ceph OSD
 * Details:
   - OS on controllers: Mitaka on Ubuntu 14.04
   - OS on computes: RHEL
   - Compute: KVM
   - Neutron with VLAN segmentation
   - Ceph RBD for volumes (Cinder)
   - Ceph RadosGW for objects (Swift API)
   - Ceph RBD for ephemeral volumes (Nova)
   - Ceph RBD for images (Glance)

Additional information:
 Nova compute logs download link: http://mos-scale-share.mirantis.com/pshvetsov/nova-compute.log-20160521
 Rally logs download link: http://mos-scale-share.mirantis.com/pshvetsov/cinder_create-and-attach-volume-rally-stdout.log
 Logs from all compute nodes download link: http://mos-scale-share.mirantis.com/small_logs.tar.bz2
 Diagnostic snapshot (controllers + Ceph only) download link: http://mos-scale-share.mirantis.com/fuel-snapshot-2016-05-25_12-15-36.tar.xz

tags: added: long-haul-testing
Changed in mos:
milestone: none → 9.0
Dina Belova (dbelova)
tags: added: area-cinder
Changed in mos:
assignee: nobody → MOS Cinder (mos-cinder)
importance: Undecided → High
status: New → Confirmed
Ivan Kolodyazhny (e0ne) wrote :

According to nova-compute logs:
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall Traceback (most recent call last):
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 136, in _run_loop
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 377, in _func
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall return self._sleep_time
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall self.force_reraise()
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall six.reraise(self.type_, self.value, self.tb)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 356, in _func
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall result = f(*args, **kwargs)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 342, in _do_wait_and_retry_detach
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall reason=reason)
2016-05-20 13:38:07.213 28838 ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain.)

It looks like a Nova issue.

tags: added: area-nova
Changed in mos:
assignee: MOS Cinder (mos-cinder) → MOS Nova (mos-nova)
tags: removed: area-cinder
Roman Podoliaka (rpodolyaka) wrote :

Pavel, could you please clarify what version of libvirt was used? What image did you use for booting VMs?

I checked the attached logs ( http://mos-scale-share.mirantis.com/small_logs.tar.bz2 ), but unfortunately they are not enough, as we also need the libvirtd and qemu logs: based on the traceback in the Nova logs alone, all we can say is that nova-compute timed out waiting for a disk to be detached from a libvirt domain.

Could you please attach logs from /var/log/libvirtd directory on the affected compute nodes?

Note that we already retry the detach call multiple times in Nova - https://github.com/openstack/nova/blob/master/nova/virt/libvirt/guest.py#L299-L346 (waiting up to 30s), so it's not clear why qemu did not manage to detach the device within that time.
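
For context, a rough sketch of that wait-and-retry pattern, not Nova's actual implementation: detach_device_once and device_still_present are hypothetical stand-ins for the libvirt calls, and the numbers only approximate the ~30s budget mentioned above.

import time

class DeviceDetachFailed(Exception):
    pass

def wait_and_retry_detach(detach_device_once, device_still_present,
                          dev="vdb", attempts=15, interval=2):
    # Ask libvirt to detach the disk, then poll the domain; keep retrying
    # until the device disappears or the attempts run out.
    for _ in range(attempts):
        detach_device_once(dev)
        time.sleep(interval)
        if not device_still_present(dev):
            return                      # detach confirmed
    # This is the point reached in the traceback above:
    raise DeviceDetachFailed(
        "Device detach failed for %s: still present after %d attempts"
        % (dev, attempts))
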

Roman Podoliaka (rpodolyaka) wrote :

At this point I believe this has little to do with long-haul testing; it looks more like a race condition in libvirt or qemu.

Pavel (pshvetsov) wrote :

Hello.

libvirt version was 1.2.17, package: 13.el7_2.4
qemu version: 1.5.3

Unfortunately we don't have libvirt logs for this node.

Dina Belova (dbelova) wrote :

The bug has spent more than a month in the Incomplete state -> marking as Invalid. Please move it back to Confirmed if it is reproduced again.
