Nova and Cinder get desynced on volume attachments
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | Expired | High | Unassigned | | |
Bug Description
This bug occurred with version 2015.1.0 of both Nova and Cinder installed.
When bulk deleting large numbers of Nova instances, we occasionally encounter something that appears to be a race condition where Cinder believes a volume is detached and available but Nova reports the volume is still attached to an instance.
The example below shows an instance in ERROR state (some nova show output truncated for brevity), followed by Cinder reporting that the volume Nova believes is attached is in fact available:
+------
| Property | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | error |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2015-09-
| flavor | m1.small (2) |
| hostId | e2d6789d6505366
| id | 247d9ebe-
| image | Red Hat Enterprise Linux 7 (current) (297e1979-
| key_name | --- |
| metadata | {} |
| name | r_img_test_z2_1 |
| os-extended-
| security_groups | default |
| status | ERROR |
| tenant_id | a9421c18f6fc48f
| updated | 2015-09-
| user_id | --- |
+------
+------
| ID | Tenant ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+------
| 7adab934-
+------
Attempting to delete the instance while in this state results in the following traceback from nova-client:
File "/usr/
return function(self, context, *args, **kwargs)
File "/usr/
do_terminate_
File "/usr/
return f(*args, **kwargs)
File "/usr/
self.
File "/usr/
six.reraise(
File "/usr/
self.
File "/usr/
rv = f(*args, **kwargs)
File "/usr/
quotas.rollback()
File "/usr/
six.reraise(
File "/usr/
self.
File "/usr/
self.
File "/usr/
res = method(self, ctx, volume_id, *args, **kwargs)
File "/usr/
cinderclient(
File "/usr/
return self._action(
File "/usr/
return self.api.
File "/usr/
return self._cs_
File "/usr/
return self.request(url, method, **kwargs)
File "/usr/
return super(SessionCl
File "/usr/
resp = super(LegacyJso
File "/usr/
return self.session.
File "/usr/
return func(*args, **kwargs)
File "/usr/
raise exceptions.
Cinder logs an exception stating that the volume has no attachments:
2015-09-23 09:49:47.923 42963 ERROR cinder.
2015-09-23 09:49:47.924 42963 ERROR oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
2015-09-23 09:49:47.924 42963 TRACE oslo_messaging.
While in this state, Nova refuses to take any action on the instance, including force deletion. Removing the volume with cinder delete clears the condition: once the offending volume is gone, nova delete removes the instance.
tags: added: compute volumes
Changed in nova:
assignee: nobody → gundarapu kalyan reddy (gundarapu-reddy)
Changed in nova:
assignee: gundarapu kalyan reddy (gundarapu-reddy) → nobody
Changed in nova:
status: New → Confirmed
Changed in nova:
status: Confirmed → Incomplete
Yes, there is a race in this code path. Nikola had expressed an interest in removing the volume state checks in Nova (i.e. check_attach, check_attached, check_detached), which would help eliminate this race. We'll also need some "try:" blocks in the Nova detach_volume code so that a volume Cinder already considers detached does not abort the operation, and there is work in Cinder to use compare-and-swap on DB updates to close the race on the Cinder side. Longer-term changes are also planned to the Cinder API and to how Nova consumes it, which should help as well.
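To make the two ideas in this comment concrete, here is a minimal, self-contained sketch. It is not Nova or Cinder code. The table layout, the function names (cas_status, detach, terminate_instance), the VolumeNotAttached exception and the direct "in-use" to "available" transition are all assumptions made for illustration, and sqlite3 stands in for the real database. It shows a compare-and-swap status update (the write succeeds only if the row still holds the expected status) and a teardown loop that tolerates a volume that is already detached.

```python
import sqlite3


class VolumeNotAttached(Exception):
    """Hypothetical stand-in for the 'volume has no attachments' error."""


def make_db():
    # In-memory database with one attached volume (illustrative schema only).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE volumes (id TEXT PRIMARY KEY, status TEXT)")
    db.execute("INSERT INTO volumes VALUES ('vol-1', 'in-use')")
    db.commit()
    return db


def cas_status(db, volume_id, expected, new):
    """Compare-and-swap: update the status only if it still equals `expected`.

    The check and the write happen in a single UPDATE statement, so two
    concurrent callers cannot both observe 'in-use' and both succeed.
    """
    cur = db.execute(
        "UPDATE volumes SET status = ? WHERE id = ? AND status = ?",
        (new, volume_id, expected),
    )
    db.commit()
    return cur.rowcount == 1


def detach(db, volume_id):
    # Raise if another caller already moved the volume out of 'in-use'.
    if not cas_status(db, volume_id, "in-use", "available"):
        raise VolumeNotAttached(volume_id)


def terminate_instance(db, volume_ids):
    """Detach every volume, skipping those that are already detached."""
    skipped = []
    for volume_id in volume_ids:
        try:
            detach(db, volume_id)
        except VolumeNotAttached:
            # Already detached on the storage side: record it and carry on
            # instead of leaving the instance stuck in ERROR.
            skipped.append(volume_id)
    return skipped


db = make_db()
first = terminate_instance(db, ["vol-1"])   # performs the detach
second = terminate_instance(db, ["vol-1"])  # volume is already 'available'
print(first, second)
```

In this sketch the second teardown loses the compare-and-swap and reports the volume as skipped rather than failing, which is the behaviour the "try:" blocks are meant to provide on the Nova side.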