Volume transition from attaching -> attached is lost, leading to unattached and undettachable volume

Bug #1254972 reported by Michael Still
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
High
Unassigned
OpenStack Compute (nova)
Fix Released
High
Unassigned

Bug Description

http://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIG1lc3NhZ2U6XCJWb2x1bWUgc3RhdHVzIG11c3QgYmUgYXZhaWxhYmxlIG9yIGVycm9yLCBidXQgY3VycmVudCBzdGF0dXMgaXM6IGF0dGFjaGluZ1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzg1NDQyOTg1OTc0fQ==

Not super common, but this caused at least two failures in merges today.

From console.html:

2013-11-26 04:33:12.467 | ======================================================================
2013-11-26 04:33:12.467 | FAIL: tearDownClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSON)
2013-11-26 04:33:12.467 | tearDownClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSON)
2013-11-26 04:33:12.467 | ----------------------------------------------------------------------
2013-11-26 04:33:12.467 | _StringException: Traceback (most recent call last):
2013-11-26 04:33:12.468 | File "tempest/api/compute/servers/test_server_rescue.py", line 85, in tearDownClass
2013-11-26 04:33:12.468 | client.delete_volume(str(cls.volume_to_detach['id']).strip())
2013-11-26 04:33:12.468 | File "tempest/services/compute/json/volumes_extensions_client.py", line 84, in delete_volume
2013-11-26 04:33:12.468 | return self.delete("os-volumes/%s" % str(volume_id))
2013-11-26 04:33:12.468 | File "tempest/common/rest_client.py", line 308, in delete
2013-11-26 04:33:12.469 | return self.request('DELETE', url, headers)
2013-11-26 04:33:12.469 | File "tempest/common/rest_client.py", line 436, in request
2013-11-26 04:33:12.469 | resp, resp_body)
2013-11-26 04:33:12.469 | File "tempest/common/rest_client.py", line 486, in _error_checker
2013-11-26 04:33:12.469 | raise exceptions.BadRequest(resp_body)
2013-11-26 04:33:12.469 | BadRequest: Bad request
2013-11-26 04:33:12.470 | Details: {u'badRequest': {u'message': u'Invalid input received: Invalid volume: Volume status must be available or error, but current status is: attaching', u'code': 400}}
2013-11-26 04:33:12.470 |
2013-11-26 04:33:12.470 |

Unfortunately I can't find more in the cinder logs because tempest doesn't log the volume id.

Revision history for this message
John Griffith (john-griffith) wrote :

Volume attach is actually a compute task. On the Cinder side we simply wait for Nova to make it's request and update the status in the DB. I'll dig in but I suspect we need to look at compute on this.

tags: added: gate-failure
affects: cinder → nova
Revision history for this message
John Griffith (john-griffith) wrote :

oops... meant to change that to also affects, I'll adjust

Revision history for this message
John Griffith (john-griffith) wrote :

So the stuck in attaching is bothersome, like I said though the problem I have here is that this is stuck waiting for Nova to send the update I believe.

The first thought I had was that we should add the ability to send a force-delete from Nova, however, it occured to me, do we want to continue to support the volume api calls in Nova at all? This particular call is not something that Nova actually needs, this an quite a few others are just still there as backward compat from when we switched from Nova-Volumes. We kept the functionality of the novaclient to do volume-xxxx.

This is actually might be a bit of an issue as most of us on the Cinder side aren't updating/maintaining the volume code in novaclient. Not sure if we want to remove this extension or just update it and modify the tempest test to use a force-delete maybe?

Revision history for this message
John Griffith (john-griffith) wrote :

Not seeing these in logstash since the 23'rd (load issue?).

I suspect it may also have been related to the mount timeout in Nova that we addressed earlier. We should keep it open for now, but if somebody sees this again we need the link to the actual test failure.

The method of submitting the queries is catching multiple results making it kinda confusing to try and debug/triage these.

Revision history for this message
Sean Dague (sdague) wrote :

Current logstash hits for this: http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkludmFsaWQgaW5wdXQgcmVjZWl2ZWQ6IEludmFsaWQgdm9sdW1lOiBWb2x1bWUgc3RhdHVzIG11c3QgYmUgYXZhaWxhYmxlIG9yIGVycm9yLCBidXQgY3VycmVudCBzdGF0dXMgaXM6IGF0dGFjaGluZ1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzk4NjIzNzk5NDYyfQ==

What it looks like is there is a race condition in completing the attachment between Nova and Cinder, not entirely unlike the issue seen between neutron and nova with network attachment. Nikola and I found this in the ec2 paths recently as well.

I realistically believe this is probably one of the key root issues in why the volumes tests end up somewhat unreliable. I suspect that if we really cranked up the tests in this area we'd get a 100% failure rate over these kinds of races.

Changed in cinder:
status: New → Confirmed
importance: Undecided → High
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Sean Dague (sdague)
summary: - Tempest failure due to volume attach being slow
+ Volume transition from attaching -> attached is lost, leading to
+ unattached and undettachable volume
Revision history for this message
Matt Riedemann (mriedem) wrote :

Hmm, we have an e-r query for this but it didn't report on a patch of mine that failed with this.

Looking at the e-r status page though it appears to be hitting on other patches so must have just been a hiccup.

tags: added: testing volumes
Revision history for this message
Joe Gordon (jogo) wrote :

no hits in a while looks this was resolved

Changed in nova:
status: Confirmed → Fix Committed
Changed in cinder:
status: Confirmed → Fix Committed
Thierry Carrez (ttx)
Changed in cinder:
milestone: none → juno-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-3 → 2014.2
Thierry Carrez (ttx)
Changed in cinder:
milestone: juno-3 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.