Comment 5 for bug 1302774

Nikola Đipanov (ndipanov) wrote :

Looking at this a bit more, this seems like a nova race. The fact that the detach call gets to the compute service while the volume is in the wrong state (which is what the stack trace tells us) makes me think so.

Looking deeper, this seems to be a race condition in the Nova API, actually caused by nova's cinder API wrapper class.

This is probably from the time when Cinder was being split out.

Basically, instead of actually sending a check_detach request to Cinder, nova just checks the status field of the volume dict passed into cinder.API.check_detach. In the case of the EC2 API we first get the volume and then look up its instance, since the EC2 API does not expect the instance to be specified (see nova.api.ec2.cloud.CloudController.detach_volume). That extra database hit, which is serialized due to eventlet in the gate, widens the race window even further.
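
To illustrate what I mean by the check being purely local, here is a rough paraphrase of the wrapper (not verbatim from the tree):

    # Rough paraphrase of nova's cinder wrapper check, not actual tree code.
    # The point is that it only inspects the volume dict the caller already
    # fetched; it never asks Cinder for the current state of the volume.
    class InvalidVolume(Exception):
        pass


    def check_detach(context, volume):
        # 'volume' was read from Cinder earlier in the request; by the time
        # we get here another detach may already be in flight, so this check
        # can pass for both concurrent requests.
        if volume['status'] == "available":
            raise InvalidVolume("volume %s already detached" % volume['id'])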

An easy way to reproduce the issue would be to make nova.compute.api.API.detach_volume artificially take longer for one request and then fire off two detach requests, as sketched below. Both will get to the compute host (the race!), and one of them will stack-trace.
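
Something along these lines should trigger it against a devstack with the EC2 API enabled, assuming a temporary time.sleep() is hacked into nova.compute.api.API.detach_volume to widen the window (the credentials, endpoint and volume id below are placeholders, not anything from the tree):

    # Hypothetical reproducer: credentials, endpoint details and the volume
    # id are placeholders for whatever the deployment under test uses.
    import threading
    import boto

    # Point this at the nova EC2 endpoint of the devstack.
    conn = boto.connect_ec2(aws_access_key_id="EC2_ACCESS",
                            aws_secret_access_key="EC2_SECRET")

    volume_id = "vol-00000001"  # a volume currently attached to an instance

    # Fire two detach requests at (nearly) the same time.  With the sleep in
    # nova.compute.api.API.detach_volume, both pass the stale check_detach()
    # and get cast to the compute host, where the loser blows up with the
    # "wrong state" stack trace.
    threads = [threading.Thread(target=conn.detach_volume, args=(volume_id,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()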

I will submit a nova patch as soon as I figure out the proper checks that need to be replaced by actual API calls.