Comment 2 for bug 698336

Brian Lamar (blamar) wrote :

Currently it seems the compute manager throws away VMs that are in a failed state, which is even worse than what you described a month ago.

The '_poll_instance_states' method in nova/compute/manager.py sets the power_state to power_state.SHUTOFF for any VM which:
    + has a database entry
    + does not have a corresponding VM instance on the hypervisor

The side effect is that every time the compute manager polls for instance status, it sets all power_state.FAILED VMs to power_state.SHUTOFF in the database and then calls self.db.instance_destroy, effectively removing all history of the failed VM.
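
To make the problem concrete, here is a rough, hypothetical sketch of the polling behaviour described above -- it is not the actual Nova code. power_state.SHUTOFF/FAILED and self.db.instance_destroy come straight from this comment; the other helper names (list_instances, instance_get_all_by_host, instance_set_state) are my assumptions for illustration:

    from nova.compute import power_state

    def _poll_instance_states(self, context):
        # Instance names the hypervisor actually knows about (assumed helper).
        vm_names = set(self.driver.list_instances())
        # Every instance this host has a database row for (assumed helper).
        db_instances = self.db.instance_get_all_by_host(context, self.host)

        for instance in db_instances:
            if instance['name'] not in vm_names:
                # A failed spawn leaves a DB row (typically power_state.FAILED)
                # but no hypervisor VM, so it falls into this branch and is
                # destroyed, erasing all record of the failure.
                self.db.instance_set_state(context, instance['id'],
                                           power_state.SHUTOFF)
                self.db.instance_destroy(context, instance['id'])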

In an async system like nova, the failed VM should stick around IMO until it has been explicitly deleted. No automatic retries or anything else for the time being. That way a scheduler can get the 202 Accepted response, and then poll GET /servers/<id> until it sees a success or error status.
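
For illustration, a hypothetical client-side polling loop for that flow -- the URL layout, the BUILD/ACTIVE/ERROR status values, the auth headers, and the use of 'requests' are all assumptions, not part of my branch:

    import time
    import requests

    def wait_for_server(base_url, server_id, headers, interval=5, timeout=600):
        """Poll GET /servers/<id> until the server leaves its build state."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            resp = requests.get("%s/servers/%s" % (base_url, server_id),
                                headers=headers)
            resp.raise_for_status()
            status = resp.json()["server"]["status"]
            if status != "BUILD":  # e.g. ACTIVE on success, ERROR on failure
                return status
            time.sleep(interval)
        raise RuntimeError("server %s did not finish building within %ss"
                           % (server_id, timeout))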

What my branch/fix covers:
    1) Ensures '_poll_instance_states' does not clear out VMs that failed to spawn correctly (see the sketch after this list)
    2) Prettifies libvirt createXML error
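
For point 1, here is a hypothetical sketch of the kind of guard involved, continuing the polling sketch above; the real branch may differ, and treating the DB row's 'state' column as the power_state is an assumption:

    for instance in db_instances:
        if instance['name'] in vm_names:
            continue
        if instance['state'] == power_state.FAILED:
            # Leave failed instances in the database so the user/scheduler can
            # see them; they stay until they are explicitly deleted.
            continue
        self.db.instance_set_state(context, instance['id'],
                                   power_state.SHUTOFF)
        self.db.instance_destroy(context, instance['id'])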

What it doesn't cover:
    1) A way of getting the error back to the user/scheduler. You'll be able to see that the VM failed, but without looking at the logs it will be impossible to tell why. This should be covered by the 'error-codes' blueprint, which will hopefully come along for Diablo.
    2) Automated tests for a scenario such as this