Comment 5 for bug 1775934

melanie witt (melwitt) wrote :

I was just looking at this bug again and I think I have a theory about how this happened, based on code inspection and a recent IRC discussion with mriedem and dansmith about how instances end up in cell0 without their instance mapping's cell_id set to cell0.

It was mentioned that if the _set_vm_state_and_notify call after instance.create() fails for any reason, we will never update the instance mapping and will also never delete the build request:

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/conductor/manager.py#L849
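
To make the ordering concrete, here is a rough, self-contained sketch of the shape of that path (plain Python with stand-in objects, not the real conductor code; all names are illustrative only). The point is just that an exception from the notify step skips both the mapping update and the build request delete:

# Illustrative stand-ins only, not real Nova objects.
CELL0_UUID = "00000000-0000-0000-0000-000000000000"

class NotifyFailure(Exception):
    pass

def set_vm_state_and_notify(instance):
    # Stand-in for the call above; assume it can raise, e.g. on MQ trouble.
    raise NotifyFailure("MQ trouble")

def bury_in_cell0(instance, instance_mapping, build_request):
    # Simplified shape of the conductor error path being described.
    instance["created_in_cell0"] = True       # instance.create() in cell0
    set_vm_state_and_notify(instance)         # raises -> nothing below runs
    instance_mapping["cell_id"] = CELL0_UUID  # never reached
    build_request["destroyed"] = True         # never reached

instance = {}
mapping = {"cell_id": None}
build_req = {"destroyed": False}
try:
    bury_in_cell0(instance, mapping, build_req)
except NotifyFailure:
    pass

# The broken state this bug describes: the mapping still has cell_id NULL
# and the build request still exists alongside the cell0 instance record.
print(mapping)    # {'cell_id': None}
print(build_req)  # {'destroyed': False}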

Though, looking at set_vm_state_and_notify, I'm not sure how we could fail before the ERROR state update, unless it's possible for rpc.get_notifier to raise in the event of MQ trouble? Maybe we can ask gibi.

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/scheduler/utils.py#L95

https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/rpc.py#L216
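
Purely to illustrate that speculation (hypothetical stand-ins, not the real scheduler utils or rpc code), and assuming get_notifier is called before the instance is saved in ERROR state, a toy ordering like this would explain never reaching the ERROR update:

class MQDown(Exception):
    pass

def get_notifier(service):
    # Stand-in for rpc.get_notifier; the open question above is whether
    # this can raise when the message queue is unhealthy.
    raise MQDown("cannot reach the message queue")

def set_vm_state_and_notify(instance, service="conductor"):
    # Assumed ordering: if get_notifier raises here, the ERROR state
    # update below never happens.
    notifier = get_notifier(service)
    instance["vm_state"] = "error"    # stand-in for instance.save()
    notifier.error("compute_task.build_instances", instance)

instance = {"vm_state": "building"}
try:
    set_vm_state_and_notify(instance)
except MQDown:
    pass
print(instance["vm_state"])  # still 'building', never set to 'error'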

So, if set_vm_state_and_notify fails, we won't put the instance into ERROR state, the instance mapping will have cell_id = NULL, and the build request will still be around for the same instance.

THEN, when a user does a server list, both build requests and instances are obtained from the databases (there will be one build request and one instance record in cell0 for an instance that failed set_vm_state_and_notify). The API then tries to access the flavor attribute of each item in the list, and because the flavor does *not* exist on a build request, the lazy-load of the flavor fails and the server list fails with a 500 error.
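
A toy model of that listing path (again, stand-in classes, not the real API code): the API merges build requests with cell instances, and one item whose flavor can't be loaded is enough to turn the whole list into a 500:

class FlavorNotFound(Exception):
    pass

class CellInstance:
    # Instance pulled from cell0; its flavor loads fine.
    flavor = {"name": "m1.small"}

class BuildRequestEntry:
    @property
    def flavor(self):
        # Stand-in for the failed flavor lazy-load described above.
        raise FlavorNotFound("flavor cannot be lazy-loaded here")

def list_servers(items):
    # One bad item fails the whole "server list" request.
    return [{"flavor": item.flavor} for item in items]

try:
    list_servers([CellInstance(), BuildRequestEntry()])
except FlavorNotFound:
    print("HTTP 500: server list failed")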