I was just looking at this bug again and I think I have a theory how this happened, based on code inspection and a recent IRC discussion with mriedem and dansmith about how instances end up in cell0 without their instance mapping cell_id set to cell0.
There was mention that if the _set_vm_state_and_notify call after instance.create() fails for any reason, we would never update the instance mapping and would also not delete the build request:
https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/conductor/manager.py#L849
Though, looking at set_vm_state_and_notify, I'm not sure how we could fail before the ERROR state update, unless it's possible for rpc.get_notifier to raise in the event of MQ trouble? Maybe we can ask gibi.
https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/scheduler/utils.py#L95
https://github.com/openstack/nova/blob/125dd1f30fdaf50182256c56808a5199856383c7/nova/rpc.py#L216
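To make the ordering concrete, here is a minimal sketch (not the actual Nova code; MQDown and the function bodies are made up for illustration) of how an exception raised while obtaining the notifier would skip the ERROR state update entirely:

```python
# Illustrative sketch: if getting the notifier raises before the ERROR
# state is saved, the instance never reaches ERROR, and the caller's
# cleanup (mapping update, build-request delete) is skipped too.

class MQDown(Exception):
    pass

def get_notifier():
    # Stand-in for rpc.get_notifier(); assume the MQ is unreachable.
    raise MQDown("message queue unreachable")

def set_vm_state_and_notify(instance):
    notifier = get_notifier()        # raises here...
    instance["vm_state"] = "error"   # ...so this line never runs
    notifier.error(instance)

instance = {"vm_state": "building"}
try:
    set_vm_state_and_notify(instance)
except MQDown:
    pass

# The instance is stranded in 'building' instead of 'error'.
print(instance["vm_state"])  # building
```

Whether rpc.get_notifier can actually raise on MQ trouble is exactly the open question above; the sketch only shows what would happen if it did.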
So, if set_vm_state_and_notify fails, we won't put the instance into error state, the instance mapping will have cell_id = NULL, and the build request will still be around for the same instance.
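A hypothetical snapshot of the records left behind in that state (field names simplified, not the actual Nova schemas) would look like:

```python
# Hypothetical leftover records after the failure described above:
instance_mapping = {"instance_uuid": "aaaa-1111", "cell_id": None}  # never pointed at cell0
build_request = {"instance_uuid": "aaaa-1111"}                      # never deleted
cell0_instance = {"uuid": "aaaa-1111", "vm_state": "building"}      # never set to ERROR

# The same instance is now represented twice: once as a lingering
# build request and once as a cell0 record, which sets up the
# duplicate that a later server list will trip over.
assert instance_mapping["cell_id"] is None
assert build_request["instance_uuid"] == cell0_instance["uuid"]
```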
THEN, when a user does a server list, both build requests and instances are obtained from the databases, so there will be one build request *and* one instance record in cell0 for an instance that failed in set_vm_state_and_notify. The API then tries to access the flavor attribute for each item in the list, but the flavor does *not* exist on a build request, so the lazy-load of the flavor fails and the server list fails with the 500 error.
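The listing failure can be sketched like this (toy classes standing in for Nova's objects; in real Nova the missing flavor surfaces as a failed lazy-load on the object rather than a plain AttributeError):

```python
# Illustrative sketch: a stale build request lingering alongside the
# cell0 instance record for the same server. The cell0 record carries
# a flavor; the build request does not, so formatting the merged list
# blows up, which the API surfaces as a 500.

class Cell0Instance:
    def __init__(self, uuid):
        self.uuid = uuid
        self.flavor = {"name": "m1.small"}

class StaleBuildRequest:
    def __init__(self, uuid):
        self.uuid = uuid
        # no flavor attribute; in Nova this is a failed lazy-load

def list_servers(build_requests, instances):
    # Merge both sources, touching .flavor on every item.
    return [{"uuid": s.uuid, "flavor": s.flavor}
            for s in build_requests + instances]

uuid = "aaaa-1111"
try:
    list_servers([StaleBuildRequest(uuid)], [Cell0Instance(uuid)])
except AttributeError as exc:
    print("500:", exc)
```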