Comment 2 for bug 1825537

Matt Riedemann (mriedem) wrote:

Thinking of options to fix this when it happens:

1. Rather than revert the allocations, we just drop the allocations on the source node (which the migration record has held since Queens). That's pretty straightforward, I think, and should be OK to backport to Rocky and maybe Queens. Pike gets a bit hairy, though, since the allocations were doubled up on the instance consumer in Pike, so in that case we'd have to remove the allocations against the source node resource provider specifically. The code might already handle that in Pike; I haven't looked yet. This does have the downside of leaving residue on the source compute when the resize fails, but that has kind of always been an issue for a failed resize. A rough sketch of what this could look like is at the end of this comment.

2. Change the instance.host/node back to the source host/node if finish_resize fails. I think this is probably not really an option: finish_resize has never done this on failure, and at that point we don't really know what state the instance, its storage (volumes), or its networking is in, so it's likely best to leave the DB pointing at the dest host/node, where the server can be recovered via hard reboot (or, worst case, rebuild).

3. An alternative to #2 is somehow trigger a revert resize to get back to the source host. Note that you can't attempt to revert a resize on an instance in ERROR status, which is what the state of the server is in when finish_resize fails. Additionally, the revert_resize API method is looking for a migration record with status 'finished' but the migration status would actually be 'error' when finish_resize fails, so trying to revert the resize from the API wouldn't work with the code as it exists today. The compute manager finish_resize method could call the compute manager revert_resize method (avoiding the API) but it's questionable at best how graceful that would be depending on how finish_resize failed.