Comment 5 for bug 1799152

sean mooney (sean-k-mooney) wrote :

I think this is actually another example of https://bugs.launchpad.net/nova/+bug/1788014,
in that we had a general problem where we treated all events as success.

I agree that it's a bug that we treat it as success and then end up deleting the VM, but I don't agree with the retry. When we get VIR_ERR_OPERATION_INVALID, I think we should fail the migration immediately and roll back without retrying.

The fix for https://bugs.launchpad.net/nova/+bug/1788014 has not been backported to queens yet,
but if we look at the change https://review.opendev.org/#/c/700774/1/nova/virt/libvirt/host.py
the primary thing we are doing is looking at the detail of the event to determine whether the
libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event was a success and signals the completion of the migration, or whether it was an error.
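
To make the idea concrete, here is a rough sketch of "don't treat the event as success on its own". It is not the code from the review. The constants are placeholder strings standing in for the libvirt ones, and using the job status as the confirming detail is an assumption on my part. `Outcome` and `classify_suspended_migrated` are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical result type for this sketch."""
    COMPLETED = "completed"
    FAILED = "failed"
    IGNORED = "ignored"


# Placeholder strings standing in for libvirt constants, so the sketch
# runs without libvirt installed. They are NOT the real numeric values.
SUSPENDED_MIGRATED = "suspended-migrated"
JOB_COMPLETED = "job-completed"
JOB_FAILED = "job-failed"


def classify_suspended_migrated(event, job_type):
    """Decide what a 'suspended for migration' notification means.

    Assumption: job_type is the extra detail consulted to confirm
    that the migration really finished.
    """
    if event != SUSPENDED_MIGRATED:
        # Not the event this sketch is about.
        return Outcome.IGNORED
    if job_type == JOB_COMPLETED:
        # Only now is the event treated as a completed migration.
        return Outcome.COMPLETED
    # Anything else is treated as an error rather than a success.
    return Outcome.FAILED
```

The point is only that the failure branch exists: the event by itself never maps to success.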

By the way, I assume you are referring to https://github.com/openstack/nova/blob/stable/queens/nova/virt/libvirt/migration.py#L240-L258 when you say "nova thinks the domain is shutdown or gone away, so it happily return JobInfo(type=libvirt.VIR_DOMAIN_JOB_COMPLETED),"

On queens we treat all the VIR_DOMAIN_EVENT_* events as success, and our heuristic for determining whether a migration succeeded won't handle the OOM case, so we proceed to post live migration when we should have failed the migration and rolled back. When we receive an invalid operation error from libvirt and call find_job_type, we really should end up taking the exception path and return libvirt.VIR_DOMAIN_JOB_FAILED.

I'm not sure if backporting https://review.opendev.org/#/c/700774/1 to queens would also solve this issue, but I think that is the direction we should go to address this.