Comment 1 for bug 1821755

Matt Riedemann (mriedem) wrote: Re: simultaneous live migrations break the anti-affinity policy of a server group

This is a long-standing known issue, I believe; the same race exists for server build and evacuate (evacuate was fixed later, in Rocky I think). There is a late affinity check in the compute service to catch the race in the scheduler: for server create it reschedules the server to another host, and for evacuate it fails the operation. There is no such late affinity check for the other move operations: live migration, cold migration (resize) or unshelve.
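For context, that late check works roughly like this (a simplified sketch; the names and signatures are illustrative, not nova's actual internals): once the scheduler has picked a host, the compute service re-validates the group membership to catch requests that raced past each other in the scheduler.

```python
class GroupAffinityViolation(Exception):
    pass


def validate_anti_affinity(instance, group, host, get_instance_host):
    """Raise if another member of the anti-affinity group is already on host.

    get_instance_host is a callable mapping an instance UUID to the host
    that instance currently runs on.
    """
    for member_uuid in group.members:
        if member_uuid == instance.uuid:
            continue
        if get_instance_host(member_uuid) == host:
            # For server create this triggers a reschedule to another host;
            # for evacuate the operation just fails.
            raise GroupAffinityViolation(
                'anti-affinity group %s already has a member on host %s'
                % (group.uuid, host))
```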

I believe StarlingX's nova fork has some server group checks in its live migration task, though, so maybe those fixes could be 'upstreamed' to nova:

https://github.com/starlingx-staging/stx-nova/blob/3155137b8a0f00cfdc534e428037e1a06e98b871/nova/conductor/tasks/live_migrate.py#L88

Looking at that StarlingX code, they basically check whether the server being live migrated is in an anti-affinity group and, if so, restrict scheduling via an external lock to one live migration at a time (a rough sketch of the idea follows below). That might be OK in a small edge deployment with 1-2 compute nodes, but it would be pretty severe in a large public cloud with lots of concurrent live migrations. Granted, it's only the scheduling portion of the live migration task that is serialized, not the actual live migration of the guest itself once a target host is selected. I'm also not sure that external lock would be sufficient if you have multiple nova-conductors running on different hosts, unless you were using a distributed lock manager like etcd, which nova upstream does not use (I'm not sure if oslo.concurrency can be configured for etcd under the covers or not).
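Here is a rough sketch of that serialization approach (not the actual StarlingX code; the helper names and the select_destinations signature are illustrative). Note that oslo.concurrency's external locks are file-based, which is exactly why they only cover conductors on a single host:

```python
from oslo_concurrency import lockutils


def _is_anti_affinity(group):
    # Older nova exposed a 'policies' list on InstanceGroup, newer releases
    # a single 'policy' string; check both defensively.
    policies = getattr(group, 'policies', None) or [getattr(group, 'policy', None)]
    return 'anti-affinity' in policies


def select_destination(context, request_spec, scheduler_client):
    group = getattr(request_spec, 'instance_group', None)
    if group is not None and _is_anti_affinity(group):
        # external=True takes a file lock (lock_path must be configured),
        # so this serializes conductor processes on this host only.
        with lockutils.lock('anti-affinity-live-migration', external=True):
            return scheduler_client.select_destinations(context, request_spec)
    return scheduler_client.select_destinations(context, request_spec)
```

For what it's worth, tooz is the usual OpenStack abstraction for distributed locks and it does have an etcd backend, but nova doesn't use tooz today either.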

Long-term, this should all be resolved with placement, once we can model affinity and anti-affinity in the placement service.