Comment 52 for bug 1815989

Tobias Urdin (tobias-urdin) wrote :

The race also exists for iptables_hybrid-based deployments when live_migration_wait_for_vif_plug=true (the default value).

The source compute node does the pre live migration and waits for the network-vif-plugged event, which it receives. The live migration then starts and the VM is resumed on the destination, but the destination compute node binds the port AFTER the instance has already been resumed.

To understand the race, here is the sequence of events as seen in the logs.

* The pre live migration is complete when the source compute node gets the network-vif-plugged event

2020-10-23 10:39:31.634 3460854 INFO nova.compute.manager [-] [instance: 9ef8fcee-c1cf-4d2e-8b14-2b43c31044f6] Took 2.83 seconds for pre_live_migration on destination host compute-02.
2020-10-23 10:39:32.200 3460854 DEBUG nova.compute.manager [req-7f2c3034-c0b4-4e6b-9209-280638dcd2e1 6283ca84a2ff4cc099fcfd8e50550910 3a28d0f6b65a44c2aa1bbffbfa8bb2ea - default default] [instance: 9ef8fcee-c1cf-4d2e-8b14-2b43c31044f6] Received event network-vif-plugged-f83f20ad-feff-4369-a752-a81964bcfd52 external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9273

* Then the instance is resumed on the destination compute node

2020-10-23 10:39:35.467 2082170 INFO nova.compute.manager [req-5c20ab33-21eb-48b8-950f-85807ebc1559 - - - - -] [instance: 9ef8fcee-c1cf-4d2e-8b14-2b43c31044f6] VM Resumed (Lifecycle Event)

* But the port is not really updated or fixed on the destination compute node until after that

2020-10-23 10:39:37.504 2096718 DEBUG neutron.agent.resource_cache [req-3b8c2e3f-4e62-446b-b9db-f6bf12012ab0 f1fc63f1306549a0b1aba80875aac683 3a28d0f6b65a44c2aa1bbffbfa8bb2ea - - -] Resource Port f83f20ad-feff-4369-a752-a81964bcfd52 updated <a lot of port binding data here>

The port then goes through another round of updates at

2020-10-23 10:39:37.859 2096718 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-d0d6ab14-b56e-4d91-9030-7f422465f628 - - - - -] Port f83f20ad-feff-4369-a752-a81964bcfd52 updated

and the line reporting that configuration is complete does not appear until

2020-10-23 10:39:38.572 2096718 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-d0d6ab14-b56e-4d91-9030-7f422465f628 - - - - -] Configuration for devices up ['f83f20ad-feff-4369-a752-a81964bcfd52'] and devices down [] completed.

This means there is a race window of about 3 seconds between when the instance is resumed (10:39:35.467) and when the port is actually bound and configured (10:39:38.572).
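For anyone who wants to check the window, here is a small sketch that computes the offsets from the timestamps quoted in the log lines above. The event labels and the seconds_between helper are only illustrative names I made up, not anything from nova or neutron.

```python
from datetime import datetime

# Timestamp layout used by the nova/neutron log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log excerpts; the labels are informal descriptions.
events = [
    ("network-vif-plugged received on source", "2020-10-23 10:39:32.200"),
    ("VM resumed on destination", "2020-10-23 10:39:35.467"),
    ("port update seen by neutron agent", "2020-10-23 10:39:37.504"),
    ("OVS agent reports devices up", "2020-10-23 10:39:38.572"),
]


def seconds_between(start, end):
    """Return the number of seconds from start to end (negative if end is earlier)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Print every event relative to the moment the VM was resumed.
resumed = events[1][1]
for label, stamp in events:
    print(f"{seconds_between(resumed, stamp):+7.3f}s  {label}")
```

Relative to the resume, the port update shows up about 2 seconds later and the "devices up" line about 3.1 seconds later, which is the window where the instance is running without working networking.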

Now the question: nova is properly waiting for the network-vif-plugged event, but that is not really the point at which the port is ready. Is there another event that we should or could wait for, or is this a neutron issue in the end?