race condition on port binding vs instance being resumed for live-migrations

Bug #1901707 reported by Tobias Urdin on 2020-10-27
This bug affects 3 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   High         sean mooney
neutron                    Undecided    Rodolfo Alonso

Bug Description

This bug is split out from the discussion in https://bugs.launchpad.net/neutron/+bug/1815989

The comment https://bugs.launchpad.net/neutron/+bug/1815989/comments/52 there goes through in
detail the flow on a Train deployment using neutron 15.1.0 (controller) and 15.3.0 (compute) and nova 20.4.0

There is a race condition where nova live-migration waits for neutron to send the network-vif-plugged event, but when nova receives that event the live migration completes faster than the OVS L2 agent can bind the port on the destination compute node.

This causes the RARP frames that are sent out to update the switches' MAC learning tables to be lost, leaving the instance completely unreachable after a live migration unless those RARP frames are sent again or egress traffic is initiated from the instance.

See Sean's comments below for the view from the Nova side. The correct behavior should be that the port is ready for use when nova gets the external event, but maybe that is not possible from the neutron side; again, see the comments in the other bug.
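
To make the race concrete, here is a minimal, self-contained sketch of the ordering problem (hypothetical helper names, not the actual nova code paths): nova blocks on the network-vif-plugged external event, but receiving that event does not guarantee the destination OVS agent has finished wiring up the port before the instance is resumed.

    import threading

    vif_plugged = threading.Event()

    def on_external_event(event_name):
        # Called when neutron POSTs os-server-external-events to nova.
        if event_name == 'network-vif-plugged':
            vif_plugged.set()

    def live_migrate(resume_instance_on_dest, deadline=300):
        # nova waits here for the plugged event ...
        if not vif_plugged.wait(timeout=deadline):
            raise RuntimeError('timed out waiting for network-vif-plugged')
        # ... but the OVS agent on the destination may still be installing
        # OpenFlow rules at this point, so the RARP frames emitted right
        # after resume can be lost and the instance becomes unreachable.
        resume_instance_on_dest()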

Bence Romsics (bence-romsics) wrote :

I started digesting the linked bug and the material referenced from there, and I find it surprisingly complex. Could we re-state the piece of the problem you want to separate out here, to help somebody take this bug? I don't want to be dense, but I usually find that re-stating a problem in a shorter, simpler way helps solve it.

Is this problem present on master?

Is this dependent on nova using the multiple bindings feature? (I guess yes, because the nova side of that was merged in rocky.)

Is this specific to who plugs the port on the destination host: libvirt and/or os-vif? If yes, which one?

Could we have steps to reproduce this? I get that this is a race, so the reproduction probably won't be 100%. I also get that firewall_driver=iptables_hybrid and live_migration_wait_for_vif_plug=true (default value) are needed. Is there anything else needed to reproduce this bug?

For what it's worth these are the current triggers for neutron to send os-server-external-events to nova:
https://opendev.org/openstack/neutron/src/commit/cbaa328f2ba80ba0af33f43887a040cdd08e508b/neutron/notifiers/nova.py#L102-L103
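
As a rough illustration of why that matters (a hypothetical helper that only loosely paraphrases the linked notifier code, not a verbatim copy): the event nova receives is keyed off the port's status transition, so any transition into ACTIVE looks the same to nova regardless of what caused it.

    VIF_PLUGGED = 'network-vif-plugged'
    VIF_UNPLUGGED = 'network-vif-unplugged'

    def event_for_status_change(previous_status, current_status):
        # Any DOWN -> ACTIVE flip produces the same event, whether it was
        # caused by the destination L2 agent finishing its work, by DHCP
        # provisioning, or by a binding update during live migration.
        if current_status == 'ACTIVE' and previous_status != 'ACTIVE':
            return VIF_PLUGGED
        if current_status == 'DOWN' and previous_status != 'DOWN':
            return VIF_UNPLUGGED
        return None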

I believe the first (and currently only) notification neutron sends is needed and used, so we should not change whether or when that is sent. Is this understanding correct?

Do you believe there should be a 2nd notification sent from neutron to nova? If yes, at what time (triggered by what) should it be sent?

Changed in neutron:
status: New → Incomplete
Tobias Urdin (tobias-urdin) wrote :

I don't have a way to test this with any other version than Train right now; this was not an issue on CentOS 7 with Train, but when we moved to CentOS 8 with Train it started happening.

What I understand from Sean's input is that the behavior has changed in Neutron: before, Neutron would allow two port bindings to be active, so the new port on the destination compute node would already be ready, but with the multiple port bindings feature that is no longer the case.

It's the plugging in Open vSwitch that is the issue, i.e. the port managed by neutron's openvswitch-agent.

IMO there should be an event sent to Nova when the port is fully ready, so that Nova could do the live migration after that, but given that the behavior has changed in Neutron, maybe it's no longer possible or allowed to have two ports configured and active.

I can reproduce this 100% of the time with the versions mentioned. The other bug is primarily about a different issue that occurs when the openvswitch firewall driver is used; this one happens with iptables_hybrid, but the firewall driver doesn't seem to be the cause of the issue either way.

I don't have a good way to go about it: if Sean is right that this is a behavior change in Neutron that cannot be worked around, there isn't much Nova can do. This pretty much breaks the whole purpose of live migration, since we need to carry a custom patch in Nova that makes the VM send out new RARP frames AFTER the live migration (the data plane is therefore dependent on the timing of the control plane running the post_live_migration action in Nova), so we are taking a hit of some extra second(s) of downtime.

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi:

I detected this problem too. The main problem we have in Neutron is that the "network-vif-plugged" event is sent in many situations: when a port is provisioned by the DHCP agent, when the port is bound by the L2 agent, or when the port passes from status DOWN to ACTIVE.

For example, when a port is detected by an OVS agent, the agent binds it to this host and then sends to the server (via RPC) an "update_device_list". The Neutron server receives this list and updates the port status, calling "update_device_up". That calls "update_port_status_to_active" [1], which triggers the port provisioning. This is caught by [2], which updates the port status to ACTIVE. That triggers the Nova notification.
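
A condensed sketch of that chain (illustrative only; the method names below are hypothetical, and the real code is spread across the OVS agent, the ML2 RPC layer and the provisioning-blocks machinery):

    def agent_reports_device_up(core_plugin, nova_notifier, port, host):
        # 1. The OVS agent has wired the port locally and RPCs the server
        #    (update_device_list -> update_device_up on the server side).
        # 2. The server marks the L2 provisioning component as complete.
        core_plugin.provisioning_complete(port, component='L2', host=host)
        # 3. Once every provisioning component (L2, DHCP, ...) is complete,
        #    the port status moves DOWN -> ACTIVE, and it is that status
        #    change which finally triggers the notification to nova.
        if core_plugin.all_components_complete(port):
            port['status'] = 'ACTIVE'
            nova_notifier.send_network_change('network-vif-plugged', port)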

When the port is live migrated, since [3] (live migration with multiple port bindings), the port can have two port binding definitions: the source host (SOURCE) and the destination host (DEST).

The SOURCE binding is, until the migration finishes, active. In its profile (a dictionary field), a new key is added, "migrating_to", with the name of the DEST host.

The DEST binding is inactive. It is activated when the SOURCE binding is deleted from the port.
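
For illustration, a port's two bindings during the migration would look roughly like this (made-up host names, not real API output):

    port_bindings = [
        {   # SOURCE binding: stays active until the migration finishes.
            'host': 'src-compute',
            'status': 'ACTIVE',
            'vif_type': 'ovs',
            'profile': {'migrating_to': 'dst-compute'},
        },
        {   # DEST binding: inactive until the SOURCE binding is removed.
            'host': 'dst-compute',
            'status': 'INACTIVE',
            'vif_type': 'ovs',
            'profile': {},
        },
    ]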

A) EXPLANATION OF THE CONNECTIVITY BREAKDOWN
Now, the DEST port is bound to the host when the DEST binding is activated (as defined in [3]). The problem is that this moment is too late: Nova has already set the ofport of the port (in the case of hybrid plug = False) because it has unpaused the VM in DEST. That means that between the moment the VM is unpaused and the moment the OVS agent binds the port to the host (sets the OpenFlow rules in OVS), there is a window with no connectivity for the instance.

B) EXPLANATION OF THE EVENTS RACE CONDITION
As commented, we are sending the "network-vif-plugged" event on many occasions. But this Nova event, at least during the live migration, is meant to be sent only when the DEST port is bound to the host, that is, when the OVS agent in DEST creates the OpenFlow rules and leaves the port ready to be used. **This happens now by pure chance**: when the port is migrated, the port bindings are first deleted and then updated [4]. That means the port is set to DOWN and then activated again (--> that triggers the first "network-vif-plugged" event). Nova reads this event and unpauses the VM in DEST. So it is just the opposite of what it should be.

There are also other triggers that can send the "network-vif-plugged" event, among others:
1) When the port binding is updated (with the two hosts, SOURCE and DEST), the port is provisioned again by the DHCP agent. This can send this event.
2) When the port binding is updated (first cleared and then set again), the SOURCE OVS agent can read both changes in different polling cycles. That will first unbind the port, sending an update to the server, which will send a "network-vif-unplugged" event. Then the port is bound again, which will trigger a "network-vif-plugged" event.

During the live-migration:
1) We need to catch those events not generated by the OVS SOURCE agent and dismiss them.
2) We need to bind the port to SOURCE **before** the port activation (please read B). Nova is activating the port because other processes are sending the plugged event, but the SOURCE binding process should be the only one sending it.

I'm pushing https://review...


Changed in neutron:
status: Incomplete → In Progress
sean mooney (sean-k-mooney) wrote :

Adding nova, as there is a nova element that needs to be fixed also.

Because nova was observing the network-vif-plugged event from the DHCP agent, we were not filtering our wait condition on live migration to only wait for backends that have plug-time events.

So once this is fixed by Rodolfo's patch it actually breaks live migration, because we are waiting for an event that will never come, until https://review.opendev.org/c/openstack/nova/+/602432 is merged.

For backporting reasons I am working on a separate trivial patch to only wait for backends that send plug-time events. That patch will be backported first, allowing Rodolfo's patch to be backported before https://review.opendev.org/c/openstack/nova/+/602432.
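
The idea of that trivial patch, roughly sketched (a hypothetical helper for illustration, not the merged nova code): only include a VIF in the wait list when its backend actually emits network-vif-plugged at plug time (e.g. hybrid plug), instead of waiting for backends that only emit it once the destination binding is activated.

    def plug_time_events_to_wait_for(network_info):
        # Illustration only: with ovs_hybrid_plug the event arrives when
        # nova plugs the veth on the destination, so it is safe to wait
        # for it before the migration starts; plain OVS ports only get the
        # event once the destination binding is activated, which happens
        # after the migration, so waiting for them would hang.
        events = []
        for vif in network_info:
            if vif.get('details', {}).get('ovs_hybrid_plug'):
                events.append(('network-vif-plugged', vif['id']))
        return events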

I have one unit test left to update in the plug-time patch, and then I'll push it and reference this bug.

Changed in nova:
status: New → Triaged
importance: Undecided → High
assignee: nobody → sean mooney (sean-k-mooney)
sean mooney (sean-k-mooney) wrote :

https://review.opendev.org/c/openstack/nova/+/767368 is the nova bugfix to only wait for plug-time events. Rodolfo, you should be able to add that via a Depends-On in https://review.opendev.org/c/openstack/neutron/+/766277 to get the live migration tests to pass.

Changed in nova:
status: Triaged → In Progress

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Thanks a lot, Sean. Changed Neutron 766277 to depend on Nova 767368.

Tobias Urdin (tobias-urdin) wrote :

Can this be considered solved now that both of the above mentioned patches are merged?

I don't know if [1] should be considered in this bug.

[1] https://review.opendev.org/c/openstack/nova/+/602432

sean mooney (sean-k-mooney) wrote :

https://bugs.launchpad.net/neutron/+bug/1815989 is not fully solved until https://review.opendev.org/c/openstack/nova/+/602432 is merged. The neutron part of this bug is solved, so I guess we could close this, but really we should list which releases we plan to address and use this bug to track some of the backports around this.

I think we are currently up to 6 patches between nova and neutron, possibly 8, to address all aspects of both bugs that we need to backport. There are some ordering dependencies too, so we will have to assess that when determining if we can backport these changes reasonably.

Without https://review.opendev.org/c/openstack/nova/+/602432 the libvirt/neutron race still exists, so you may or may not see an improvement in the RARP behavior, but the neutron race is now resolved if you enable the neutron config option.

tags: added: neutron-proactive-backport-potential