race condition on port binding vs instance being resumed for live-migrations

Bug #1901707 reported by Tobias Urdin
This bug affects 5 people
Affects                     Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)    Fix Released  High        sean mooney
  Stein                     New           Undecided   Unassigned
  Train                     Fix Released  Undecided   Unassigned
  Ussuri                    Fix Released  Undecided   Unassigned
  Victoria                  Fix Released  Undecided   Unassigned
neutron                     Fix Released  Undecided   Unassigned

Bug Description

This is split out from the discussion in this bug: https://bugs.launchpad.net/neutron/+bug/1815989

There, comment https://bugs.launchpad.net/neutron/+bug/1815989/comments/52 goes through the flow in detail on a Train deployment using neutron 15.1.0 (controller) and 15.3.0 (compute) and nova 20.4.0.

There is a race condition in which nova live-migration waits for neutron to send the network-vif-plugged event, but once nova receives that event the live migration completes faster than the OVS L2 agent can bind the port on the destination compute node.

This causes the RARP frames sent out to update the switches' ARP tables to fail, leaving the instance completely inaccessible after a live migration unless these RARP frames are sent again or egress traffic is initiated from the instance.

See Sean's comments after it for the view from the Nova side. The correct behavior would be for the port to be ready for use when nova gets the external event, but maybe that is not possible from the neutron side; again, see the comments in the other bug.
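
The ordering problem described above can be illustrated with a toy timeline model. This is only a sketch under stated assumptions: the class, field names and time values are hypothetical and are not taken from nova or neutron; the only claim modelled is that RARP frames get through only if the destination port is wired before the guest resumes.

```python
from dataclasses import dataclass


@dataclass
class MigrationTimeline:
    """All times are seconds on one shared clock (illustrative values only)."""
    event_received_at: float  # nova receives network-vif-plugged
    resume_delay: float       # time nova needs to finish and resume the guest
    port_wired_at: float      # OVS agent finishes binding the port on the destination


def guest_resumed_at(t: MigrationTimeline) -> float:
    # The guest resumes on the destination some time after the event arrives.
    return t.event_received_at + t.resume_delay


def rarp_reaches_network(t: MigrationTimeline) -> bool:
    # RARP frames are emitted when the guest resumes; in this model they are
    # only forwarded if the destination port is already wired at that moment.
    return t.port_wired_at <= guest_resumed_at(t)


# Event arrives early, agent wires the port later: the frames are lost.
racy = MigrationTimeline(event_received_at=1.0, resume_delay=0.5, port_wired_at=3.0)
# Event arrives only once the port is wired: the frames get through.
ordered = MigrationTimeline(event_received_at=3.0, resume_delay=0.5, port_wired_at=3.0)

print(rarp_reaches_network(racy))     # False
print(rarp_reaches_network(ordered))  # True
```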

Bence Romsics (bence-romsics) wrote :

I started digesting the linked bug and the material referred to from there, and I find it surprisingly complex. Could we re-state the piece of the problem you want to separate here, to help somebody take this bug? I don't want to be dense, but I usually find that re-stating a problem in a shorter, simpler way helps solve it.

Is this problem present on master?

Is this dependent on nova using the multiple bindings feature? (I guess yes, because the nova side of that was merged in rocky.)

Is this specific to who plugs the port on the destination host: libvirt and/or os-vif? If yes, which one?

Could we have steps to reproduce this? I get that this is a race, so the reproduction probably won't be 100%. I also get that firewall_driver=iptables_hybrid and live_migration_wait_for_vif_plug=true (the default value) are needed. Is there anything else needed to reproduce this bug?

For what it's worth these are the current triggers for neutron to send os-server-external-events to nova:
https://opendev.org/openstack/neutron/src/commit/cbaa328f2ba80ba0af33f43887a040cdd08e508b/neutron/notifiers/nova.py#L102-L103

I believe the first (and currently only) notification neutron sends is needed and used, so we should not change whether or when that is sent. Is this understanding correct?

Do you believe there should be a 2nd notification sent from neutron to nova? If yes, at what time (triggered by what) should it be sent?
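
As a rough aid for the reproduction question above, the two settings named there (firewall_driver=iptables_hybrid and live_migration_wait_for_vif_plug=true) can be checked with a small script. This is a sketch, not a supported tool: the function name and sample file contents are hypothetical, and it assumes the options live in the [securitygroup] section of the OVS agent configuration and the [compute] section of nova.conf.

```python
import configparser

# Sample file contents (hypothetical deployment, for illustration only).
NOVA_CONF = """
[compute]
live_migration_wait_for_vif_plug = true
"""

OVS_AGENT_CONF = """
[securitygroup]
firewall_driver = iptables_hybrid
"""


def matches_reported_setup(nova_text: str, agent_text: str) -> bool:
    """Return True if both settings named in the comment are in effect."""
    nova = configparser.ConfigParser()
    nova.read_string(nova_text)
    agent = configparser.ConfigParser()
    agent.read_string(agent_text)

    # The comment states that true is the default, so fall back to True
    # when the option (or its section) is absent.
    wait_for_plug = nova.getboolean(
        "compute", "live_migration_wait_for_vif_plug", fallback=True)
    driver = agent.get("securitygroup", "firewall_driver", fallback="")
    return wait_for_plug and driver == "iptables_hybrid"


print(matches_reported_setup(NOVA_CONF, OVS_AGENT_CONF))  # True
```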

Changed in neutron:
status: New → Incomplete
Tobias Urdin (tobias-urdin) wrote :

I don't have a way to test this with any version other than Train right now. This was not an issue on CentOS 7 with Train, but it started happening when we moved to CentOS 8 with Train.

What I understand from Sean's input is that the behavior has changed in Neutron: previously Neutron would allow two ports to be active, so the new port on the compute node would already be ready, but with the multiple bindings feature that is no longer the case.

The issue is the plugging in openvswitch, that is, the port managed by neutron's openvswitch-agent.

IMO there should be an event sent to Nova when the port is fully ready so that Nova could do the live migration after that, but given that the behavior has changed in Neutron, maybe it's no longer possible or allowed to have two ports configured and active.

I can reproduce this 100% of the time with the versions mentioned. The other bug is primarily about a different problem that occurs when the openvswitch firewall driver is used; this one occurs when iptables_hybrid is used, but the firewall driver doesn't seem to be the cause of the issue either way.

I don't have a good way to go about it: if, as Sean's comment suggests, this is a behavior change in Neutron that cannot be worked around, there isn't much Nova can do. This pretty much breaks the whole purpose of live-migration, since we need to carry a custom patch in Nova that makes the VM send out new RARP frames AFTER the live migration (the data plane is therefore dependent on the timing of the control plane running the post_live_migration action in Nova), so we are taking a hit of some extra second(s) of downtime.

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi:

I detected this problem too. The main problem we have in Neutron is that the "network-vif-plugged" event is sent in many situations: when a port is provisioned by the DHCP agent, when the port is bound by the L2 agent or when the port passes from status DOWN to ACTIVE.

For example, when a port is detected by an OVS agent, the agent binds it to this host and then sends an "update_device_list" to the server (via RPC). The Neutron server receives this list and updates the port status, calling "update_device_up". That calls "update_port_status_to_active" [1], which triggers the port provisioning. This is caught by [2], which updates the port status to ACTIVE. That triggers the Nova notification.

When the port is live migrated, since [3] (live migration with multiple port bindings), the port can have two port binding definitions: the source host (SOURCE) and the destination host (DEST).

The SOURCE binding is active until the migration finishes. In its profile (a dictionary field), a new key is added: "migrating_to", with the name of the DEST host.

The DEST binding is disabled. It is activated when the SOURCE binding is deleted from the port.
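
The two-binding state described above can be modelled with a small data structure. This is a minimal sketch with hypothetical class and function names, not Neutron's actual objects; the exact spelling of the profile key is an assumption and is kept in one constant so it is easy to adjust.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed spelling of the profile key that names the destination host.
MIGRATING_KEY = "migrating_to"


@dataclass
class Binding:
    host: str
    active: bool
    profile: Dict[str, str] = field(default_factory=dict)


def start_migration(source_host: str, dest_host: str) -> List[Binding]:
    # SOURCE stays active and records where the port is going;
    # DEST exists but is disabled until the migration finishes.
    return [
        Binding(source_host, True, {MIGRATING_KEY: dest_host}),
        Binding(dest_host, False),
    ]


def finish_migration(bindings: List[Binding], dest_host: str) -> List[Binding]:
    # Deleting the SOURCE binding is what activates the DEST binding.
    return [
        Binding(b.host, True, dict(b.profile))
        for b in bindings
        if b.host == dest_host
    ]


during = start_migration("compute-src", "compute-dst")
after = finish_migration(during, "compute-dst")
print([(b.host, b.active) for b in during])
print([(b.host, b.active) for b in after])
```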

A) EXPLANATION OF THE CONNECTIVITY BREAKDOWN
Now, the DEST port is bound to the host when the DEST binding is enabled (as defined in [3]). The problem is that this moment is too late. Nova has already set the ofport of the port (in case of hybrid_plugin=False) because it has unpaused the VM in DEST. That means between the time the VM is unpaused and the time the OVS agent binds the port to the host (sets the OpenFlow rules in OVS), there is

B) EXPLANATION OF THE EVENTS RACE CONDITION
As commented, we are sending the "network-vif-plugged" event on many occasions. But this Nova event, at least during the live-migration, is meant to be sent only when the DEST port is bound to the host, that is, when the OVS agent in DEST creates the OpenFlow rules and leaves the port ready to be used. **This happens now by pure chance**: when the port is migrated, the port bindings are first deleted and then updated [4]. That means the port is set to DOWN and then activated again (--> that triggers the first "network-vif-plugged" event). Nova reads this event and unpauses the VM in DEST. So it is just the opposite of how it should be.

There are also other triggers that can send the "network-vif-plugged" event, in any order.
1) When the port binding is updated (with the two hosts, SOURCE and DEST), the port is provisioned again by the DHCP agent. This can send this event.
2) When the port binding is updated (first cleared and then set again), the SOURCE OVS agent can read both changes in different polling cycles. It will first unbind the port, sending an update to the server, which will send a "network-vif-unplugged" event. Then the port is bound again, which will trigger a "network-vif-plugged" event.

During the live-migration:
1) We need to catch those events not generated by the OVS SOURCE agent and dismiss them.
2) We need to bind the port to SOURCE **before** the port activation (please read B). Nova is activating the port because other processes are sending the plugged event, but the SOURCE binding process should be the only one sending it.
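
The first point in the list above, dismissing plugged events that do not come from the one expected bind while a port is migrating, can be sketched as a simple filter. Everything here is hypothetical: the event class, the origin labels and the function are illustrative and do not correspond to Neutron's notifier code. The origin allowed through is a parameter, so the sketch does not decide which agent that should be.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class PlugEvent:
    port_id: str
    origin: str  # hypothetical labels: "dhcp_provisioning", "source_rebind", "dest_bind"


def events_to_forward(events: Iterable[PlugEvent],
                      migrating_ports: Set[str],
                      allowed_origin: str) -> List[PlugEvent]:
    """Drop plugged events for migrating ports unless they have the allowed origin."""
    forwarded = []
    for event in events:
        if event.port_id in migrating_ports and event.origin != allowed_origin:
            continue  # dismissed: would wake nova up too early
        forwarded.append(event)
    return forwarded


stream = [
    PlugEvent("port-a", "dhcp_provisioning"),
    PlugEvent("port-a", "source_rebind"),
    PlugEvent("port-b", "dhcp_provisioning"),  # not migrating, left untouched
    PlugEvent("port-a", "dest_bind"),
]
kept = events_to_forward(stream, {"port-a"}, allowed_origin="dest_bind")
print([(e.port_id, e.origin) for e in kept])
```

In this run only the "port-b" event and the final "dest_bind" event for "port-a" are forwarded; using "dest_bind" as the allowed origin follows the merged commit message later in this report, which describes sending the event only when the port is bound to the destination host.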

I'm pushing https://review...


Changed in neutron:
status: Incomplete → In Progress
sean mooney (sean-k-mooney) wrote :

adding nova as there is a nova element that needs to be fixed also.

because nova was observing the network-vif-plugged event from the dhcp agent, we were not filtering our wait condition on live migration to only wait for backends that send plug-time events.

so once this is fixed by rodolfo's patch it actually breaks live migration, because we are waiting for an event that will never come until https://review.opendev.org/c/openstack/nova/+/602432 is merged.

for backporting reasons i am working on a separate trivial patch to only wait for backends that send plug-time events. that patch will be backported first, allowing rodolfo's patch to be backported before https://review.opendev.org/c/openstack/nova/+/602432

i have 1 unit test left to update in the plug-time patch and then i'll push it and reference this bug.
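
The idea of waiting only for backends that send plug-time events can be sketched as follows. This is not the nova patch itself: the Vif class, its sends_plug_time_event flag and the function name are hypothetical stand-ins used to show the filtering step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Vif:
    id: str
    # Hypothetical flag: True when the backend emits network-vif-plugged
    # as a result of the VIF being plugged on the destination host.
    sends_plug_time_event: bool


def events_to_wait_for(vifs: List[Vif]) -> List[Tuple[str, str]]:
    """Build the (event name, VIF id) pairs to wait on before migrating."""
    return [
        ("network-vif-plugged", vif.id)
        for vif in vifs
        if vif.sends_plug_time_event
    ]


vifs = [
    Vif("vif-hybrid", sends_plug_time_event=True),
    Vif("vif-noop", sends_plug_time_event=False),
]
print(events_to_wait_for(vifs))  # [('network-vif-plugged', 'vif-hybrid')]
```

Without the filter, the wait list would include "vif-noop", and the migration would block until timeout on an event that never arrives once the extra neutron events are removed.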

Changed in nova:
status: New → Triaged
importance: Undecided → High
assignee: nobody → sean mooney (sean-k-mooney)
sean mooney (sean-k-mooney) wrote :

https://review.opendev.org/c/openstack/nova/+/767368 is the nova bugfix to only wait for plug-time events. rodolfo, you should be able to add that via Depends-On in https://review.opendev.org/c/openstack/neutron/+/766277 to get the live migration tests to pass.

Changed in nova:
status: Triaged → In Progress
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Thanks a lot, Sean. Changed Neutron 766277 to depend on Nova 767368.

Tobias Urdin (tobias-urdin) wrote :

Can this be considered solved now that both of the above-mentioned patches are merged?

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

I don't know if [1] should be considered in this bug.

[1]https://review.opendev.org/c/openstack/nova/+/602432

sean mooney (sean-k-mooney) wrote :

https://bugs.launchpad.net/neutron/+bug/1815989 is not fully solved until https://review.opendev.org/c/openstack/nova/+/602432 is merged. the neutron part of this bug is solved so i guess we could close this, but really we should list which releases we plan to address and use this to track some of the backports around this.

i think we are currently up to 6 patches between nova and neutron, possibly 8, to address all aspects of both bugs that we need to backport. there are some ordering dependencies too, so we will have to assess that when determining if we can backport these changes reasonably.

without https://review.opendev.org/c/openstack/nova/+/602432 the libvirt/neutron race still exists, so you may or may not see an improvement in the RARP behavior, but the neutron race is now resolved if you enable the neutron config option.

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.2.0

This issue was fixed in the openstack/nova 21.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.2.0

This issue was fixed in the openstack/nova 22.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.0.0.0rc1

This issue was fixed in the openstack/neutron 18.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/790702

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/795761

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/790702
Committed: https://opendev.org/openstack/neutron/commit/44847d11ada9f067a0374a349aefefb628df1868
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 44847d11ada9f067a0374a349aefefb628df1868
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Dec 3 11:58:35 2020 +0000

    [OVS] Fix live-migration connection disruption

    The goal of this patch is to avoid the connection disruption during
    the live-migration using OVS. Since [1], when a port is migrated,
    both the source and the destination hosts are added to the profile
    binding information. Initially, the source host binding is activated
    and the destination is deactivated.

    When the port is created in the destination host (created by Nova),
    the port was not configured because the binding was not activated.
    The binding (that means, all the OpenFlow rules) was done when Nova
    sent the port activation. That happened when the VM was already
    running in the destination host. If the OVS agent was loaded, the
    port was bound seconds after the port activation.

    Instead, this patch enables the OpenFlow rule creation in the
    destination host when the port is created.

    Another problem is the "network-vif-plugged" events sent by Neutron
    to Nova to inform about the port binding. Nova is expecting one single
    event informing about the destination port binding. At this moment,
    Nova considers the port is bound and ready to transmit data.

    Several triggers were firing this event unexpectedly:
    - When the port binding was updated, the port is set to down and then
      up again, forcing this event.
    - When the port binding was updated, first the binding is deleted and
      then updated with the new information. That triggers the source
      host to set the port down and then up again, sending the event.

    This patch removes those events, sending the "network-vif-plugged"
    event only when the port is bound to the destination host (and as
    commented before, this is happening now regardless of the binding
    activation status).

    This feature depends on [2]. If this Nova patch is not in place, Nova
    will never plug the port in the destination host and Neutron won't be
    able to send the vif-plugged event to Nova to finish the
    live-migration process.

    Because Neutron cannot query Nova to know if this patch is in
    place, a new temporary configuration option has been created to enable
    this feature. The default value will be "False"; that means Neutron
    will behave as before.

    [1]https://bugs.launchpad.net/neutron/+bug/1580880
    [2]https://review.opendev.org/c/openstack/nova/+/767368

    Closes-Bug: #1901707

    Conflicts:
          zuul.d/tempest-multinode.yaml

    Change-Id: Iee323943ac66e566e5a5e92de1861832e86fc7fc
    (cherry picked from commit f8a22c7d4aa654eaad3b683073849c873ea3beff)
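
The gating described in the commit message above can be sketched as follows. This is an illustration only, with hypothetical function names: it assumes the temporary option is read as live_migration_events from the [nova] section of neutron.conf (as the "nova:live_migration_events" flag mentioned later in this report suggests) and that it defaults to False.

```python
import configparser


def live_migration_events_enabled(neutron_conf_text: str) -> bool:
    """Read the temporary flag; absent means False (behave as before)."""
    parser = configparser.ConfigParser()
    parser.read_string(neutron_conf_text)
    return parser.getboolean("nova", "live_migration_events", fallback=False)


def should_wire_port(binding_active: bool, flag_enabled: bool) -> bool:
    # Old behaviour: only configure the port once its binding is activated.
    # New behaviour (flag on): configure it as soon as the port appears on
    # the destination host, regardless of the binding activation status.
    return binding_active or flag_enabled


conf = "[nova]\nlive_migration_events = True\n"
flag = live_migration_events_enabled(conf)
print(should_wire_port(binding_active=False, flag_enabled=flag))  # True
```

With the flag left at its default, an inactive destination binding is not wired, which matches the pre-patch behaviour the commit message describes.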

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/795761
Committed: https://opendev.org/openstack/neutron/commit/48af145d95dcc50897b504a8bbff44c3906a112b
Submitter: "Zuul (22348)"
Branch: master

commit 48af145d95dcc50897b504a8bbff44c3906a112b
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Jun 10 11:13:10 2021 +0000

    Add "nova:live_migration_events" flag to subnode in multinode CI job

    Added "nova:live_migration_events" flag to subnode neutron.conf file in
    "neutron-tempest-multinode-full-py3" CI job. That flag was missing in
    the patch implementing this feature [1].

    [1]https://review.opendev.org/c/openstack/neutron/+/766277

    Change-Id: Idc938a1dc9de3ad77f558df4f4fb4ae5c3de3d21
    Related-Bug: #1901707

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.2.0

This issue was fixed in the openstack/neutron 17.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/770844
Committed: https://opendev.org/openstack/nova/commit/c0a36d917794fed77e75ba9ed853c01a77b540bd
Submitter: "Zuul (22348)"
Branch: stable/train

commit c0a36d917794fed77e75ba9ed853c01a77b540bd
Author: Sean Mooney <email address hidden>
Date: Wed Dec 16 13:12:13 2020 +0000

    only wait for plugtime events in pre-live-migration

    This change modifies _get_neutron_events_for_live_migration
    to filter the event to just the subset that will be sent
    at plug-time.

    Currently neutron has a bug whereby the dhcp agent
    sends a network-vif-plugged event during live migration after
    we update the port profile with "migrating-to:".
    This causes a network-vif-plugged event to be sent for
    configurations where vif_plugging in nova/os-vif is a noop.

    When that is corrected, the current logic in nova causes the migration
    to time out as it's waiting for an event that will never arrive.

    This change filters the set of events we wait for to just the plug
    time events.

    Conflicts:
        nova/compute/manager.py
        nova/tests/unit/compute/test_compute_mgr.py

    Related-Bug: #1815989
    Closes-Bug: #1901707
    Change-Id: Id2d8d72d30075200d2b07b847c4e5568599b0d3b
    (cherry picked from commit 8b33ac064456482158b23c2a2d52f819ebb4c60e)
    (cherry picked from commit ef348c4eb3379189f290217c9351157b1ebf0adb)
    (cherry picked from commit d9c833d5a404dfa206e08c97543e80cb613b3f0b)

tags: added: in-stable-train
Changed in neutron:
status: In Progress → Fix Released
tags: removed: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/820682

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/821443

Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → nobody
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ussuri)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/821443
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/840448

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/840448
Committed: https://opendev.org/openstack/neutron/commit/9025f8a571029ce41d815c2704e29956b31f7f1f
Submitter: "Zuul (22348)"
Branch: master

commit 9025f8a571029ce41d815c2704e29956b31f7f1f
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Sun Apr 24 00:19:25 2022 +0000

    Remove "live_migration_events" configuration option

    This option was introduced in [1]. This option depended on [2],
    the Nova code enabling this feature, that filters the
    "vif-plugged-event" to be sent to Nova.

    Now the default behaviour is "True".

    Related-Bug: #1901707

    [1]https://review.opendev.org/c/openstack/neutron/+/766277
    [2]https://review.opendev.org/c/openstack/nova/+/767368

    Change-Id: I05f7e6a7d91f6a4a1fe6d4765589f30257243628

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/820682
Reason: This branch transitioned to End of Life for this project; open patches need to be closed to be able to delete the branch.

王道远 (wangdaoyuan)
Changed in nova:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova train-eol

This issue was fixed in the openstack/nova train-eol release.
