concurrent calls to _bind_port_if_needed results in port stuck in DOWN status

Bug #1755810 reported by Allain Legacy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Undecided
Kailun Qin

Bug Description

There is a concurrency issue in _bind_port_if_needed [1] that is leading to a missing RPC notification which in turn results in a port stuck in DOWN status following a live-migration. If update_port [2] is run concurrently with any other API or RPC that needs to call _bind_port_if_needed then it is possible that the _bind_port_if_needed originating from update_port will fail to send an RPC notification because the PortContext does not have any binding levels which causes _notify_port_updated [3] to suppress the notification.

For example, if get_device_details [4] runs concurrently with update_port it will call get_bound_port_context [5] which will call _bind_port_if_needed(notify=False) while update_port will call _bind_port_if_needed(notify=True). If the call made by update_port commits the port binding first then there is no issue, but if the call made by get_device_details finishes first then there is no RPC notification sent to the agent. If the RPC notification is not sent to the agent then the port will remain stuck in the DOWN status until another port update forces the agent to act. If this happens early in the live-migration there is a possibility that the system will auto-correct if another port_update happens, but if the issue happens on the last port update in the live-migration then the port will remain stuck.

A port stuck in the DOWN status has negative effects on consumers of the L2Population functionality because the L2Population mechanism driver will not be triggered to publish that a port is UP on a given compute node.

The issue coincides with the occurrence of this log:

   2018-03-14 11:16:00.429 19987 INFO neutron.plugins.ml2.plugin [req-ef7e51e2-ef99-48f9-bc6f-45684c0bbce4 b9004dd32a07409787d2cf58f30b5fb8 2c45a0a106574a56bff11c3e83c331a6 - default default] Attempt 2 to bind port ea5e524e-e7d4-4fec-a491-11f80f1de4a7

On the first iteration thru _bind_port_if_needed the context returned by _attempt_binding [6] has proper binding levels set on the PortContext, but the subsequence call to _commit_port_binding [7] replaces the PortContext with a new instance which does not have any binding levels set. That new PortContext is returned and used within _bind_port_if_needed during the second iteration. During that second iteration the call to _attempt_binding returns without doing anything because _should_bind_port [6] returns False. _bind_port_if_needed then proceeds to call _notify_port_updated [3] which does nothing due to the missing binding_levels.

This was discovered by our product test group using a simple setup of 2 compute nodes and a single VM that was being live-migrated between the two nodes. The VM was configured with 3 ports. Over ~1000 live migrations this happened between 5 and 10 times and each time caused loss of communication to the VM instance as the agents were not given the latest L2Population data because the port appeared DOWN in the database. Manual intervention was required to set the port admin_state_up=False and then back to True to trigger an RPC notification to the agent to update the port status to UP.

This was observed in stable/pike but looking at the code in master I don't see that it would behave any differently.

[1] plugins.ml2.plugin.Ml2Plugin#_bind_port_if_needed
[2] plugins.ml2.plugin.Ml2Plugin#update_port
[3] plugins.ml2.plugin.Ml2Plugin#_notify_port_updated
[4] plugins.ml2.rpc.RpcCallbacks#get_device_details
[5] plugins.ml2.plugin.Ml2Plugin#get_bound_port_context
[6] plugins.ml2.plugin.Ml2Plugin#_attempt_binding
[7] plugins.ml2.plugin.Ml2Plugin#_commit_port_binding

zhaobo (zhaobo6)
tags: added: api l2-pop needs-attention
Changed in neutron:
assignee: nobody → Rajat Dhasmana (whoami-rajat)
Revision history for this message
Kailun Qin (kailun.qin) wrote :

@Rajat
Hi Rajat, may I know whether you are still working on this issue? If not, I'd like to take over and make a fix for it. :) Let me know if any question or concern. Thanks!

Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi Kailun Qin, you can take this bug.

Changed in neutron:
assignee: Rajat Dhasmana (whoami-rajat) → nobody
Kailun Qin (kailun.qin)
Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/606827

Changed in neutron:
status: New → In Progress
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Dear Neutron team,
This bug has not yet been triaged since it was reported. A patch is proposed @ https://review.openstack.org/606827 to address the issue.
Would you please kindly take a look at this one?
Much appreciate it.

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Kailun Qin (kailun.qin) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/606827
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/606827
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443
Submitter: Zuul
Branch: master

commit 0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443
Author: Kailun Qin <email address hidden>
Date: Wed May 1 15:50:48 2019 +0800

    Populate binding levels when concurrent ops fail

    Concurrent calls to _bind_port_if_needed may lead to a missing RPC
    notification which can cause a port stuck in a DOWN state. If the only
    caller that succeeds in the concurrency does not specify that an RPC
    notification is allowed then no RPC would be sent to the agent. The
    other caller which needs to send an RPC notification will fail since the
    resulting PortContext instance will not have any binding levels set.

    The failure has negative effects on consumers of the L2Population
    functionality because the L2Population mechanism driver will not be
    triggered to publish that a port is UP on a given compute node. Manual
    intervention is required in this case.

    This patch proposes to handle this by populating the PortContext with
    the current binding levels so that the caller can continue on and have
    an RPC notification sent out.

    Closes-Bug: #1755810
    Story: 2003922
    Change-Id: Ie2b813b2bdf181fb3c24743dbd13487ace6ee76a

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
tags: added: neutron-easy-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/675877

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/675878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/675879

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/675877
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=287ae43e8f2ee27adc3d7f344a3cbd4c9cf72d6e
Submitter: Zuul
Branch: stable/stein

commit 287ae43e8f2ee27adc3d7f344a3cbd4c9cf72d6e
Author: Kailun Qin <email address hidden>
Date: Wed May 1 15:50:48 2019 +0800

    Populate binding levels when concurrent ops fail

    Concurrent calls to _bind_port_if_needed may lead to a missing RPC
    notification which can cause a port stuck in a DOWN state. If the only
    caller that succeeds in the concurrency does not specify that an RPC
    notification is allowed then no RPC would be sent to the agent. The
    other caller which needs to send an RPC notification will fail since the
    resulting PortContext instance will not have any binding levels set.

    The failure has negative effects on consumers of the L2Population
    functionality because the L2Population mechanism driver will not be
    triggered to publish that a port is UP on a given compute node. Manual
    intervention is required in this case.

    This patch proposes to handle this by populating the PortContext with
    the current binding levels so that the caller can continue on and have
    an RPC notification sent out.

    Closes-Bug: #1755810
    Story: 2003922
    Change-Id: Ie2b813b2bdf181fb3c24743dbd13487ace6ee76a
    (cherry picked from commit 0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/675878
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=09b7b2e1ea6f40cd9e6104f1c5fbef31fc3e0545
Submitter: Zuul
Branch: stable/rocky

commit 09b7b2e1ea6f40cd9e6104f1c5fbef31fc3e0545
Author: Kailun Qin <email address hidden>
Date: Wed May 1 15:50:48 2019 +0800

    Populate binding levels when concurrent ops fail

    Concurrent calls to _bind_port_if_needed may lead to a missing RPC
    notification which can cause a port stuck in a DOWN state. If the only
    caller that succeeds in the concurrency does not specify that an RPC
    notification is allowed then no RPC would be sent to the agent. The
    other caller which needs to send an RPC notification will fail since the
    resulting PortContext instance will not have any binding levels set.

    The failure has negative effects on consumers of the L2Population
    functionality because the L2Population mechanism driver will not be
    triggered to publish that a port is UP on a given compute node. Manual
    intervention is required in this case.

    This patch proposes to handle this by populating the PortContext with
    the current binding levels so that the caller can continue on and have
    an RPC notification sent out.

    Closes-Bug: #1755810
    Story: 2003922
    Change-Id: Ie2b813b2bdf181fb3c24743dbd13487ace6ee76a
    (cherry picked from commit 0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/675879
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f7e100d33fa26b1001a8ea978c633fdfaa9d4b75
Submitter: Zuul
Branch: stable/queens

commit f7e100d33fa26b1001a8ea978c633fdfaa9d4b75
Author: Kailun Qin <email address hidden>
Date: Wed May 1 15:50:48 2019 +0800

    Populate binding levels when concurrent ops fail

    Concurrent calls to _bind_port_if_needed may lead to a missing RPC
    notification which can cause a port stuck in a DOWN state. If the only
    caller that succeeds in the concurrency does not specify that an RPC
    notification is allowed then no RPC would be sent to the agent. The
    other caller which needs to send an RPC notification will fail since the
    resulting PortContext instance will not have any binding levels set.

    The failure has negative effects on consumers of the L2Population
    functionality because the L2Population mechanism driver will not be
    triggered to publish that a port is UP on a given compute node. Manual
    intervention is required in this case.

    This patch proposes to handle this by populating the PortContext with
    the current binding levels so that the caller can continue on and have
    an RPC notification sent out.

    Closes-Bug: #1755810
    Story: 2003922
    Change-Id: Ie2b813b2bdf181fb3c24743dbd13487ace6ee76a
    (cherry picked from commit 0dc730c7c0d3f0a49dee28d0d6e7ff9020d94443)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.3

This issue was fixed in the openstack/neutron 14.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.5

This issue was fixed in the openstack/neutron 13.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.1.1

This issue was fixed in the openstack/neutron 12.1.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.