[L3] existing router resources are partial deleted unexpectedly when MQ is gone

Bug #1871850 reported by LIU Yulong
This bug affects 1 person
Affects                   Importance   Assigned to
Ubuntu Cloud Archive      Undecided    Unassigned
  Queens                  Critical     Unassigned
  Rocky                   Critical     Unassigned
  Stein                   Undecided    Unassigned
  Train                   Undecided    Unassigned
  Ussuri                  Undecided    Unassigned
neutron                   Critical     Brian Haley
neutron (Ubuntu)          Undecided    Unassigned
  Bionic                  Critical     Trent Lloyd

Bug Description

(For SRU template, please see bug 1869808, as the SRU info there applies to this bug also)

ENV: we hit this issue on our stable/queens deployment, but the master branch has the same code logic.

When the L3 agent gets a router update notification, it tries to retrieve the router info from the DB server [1]. If the message queue is down or unreachable at that moment, the call fails with message-queue-related exceptions, and a resync action is then run [2]. In my experience, a RabbitMQ cluster is sometimes not easy to recover. A long MQ recovery time means the router info sync RPC never succeeds before it reaches the maximum retry count [3]. Then the bad thing happens: the L3 agent tries to remove the router [4], which basically shuts down all existing L3 traffic of this router.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L705
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L710
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L666
[4] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L671
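
To make the sequence above easier to follow, here is a minimal, self-contained sketch of the retry flow. It is not the neutron code: `ToyL3Agent`, `MessageQueueDown`, `MAX_RETRIES = 5` and the `remove_on_exhaustion` flag are all hypothetical names and values chosen for illustration. With the flag set to True the sketch models the behaviour reported here (the running router is dropped once the retries are exhausted); with False it models leaving the router as it is.

```python
class MessageQueueDown(Exception):
    """Stand-in for the messaging errors raised while the MQ is unreachable."""


class ToyL3Agent:
    """Hypothetical, heavily simplified model of the retry flow (not neutron code)."""

    MAX_RETRIES = 5  # assumed value, for illustration only

    def __init__(self, fetch_router, remove_on_exhaustion):
        self.fetch_router = fetch_router  # callable(router_id) -> dict, may raise
        self.remove_on_exhaustion = remove_on_exhaustion
        self.running_routers = {}  # router_id -> router info currently served
        self.retries = {}  # router_id -> failed attempts so far

    def process_update(self, router_id):
        """Handle one router update; return True if it must be re-queued."""
        try:
            info = self.fetch_router(router_id)
        except MessageQueueDown:
            self.retries[router_id] = self.retries.get(router_id, 0) + 1
            if self.retries[router_id] < self.MAX_RETRIES:
                return True  # resync: try the same update again later
            # Retry budget exhausted.
            self.retries.pop(router_id, None)
            if self.remove_on_exhaustion:
                # Reported behaviour: the still-working router is torn down.
                self.running_routers.pop(router_id, None)
            return False
        # Sync succeeded: reset the counter and keep serving the router.
        self.retries.pop(router_id, None)
        self.running_routers[router_id] = info
        return False


def simulate(remove_on_exhaustion):
    """Run updates for router 'r1' while the MQ stays down the whole time."""

    def mq_down(router_id):
        raise MessageQueueDown(router_id)

    agent = ToyL3Agent(mq_down, remove_on_exhaustion)
    agent.running_routers["r1"] = {"id": "r1"}  # router already carrying traffic
    while agent.process_update("r1"):
        pass
    return "r1" in agent.running_routers
```

In this toy model, `simulate(True)` ends with the router gone even though nothing was wrong with the router itself, while `simulate(False)` leaves it running. What the real agent does after giving up (for example, whether it waits for a later full sync) is outside the scope of this sketch.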

tags: added: l3-dvr-backlog
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/719127

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: Confirmed → In Progress
Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/719127
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=12b9149e20665d80c11f1ef3d2283e1fa6f3b693
Submitter: Zuul
Branch: master

commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693
Author: LIU Yulong <email address hidden>
Date: Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable

    When the L3 agent get a router update notification, it will try to
    retrieve the router info from neutron server. But at this time, if
    the message queue is down/unreachable. It will get exceptions related
    message queue. The resync actions will be run then. Sometimes, rabbitMQ
    cluster is not so much easy to recover. Then Long time MQ recover time
    will cause the router info sync RPC never get successful until it meets
    the max retry time. Then the bad thing happens, L3 agent is trying to
    remove the router now. It basically shutdown all the existing L3 traffic
    of this router.

    This patch directly removes the final router removal action, let the
    router run as it is.

    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/748123

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/748124

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/748125

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/748126

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/748127

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/748126
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5ee377952badd94d08425aab41853916092acd07
Submitter: Zuul
Branch: stable/rocky

commit 5ee377952badd94d08425aab41853916092acd07
Author: LIU Yulong <email address hidden>
Date: Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable

    When the L3 agent get a router update notification, it will try to
    retrieve the router info from neutron server. But at this time, if
    the message queue is down/unreachable. It will get exceptions related
    message queue. The resync actions will be run then. Sometimes, rabbitMQ
    cluster is not so much easy to recover. Then Long time MQ recover time
    will cause the router info sync RPC never get successful until it meets
    the max retry time. Then the bad thing happens, L3 agent is trying to
    remove the router now. It basically shutdown all the existing L3 traffic
    of this router.

    This patch directly removes the final router removal action, let the
    router run as it is.

    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
    (cherry picked from commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/748125
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=71f22834f2240834ca591e27a920f9444bac9689
Submitter: Zuul
Branch: stable/stein

commit 71f22834f2240834ca591e27a920f9444bac9689
Author: LIU Yulong <email address hidden>
Date: Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable

    When the L3 agent get a router update notification, it will try to
    retrieve the router info from neutron server. But at this time, if
    the message queue is down/unreachable. It will get exceptions related
    message queue. The resync actions will be run then. Sometimes, rabbitMQ
    cluster is not so much easy to recover. Then Long time MQ recover time
    will cause the router info sync RPC never get successful until it meets
    the max retry time. Then the bad thing happens, L3 agent is trying to
    remove the router now. It basically shutdown all the existing L3 traffic
    of this router.

    This patch directly removes the final router removal action, let the
    router run as it is.

    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
    (cherry picked from commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/748124
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a96ad52c7e57664c63e3675b64718c5a288946fb
Submitter: Zuul
Branch: stable/train

commit a96ad52c7e57664c63e3675b64718c5a288946fb
Author: LIU Yulong <email address hidden>
Date: Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable

    When the L3 agent get a router update notification, it will try to
    retrieve the router info from neutron server. But at this time, if
    the message queue is down/unreachable. It will get exceptions related
    message queue. The resync actions will be run then. Sometimes, rabbitMQ
    cluster is not so much easy to recover. Then Long time MQ recover time
    will cause the router info sync RPC never get successful until it meets
    the max retry time. Then the bad thing happens, L3 agent is trying to
    remove the router now. It basically shutdown all the existing L3 traffic
    of this router.

    This patch directly removes the final router removal action, let the
    router run as it is.

    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
    (cherry picked from commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/748127
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ec6c98060d78c97edf6382ede977209f007fdb81
Submitter: Zuul
Branch: stable/queens

commit ec6c98060d78c97edf6382ede977209f007fdb81
Author: LIU Yulong <email address hidden>
Date: Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable

    When the L3 agent get a router update notification, it will try to
    retrieve the router info from neutron server. But at this time, if
    the message queue is down/unreachable. It will get exceptions related
    message queue. The resync actions will be run then. Sometimes, rabbitMQ
    cluster is not so much easy to recover. Then Long time MQ recover time
    will cause the router info sync RPC never get successful until it meets
    the max retry time. Then the bad thing happens, L3 agent is trying to
    remove the router now. It basically shutdown all the existing L3 traffic
    of this router.

    This patch directly removes the final router removal action, let the
    router run as it is.

    Conflicts:
            neutron/agent/l3/agent.py
            neutron/tests/unit/agent/l3/test_agent.py

    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
    (cherry picked from commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/748123
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5eeb98cdb51dc0dadd43128d1d0ed7d497606ded
Submitter: Zuul
Branch: stable/ussuri

commit 5eeb98cdb51dc0dadd43128d1d0ed7d497606ded
Author: LIU Yulong <email address hidden>
Date: Sat Apr 11 08:41:28 2020 +0800

    Not remove the running router when MQ is unreachable

    When the L3 agent get a router update notification, it will try to
    retrieve the router info from neutron server. But at this time, if
    the message queue is down/unreachable. It will get exceptions related
    message queue. The resync actions will be run then. Sometimes, rabbitMQ
    cluster is not so much easy to recover. Then Long time MQ recover time
    will cause the router info sync RPC never get successful until it meets
    the max retry time. Then the bad thing happens, L3 agent is trying to
    remove the router now. It basically shutdown all the existing L3 traffic
    of this router.

    This patch directly removes the final router removal action, let the
    router run as it is.

    Closes-Bug: #1871850
    Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
    (cherry picked from commit 12b9149e20665d80c11f1ef3d2283e1fa6f3b693)

tags: added: in-stable-ussuri
tags: removed: neutron-proactive-backport-potential
summary: - [L3] existing router resources are partial deleted unexceptedly when MQ
+ [L3] existing router resources are partial deleted unexpectedly when MQ
is gone
Changed in cloud-archive:
status: New → Invalid
Dan Streetman (ddstreet)
Changed in neutron (Ubuntu):
status: New → Fix Released
Changed in neutron (Ubuntu Bionic):
assignee: nobody → Trent Lloyd (lathiat)
importance: Undecided → Critical
status: New → In Progress
description: updated
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello LIU, or anyone else affected,

Accepted neutron into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
Łukasz Zemczak (sil2100) wrote :

Hello LIU, or anyone else affected,

Accepted neutron into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:12.1.1-0ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Corey Bryant (corey.bryant) wrote :

Hello LIU, or anyone else affected,

Accepted neutron into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Edward Hope-Morley (hopem) wrote :

All SRU verification completed and performed in https://bugs.launchpad.net/neutron/+bug/1869808 so please refer to that LP for the results.

tags: added: verification-done verification-done-bionic verification-queens-done verification-rocky-done
removed: verification-needed verification-needed-bionic verification-queens-needed verification-rocky-needed
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:12.1.1-0ubuntu4

---------------
neutron (2:12.1.1-0ubuntu4) bionic; urgency=medium

  * Fix interrupt of VLAN traffic on reboot of neutron-ovs-agent:
  - d/p/0001-ovs-agent-signal-to-plugin-if-tunnel-refresh-needed.patch (LP: #1853613)
  - d/p/0002-Do-not-block-connection-between-br-int-and-br-phys-o.patch (LP: #1869808)
  - d/p/0003-Ensure-that-stale-flows-are-cleaned-from-phys_bridge.patch (LP: #1864822)
  - d/p/0004-DVR-Reconfigure-re-created-physical-bridges-for-dvr-.patch (LP: #1864822)
  - d/p/0005-Ensure-drop-flows-on-br-int-at-agent-startup-for-DVR.patch (LP: #1887148)
  - d/p/0006-Don-t-check-if-any-bridges-were-recrected-when-OVS-w.patch (LP: #1864822)
  - d/p/0007-Not-remove-the-running-router-when-MQ-is-unreachable.patch (LP: #1871850)

 -- Edward Hope-Morley <email address hidden> Mon, 22 Feb 2021 16:55:40 +0000

Changed in neutron (Ubuntu Bionic):
status: Fix Committed → Fix Released
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:12.1.1-0ubuntu4~cloud0
---------------

 neutron (2:12.1.1-0ubuntu4~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:12.1.1-0ubuntu4) bionic; urgency=medium
 .
   * Fix interrupt of VLAN traffic on reboot of neutron-ovs-agent:
   - d/p/0001-ovs-agent-signal-to-plugin-if-tunnel-refresh-needed.patch (LP: #1853613)
   - d/p/0002-Do-not-block-connection-between-br-int-and-br-phys-o.patch (LP: #1869808)
   - d/p/0003-Ensure-that-stale-flows-are-cleaned-from-phys_bridge.patch (LP: #1864822)
   - d/p/0004-DVR-Reconfigure-re-created-physical-bridges-for-dvr-.patch (LP: #1864822)
   - d/p/0005-Ensure-drop-flows-on-br-int-at-agent-startup-for-DVR.patch (LP: #1887148)
   - d/p/0006-Don-t-check-if-any-bridges-were-recrected-when-OVS-w.patch (LP: #1864822)
   - d/p/0007-Not-remove-the-running-router-when-MQ-is-unreachable.patch (LP: #1871850)
