snat arp entry missing in qrouter namespace

Bug #1933092 reported by Hemanth Nakkina
This bug affects 2 people
Affects                                        Importance  Assigned to
Ubuntu Cloud Archive (status tracked in Xena)
  Ussuri                                       Undecided   Unassigned
  Victoria                                     Undecided   Unassigned
  Wallaby                                      Undecided   Unassigned
  Xena                                         Undecided   Unassigned
neutron                                        Undecided   Hemanth Nakkina
neutron (Ubuntu) (status tracked in Impish)
  Focal                                        Medium      Unassigned
  Groovy                                       Medium      Unassigned
  Hirsute                                      Medium      Unassigned
  Impish                                       Medium      Unassigned

Bug Description

[Impact]
Load Balancers deployed on the cloud are unreachable

[Test Case]
1. Deploy OpenStack with at least 4 compute nodes and the networking features DVR SNAT + L3HA enabled.
2. Execute the script test_snat_arp_entry.sh.
3. The script loops 20 times. Each iteration creates a network and a router, connects the router to the external and internal networks, and checks whether the ARP entries are populated properly in the qrouter namespaces.
4. The script stops if ARP entries are missing.
5. If the script completes all 20 loops, there are no issues.

[Regression Potential]
The issue occurs only occasionally, when a router is created, its external gateway is set, and an internal subnet is attached to it in quick succession. In all other cases the SNAT ARP entry is already added.
The fix only adds logic that retrieves the SNAT port information from the router and adds its ARP entry. In the working cases this extra logic runs the commands to add the ARP entry a second time, which should not cause further issues.
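
The regression argument rests on the extra ARP update being idempotent. A minimal sketch of that reasoning follows; it is not the actual patch. Here `ports_by_subnet` stands in for the result of get_ports_by_subnet(), `snat_ports` for the SNAT port details taken from the router info, and the dictionary `neigh_table` is a hypothetical stand-in for the neighbour table of the qrouter namespace. The addresses are copied from the ip neigh output in this report.

```python
def update_arp_entries(neigh_table, ports_by_subnet, snat_ports):
    """Write a permanent entry for every port; return the number of writes issued."""
    writes = 0
    for port in list(ports_by_subnet) + list(snat_ports):
        # Writing the same (ip, mac) pair again just overwrites it with the same value.
        neigh_table[port["ip"]] = (port["mac"], "PERMANENT")
        writes += 1
    return writes


other_port = {"ip": "192.168.33.2", "mac": "fa:16:3e:4a:1e:a4"}
snat_port = {"ip": "192.168.33.238", "mac": "fa:16:3e:25:6a:73"}

# Non-working case: the SNAT port is still unbound, so only the router info has it.
racy = {}
update_arp_entries(racy, [other_port], [snat_port])

# Working case: the SNAT port arrives from both sources and is written twice.
normal = {}
writes = update_arp_entries(normal, [other_port, snat_port], [snat_port])

print(racy == normal, writes)
```

Both cases end with the same table; the working case only issues one redundant write.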

[Original Bug Report]
In one cloud environment, the FIP attached to the Octavia load balancer VIP is not reachable. After analysis, we found that the ARP entry for the SNAT IP is missing in the qrouter namespace on the node where the Amphora VM is running. As a result, the return packets are not forwarded from the qrouter to SNAT on the active l3-agent node.

Version:
Ubuntu Ussuri packages (16.3.2 point release)
DVR+SNAT+L3HA enabled

The expectation is a PERMANENT ARP entry for the SNAT IP in the qrouter namespace on all compute nodes, for example:
192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 used 38/38/38 probes 0 PERMANENT
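
A check for this expectation could look roughly like the sketch below. It assumes the `ip -s neigh show` output has already been captured as text; the helper names `parse_neigh_line` and `has_permanent_entry` are hypothetical and are not taken from the attached script.

```python
def parse_neigh_line(line):
    """Split one `ip -s neigh show` line into ip, device, MAC (or None) and state."""
    tokens = line.split()
    entry = {"ip": tokens[0], "dev": None, "lladdr": None, "state": tokens[-1]}
    for key in ("dev", "lladdr"):
        if key in tokens:
            entry[key] = tokens[tokens.index(key) + 1]
    return entry


def has_permanent_entry(neigh_output, ip):
    """Return True if `ip` appears in the output with state PERMANENT."""
    for line in neigh_output.splitlines():
        if not line.strip():
            continue
        entry = parse_neigh_line(line)
        if entry["ip"] == ip and entry["state"] == "PERMANENT":
            return True
    return False


expected = ("192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 "
            "used 38/38/38 probes 0 PERMANENT")
ok = has_permanent_entry(expected, "192.168.33.238")
print(ok)
```

An entry without an `lladdr` field, such as one in FAILED state, is parsed with `lladdr` set to None and does not count as present.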

How to reproduce:

I am attaching a script that simulates the problem (without Octavia) using the following steps:
1. Create a network, subnet and router, and attach the network to the router.
2. Verify that the qrouter namespace on every compute node has the ARP entry for the SNAT IP.
3. If the ARP entries exist, delete the network, subnet and router.
4. Repeat steps 1, 2 and 3 until a missing ARP entry is observed.
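
The control flow of these steps can be sketched as follows. This is only an outline, not the attached script: `create`, `check` and `delete` are hypothetical callables that would wrap the real OpenStack and `ip neigh` commands, and the default of 20 loops is borrowed from the test case in this report.

```python
def run_reproducer(create, check, delete, max_loops=20):
    """Return the 1-based loop in which `check` first fails, or None if it never does."""
    for loop in range(1, max_loops + 1):
        resources = create()          # step 1: network, subnet, router, attach
        if not check(resources):      # step 2: ARP entry present on every node?
            return loop               # stop and keep the resources for debugging
        delete(resources)             # step 3: clean up before the next attempt
    return None


# Stand-in callables that simulate the entry going missing in the third loop.
created, deleted = [], []


def fake_create():
    created.append(len(created) + 1)
    return created[-1]


def fake_check(resources):
    return resources != 3


failing_loop = run_reproducer(fake_create, fake_check, deleted.append)
print(failing_loop, deleted)
```

Returning without cleanup on failure keeps the broken router available for inspection.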

I can reproduce the missing ARP entry, sometimes in the 3rd loop and sometimes in the 6th.

I observed that the ARP entries for the SNAT IP are updated at the following places [1] [2], but in the non-working cases get_snat_interfaces() and get_ports_by_subnet() do not include the SNAT IP.

[1] https://opendev.org/openstack/neutron/src/commit/dfd04115b059c2263cdd8ac44ccc2ec47614bcc3/neutron/agent/l3/dvr_local_router.py#L570
[2] https://opendev.org/openstack/neutron/src/commit/dfd04115b059c2263cdd8ac44ccc2ec47614bcc3/neutron/agent/l3/dvr_local_router.py#L317

Hemanth Nakkina (hemanth-n) wrote :

Test script to verify whether ARP entries for the SNAT IP exist

description: updated
Hemanth Nakkina (hemanth-n) wrote :

Here is the ip neigh show output for the non-working scenario (the entry for SNAT IP 192.168.33.238 is missing on nova-compute/2 and nova-compute/3):

$ juju run -a nova-compute -- sudo ip netns exec qrouter-db77f537-ca82-4d25-8539-d7891e2d6b8a ip -s neigh show
- Stdout: |
    169.254.103.137 dev rfp-db77f537-c lladdr 76:31:59:17:f2:b5 used 3215/3215/3215 probes 0 PERMANENT
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3216/3216/3216 probes 0 PERMANENT
    192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 used 3214/3214/3214 probes 0 PERMANENT
  UnitId: nova-compute/0
- Stdout: |
    192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 used 3214/3214/3214 probes 0 PERMANENT
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3218/3218/3218 probes 0 PERMANENT
    169.254.75.95 dev rfp-db77f537-c lladdr 1a:f5:09:ef:2f:b1 used 3215/3215/3215 probes 0 PERMANENT
  UnitId: nova-compute/1
- Stdout: |
    169.254.75.95 dev rfp-db77f537-c lladdr 4a:6e:d0:af:88:10 used 3226/3226/3226 probes 0 PERMANENT
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3214/3214/3214 probes 0 PERMANENT
  UnitId: nova-compute/2
- Stdout: |
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3218/3218/3218 probes 0 PERMANENT
    169.254.75.95 dev rfp-db77f537-c lladdr a2:63:61:ca:14:31 used 3229/3229/3229 probes 0 PERMANENT
  UnitId: nova-compute/3
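
Spotting the affected units by eye gets tedious with more compute nodes. The sketch below shows one way to do it programmatically, assuming the `juju run` output has the `- Stdout: |` / `UnitId:` layout shown above; the function names are hypothetical and `SAMPLE` is a trimmed copy of that output.

```python
def entries_by_unit(juju_output):
    """Map each UnitId to the set of neighbour IPs printed in its Stdout block."""
    result, pending = {}, set()
    for raw in juju_output.splitlines():
        line = raw.strip()
        if line.startswith("UnitId:"):
            # UnitId closes the block; attach the collected IPs to this unit.
            result[line.split(":", 1)[1].strip()] = pending
            pending = set()
        elif line and not line.startswith("- Stdout"):
            pending.add(line.split()[0])
    return result


def units_missing(juju_output, ip):
    """Return the sorted unit names whose neighbour table lacks `ip`."""
    units = entries_by_unit(juju_output)
    return sorted(unit for unit, ips in units.items() if ip not in ips)


SAMPLE = """\
- Stdout: |
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3216/3216/3216 probes 0 PERMANENT
    192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 used 3214/3214/3214 probes 0 PERMANENT
  UnitId: nova-compute/0
- Stdout: |
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3214/3214/3214 probes 0 PERMANENT
  UnitId: nova-compute/2
"""

missing = units_missing(SAMPLE, "192.168.33.238")
print(missing)
```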

I launched a test VM on nova-compute/2. The VM got the IP 192.168.33.171 and the corresponding ARP entry was added. At this point an entry for the SNAT IP is also added on that node, but it is in the FAILED state.

$ juju run -a nova-compute -- sudo ip netns exec qrouter-db77f537-ca82-4d25-8539-d7891e2d6b8a ip -s neigh show
- Stdout: |
    169.254.103.137 dev rfp-db77f537-c lladdr 76:31:59:17:f2:b5 used 3292/3292/3292 probes 0 PERMANENT
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3293/3293/3293 probes 0 PERMANENT
    192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 used 9/3291/3291 probes 0 PERMANENT
    fe80::f816:3eff:fe1a:3c82 dev qr-4ee692e0-7a lladdr fa:16:3e:1a:3c:82 used 12/72/12 probes 0 STALE
  UnitId: nova-compute/0
- Stdout: |
    192.168.33.238 dev qr-4ee692e0-7a lladdr fa:16:3e:25:6a:73 used 3291/3291/3291 probes 0 PERMANENT
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3296/3296/3296 probes 0 PERMANENT
    169.254.75.95 dev rfp-db77f537-c lladdr 1a:f5:09:ef:2f:b1 used 3293/3293/3293 probes 0 PERMANENT
  UnitId: nova-compute/1
- Stdout: |
    192.168.33.238 dev qr-4ee692e0-7a used 6/69/6 probes 6 FAILED
    192.168.33.171 dev qr-4ee692e0-7a lladdr fa:16:3e:1a:3c:82 ref 1 used 6/2/2 probes 1 REACHABLE
    169.254.75.95 dev rfp-db77f537-c lladdr 4a:6e:d0:af:88:10 used 3303/3303/3303 probes 0 PERMANENT
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3291/3291/3291 probes 0 PERMANENT
  UnitId: nova-compute/2
- Stdout: |
    192.168.33.2 dev qr-4ee692e0-7a lladdr fa:16:3e:4a:1e:a4 used 3295/3295/3295 probes 0 PERMANENT
    169.254.75.95 dev rfp-db77f537-c lladdr a2:63:61:ca:14:31 used 3306/3306/3306 probes 0 PERMANENT
  UnitId: nova-compute/3

Hemanth Nakkina (hemanth-n) wrote :

After looking closer, the problem is in step 1 (network/subnet/router is created, network attached to router).

The reproducer script does not wait, before attaching the network to the router, until the router is active with 1 active and 2 standby l3 agents. The SNAT ARP entry is therefore not populated when the network is attached before the router has completed l3 agent assignment and HA setup.

After adding a sleep of 60 seconds before attaching the network to the router (long enough in my environment for the router to finish HA setup), I no longer see the missing SNAT ARP entry.
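
A fixed sleep depends on the environment. An alternative for the reproducer is to poll until the router is ready, sketched below under the assumption that some predicate can report HA readiness (for example, one active and two standby l3 agents). `wait_until` and the predicate are hypothetical names; no real Neutron call is shown.

```python
import time


def wait_until(predicate, timeout=60.0, interval=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in predicate: reports "HA ready" on the third poll.
polls = iter([False, False, True])
ready = wait_until(lambda: next(polls), timeout=5.0, interval=0.0)
print(ready)
```

The network would be attached to the router only after `wait_until` returns True; a False result means the router never reached the expected HA state within the timeout.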

I am leaving this bug open to see whether it is worth exploring sending a post update after router HA is set up, so that the SNAT ARP entry gets updated.

Akihiro Motoki (amotoki)
tags: added: l3-dvr-backlog
Slawek Kaplonski (slaweq) wrote :

I have an environment for debugging this, so I will try to reproduce it and at least triage it.

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/799197

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Hemanth Nakkina (hemanth-n)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/799197
Committed: https://opendev.org/openstack/neutron/commit/be7d0bb6abc893e53dfc864c52506928b1d38fa3
Submitter: "Zuul (22348)"
Branch: master

commit be7d0bb6abc893e53dfc864c52506928b1d38fa3
Author: Hemanth Nakkina <email address hidden>
Date: Fri Jul 2 17:01:55 2021 +0530

    Update arp entry of snat port on qrouter ns

    In some cases, the arp entry of snat port is not updated
    in qrouter namespace. l3-agent calls get_ports_by_subnet()
    while setting arps for the subnet. And the snat port is
    not returned if it is still unbound. One of the scenario
    this is observed is when router is created, external
    gateway set and internal subnet attached to router in
    quick succession.

    This patch retrieves snat port details from router info
    as well and updates arp entry for snat port.

    Closes-Bug: #1933092
    Change-Id: I7ee797b4b930306cf6360922d855f8b24f1b813d

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/799375

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/799476

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/799478

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/799375
Committed: https://opendev.org/openstack/neutron/commit/85a668dd1629d71f52ff20da3d848e2cbb397e43
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 85a668dd1629d71f52ff20da3d848e2cbb397e43
Author: Hemanth Nakkina <email address hidden>
Date: Fri Jul 2 17:01:55 2021 +0530

    Update arp entry of snat port on qrouter ns

    In some cases, the arp entry of snat port is not updated
    in qrouter namespace. l3-agent calls get_ports_by_subnet()
    while setting arps for the subnet. And the snat port is
    not returned if it is still unbound. One of the scenario
    this is observed is when router is created, external
    gateway set and internal subnet attached to router in
    quick succession.

    This patch retrieves snat port details from router info
    as well and updates arp entry for snat port.

    Closes-Bug: #1933092
    Change-Id: I7ee797b4b930306cf6360922d855f8b24f1b813d
    (cherry picked from commit be7d0bb6abc893e53dfc864c52506928b1d38fa3)

tags: added: in-stable-wallaby
tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/799476
Committed: https://opendev.org/openstack/neutron/commit/8a43eb4563e787ea43e15cb49968818059a34c2b
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 8a43eb4563e787ea43e15cb49968818059a34c2b
Author: Hemanth Nakkina <email address hidden>
Date: Fri Jul 2 17:01:55 2021 +0530

    Update arp entry of snat port on qrouter ns

    In some cases, the arp entry of snat port is not updated
    in qrouter namespace. l3-agent calls get_ports_by_subnet()
    while setting arps for the subnet. And the snat port is
    not returned if it is still unbound. One of the scenario
    this is observed is when router is created, external
    gateway set and internal subnet attached to router in
    quick succession.

    This patch retrieves snat port details from router info
    as well and updates arp entry for snat port.

    Closes-Bug: #1933092
    Change-Id: I7ee797b4b930306cf6360922d855f8b24f1b813d
    (cherry picked from commit be7d0bb6abc893e53dfc864c52506928b1d38fa3)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/799478
Committed: https://opendev.org/openstack/neutron/commit/f1a9f4ed62fd2567cac174f80f87de53148ea7b9
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit f1a9f4ed62fd2567cac174f80f87de53148ea7b9
Author: Hemanth Nakkina <email address hidden>
Date: Fri Jul 2 17:01:55 2021 +0530

    Update arp entry of snat port on qrouter ns

    In some cases, the arp entry of snat port is not updated
    in qrouter namespace. l3-agent calls get_ports_by_subnet()
    while setting arps for the subnet. And the snat port is
    not returned if it is still unbound. One of the scenario
    this is observed is when router is created, external
    gateway set and internal subnet attached to router in
    quick succession.

    This patch retrieves snat port details from router info
    as well and updates arp entry for snat port.

    Closes-Bug: #1933092
    Change-Id: I7ee797b4b930306cf6360922d855f8b24f1b813d
    (cherry picked from commit be7d0bb6abc893e53dfc864c52506928b1d38fa3)

description: updated
tags: added: sts sts-sru-needed
Hemanth Nakkina (hemanth-n) wrote :

SRU team,
All debdiffs for Ubuntu I/H/G/F and UCA X/W/V/U are uploaded.

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "Debdiff for impish" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Dan Streetman (ddstreet) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.4.0

This issue was fixed in the openstack/neutron 16.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.2.0

This issue was fixed in the openstack/neutron 17.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.1.0

This issue was fixed in the openstack/neutron 18.1.0 release.

Mathew Hodson (mhodson)
Changed in neutron (Ubuntu Focal):
importance: Undecided → Medium
Changed in neutron (Ubuntu Groovy):
importance: Undecided → Medium
Changed in neutron (Ubuntu Hirsute):
importance: Undecided → Medium
Changed in neutron (Ubuntu Impish):
importance: Undecided → Medium
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:18.1.0+git2021072117.147830620f-0ubuntu1

---------------
neutron (2:18.1.0+git2021072117.147830620f-0ubuntu1) impish; urgency=medium

  [ Hemanth Nakkina ]
  * d/p/0001-Update-arp-entry-of-snat-port-on-qrouter-ns.patch: Fix to
    update arp entry of snat port on qrouter ns (LP: #1933092)

  [ Corey Bryant ]
  * New upstream snapshot for OpenStack Xena.
  * d/control: Align (Build-)Depends with upstream.
  * d/p/*: Rebased.
  * d/p/0001-Update-arp-entry-of-snat-port-on-qrouter-ns.patch: Dropped.
    Fixed in upstream snapshot.

 -- Corey Bryant <email address hidden> Wed, 21 Jul 2021 17:22:28 -0400

Changed in neutron (Ubuntu Impish):
status: New → Fix Released
Changed in neutron (Ubuntu Focal):
status: New → Fix Released
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/807244

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc1

This issue was fixed in the openstack/neutron 19.0.0.0rc1 release candidate.
