[L3] floating IP failed to bind due to no agent gateway port(fip-ns)

Bug #1883089 reported by LIU Yulong
This bug affects 2 people
Affects                 Status        Importance  Assigned to      Milestone
Ubuntu Cloud Archive    Fix Released  Undecided   Unassigned
  Ussuri                Fix Released  Undecided   Unassigned
  Victoria              Fix Released  Undecided   Unassigned
neutron                 Fix Released  Medium      Unassigned
neutron (Ubuntu)        Fix Released  Undecided   Unassigned
  Focal                 Fix Released  Undecided   Hemanth Nakkina
  Groovy                Fix Released  Undecided   Unassigned
  Hirsute               Fix Released  Undecided   Unassigned
  Impish                Fix Released  Undecided   Unassigned

Bug Description

Patch [1] introduced a DB unique constraint on the L3 agent gateway
binding. In some extreme cases the DvrFipGatewayPortAgentBinding
exists in the DB while the gateway port does not. The current code path
only checks that the binding exists, so it passes a "None" port to the
code that follows, which results in an AttributeError.

[1] https://review.opendev.org/#/c/702547/
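
The failure mode described above can be sketched in a few lines of Python. This is a simplified illustration, not the neutron code: `FakeDvrDb`, its in-memory `bindings` and `ports` stores, and both `get_or_create_*` methods are hypothetical stand-ins for the DB-backed logic in neutron/db/l3_dvr_db.py. It shows a binding that outlives its port, a lookup that returns None, and a more robust check that recreates the port when it is missing.

```python
class FakeDvrDb:
    """Hypothetical in-memory stand-in for the DVR FIP gateway DB logic."""

    def __init__(self):
        self.bindings = set()  # (network_id, host) pairs, unique like the DB constraint
        self.ports = {}        # (network_id, host) -> port dict

    def _create_port(self, network_id, host):
        port = {
            "id": "port-%d" % (len(self.ports) + 1),
            "network_id": network_id,
            "binding:host_id": host,
            "fixed_ips": [{"ip_address": "203.0.113.10"}],  # made-up address
        }
        self.ports[(network_id, host)] = port
        return port

    def get_or_create_buggy(self, network_id, host):
        key = (network_id, host)
        if key not in self.bindings:
            self.bindings.add(key)
            return self._create_port(network_id, host)
        # Binding exists, so the port is assumed to exist too.
        return self.ports.get(key)  # None when the port is gone

    def get_or_create_robust(self, network_id, host):
        key = (network_id, host)
        self.bindings.add(key)
        port = self.ports.get(key)
        if port is None:
            # Binding without a port: create the port instead of returning None.
            port = self._create_port(network_id, host)
        return port


def fixed_ips_of(port):
    # Same call that fails in the traceback: port.get('fixed_ips', [])
    return port.get("fixed_ips", [])


db = FakeDvrDb()
db.bindings.add(("ext-net", "host-compute-1"))  # stale binding, no port

try:
    fixed_ips_of(db.get_or_create_buggy("ext-net", "host-compute-1"))
    buggy_error = None
except AttributeError as exc:
    buggy_error = str(exc)

robust_port = db.get_or_create_robust("ext-net", "host-compute-1")
print(buggy_error)
print(fixed_ips_of(robust_port))
```

The buggy path raises the same AttributeError seen in the log below, while the robust path returns a usable port for the same stale binding.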

Exception log:

2020-06-11 15:39:28.361 1285214 INFO neutron.db.l3_dvr_db [None req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Floating IP Agent Gateway port for network 3fcb7702-ae0b-46b4-807f-8ae94d656dd3 does not exist on host host-compute-1. Creating one.
2020-06-11 15:39:28.370 1285214 DEBUG neutron.db.l3_dvr_db [None req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Floating IP Agent Gateway port for network 3fcb7702-ae0b-46b4-807f-8ae94d656dd3 already exists on host host-compute-1. Probably it was just created by other worker. create_fip_agent_gw_port_if_not_exists /usr/lib/python2.7/site-packages/neutron/db/l3_dvr_db.py:927
2020-06-11 15:39:28.390 1285214 DEBUG neutron.db.l3_dvr_db [None req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Floating IP Agent Gateway port None found for the destination host: host-compute-1 create_fip_agent_gw_port_if_not_exists /usr/lib/python2.7/site-packages/neutron/db/l3_dvr_db.py:933
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server [None req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Exception during message handling: AttributeError: 'NoneType' object has no attribute 'get'
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 170, in _process_incoming
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 91, in wrapped
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server setattr(e, '_RETRY_EXCEEDED', True)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 87, in wrapped
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 147, in wrapper
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server ectxt.value = e.inner_exc
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 135, in wrapper
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 126, in wrapped
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server LOG.debug("Retry wrapper got retriable exception: %s", e)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 122, in wrapped
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server return f(*dup_args, **dup_kwargs)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 348, in get_agent_gateway_port
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server admin_ctx, network_id, host)
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/l3_dvr_db.py", line 953, in create_fip_agent_gw_port_if_not_exists
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server self._populate_mtu_and_subnets_for_ports(context, [agent_port])
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/l3_db.py", line 1978, in _populate_mtu_and_subnets_for_ports
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server for p in self._each_port_having_fixed_ips(ports)]
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/neutron/db/l3_db.py", line 1925, in _each_port_having_fixed_ips
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server fixed_ips = port.get('fixed_ips', [])
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server AttributeError: 'NoneType' object has no attribute 'get'
2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server

-------------------------------------------------------------------------

[SRU]

[Impact]
In some cases the DvrFipGatewayPortAgentBinding is in DB but the gateway port does not exist.
This resulted in connectivity issues to FIP for the new VMs launched on that compute node.
The fix creates the gateway port if it does not exist.

[Test Plan]
The bug is a race condition and is difficult to reproduce, so the test case simulates the error condition to verify the fix.

* Deploy OpenStack with DVR, L3 HA and centralised SNAT on the neutron nodes
* Deploy instances and delete them. This step ensures that FIP agent gateways are created on the compute nodes

  Run the following command to see the FIP agent gateway information
  openstack port list --network ext_net -c id -c device_id -c binding_host_id -c device_owner -c fixed_ips | grep floatingip_agent_gateway

* Pick one of the compute nodes that has no instances and delete its FIP agent gateway port (the port id can be determined from the above command)
  openstack port delete <port id>

* Launch an instance on the compute node
  openstack server create --wait --image cirros --flavor m1.cirros --nic net-id=<network id> --availability-zone nova:<hostname> cirros-test1

* Check the neutron-server logs for the error
  ERROR oslo_messaging.rpc.server AttributeError: 'NoneType' object has no attribute 'get'

* Assign a floating IP to the instance and try to ping it; without the fix, the ping fails
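
The log check in the test plan can be scripted. The following is a minimal Python sketch, assuming the neutron-server log lines are available as an iterable of strings; `scan_log` and the regular expression are hypothetical helpers, and the host name is only recoverable when DEBUG logging is enabled, since the "port None found" message in this report is logged at DEBUG level.

```python
import re

# Error text taken from the exception log in this report.
ERROR_SIGNATURE = "AttributeError: 'NoneType' object has no attribute 'get'"

# DEBUG message from this report that names the affected host.
HOST_RE = re.compile(
    r"Floating IP Agent Gateway port None found for the destination host: (\S+)"
)


def scan_log(lines):
    """Return (error_seen, sorted list of hosts that got a None gateway port)."""
    error_seen = False
    hosts = set()
    for line in lines:
        if ERROR_SIGNATURE in line:
            error_seen = True
        match = HOST_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return error_seen, sorted(hosts)


sample = [
    "DEBUG neutron.db.l3_dvr_db Floating IP Agent Gateway port None found "
    "for the destination host: host-compute-1 create_fip_agent_gw_port_if_not_exists",
    "ERROR oslo_messaging.rpc.server AttributeError: 'NoneType' object has "
    "no attribute 'get'",
]
print(scan_log(sample))
```

To run it against a real log, pass an open file object instead of `sample`; the log path depends on the deployment.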

[Where problems could occur]
The fix adds an extra check to determine when the gateway port needs to be created,
so it is not expected to cause any regression.

LIU Yulong (dragon889)
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/735432

Changed in neutron:
status: Confirmed → In Progress
tags: added: l3-dvr-backlog
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/735762

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/735432
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5fdfd4cbfc56425c56b4a5702d92a97da56ab6e8
Submitter: Zuul
Branch: master

commit 5fdfd4cbfc56425c56b4a5702d92a97da56ab6e8
Author: LIU Yulong <email address hidden>
Date: Sat Jun 13 23:14:47 2020 +0800

    [L3] Check agent gateway port robustly

    In patch [1] it introduced a binding of DB uniq constraint for L3
    agent gateway. In some extreme case the DvrFipGatewayPortAgentBinding
    is in DB while the gateway port not. The current code path only checks
    the binding existence which will pass a "None" port to the following
    code path that results an AttributeError. This patch adds a simple check
    for that gateway port, if it is not created, new one.

    [1] https://review.opendev.org/#/c/702547/

    Closes-Bug: #1883089
    Change-Id: Ia90f2ee435b0a3476dbea028d3200cefe11e35e4

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/735762
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8dee0d9a4eb4282b989f2c77a79e55aa89554788
Submitter: Zuul
Branch: master

commit 8dee0d9a4eb4282b989f2c77a79e55aa89554788
Author: LIU Yulong <email address hidden>
Date: Tue Jun 16 10:02:24 2020 +0800

    [L3] Delete DvrFipGatewayPortAgentBindings after no gw ports

    This is the code behavior aligning for dvr related logical. The
    L3 dvr DB will remove all related FIP agent gateway port after there
    is no real use of it. But the DvrFipGatewayPortAgentBindings remain,
    it will cause the issue of new floating IP failed to bind. This
    patch adds the binding deleting action.

    Related-bug: #1883089
    Change-Id: I62c29e172bc8705dade11d37bb347241ef8ad5f8
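
The cleanup described in this related commit can be pictured with a small Python sketch. It is not the neutron implementation: `delete_gateway_ports`, the `ports` dict and the `bindings` set are hypothetical in-memory stand-ins, keyed here by an assumed (network_id, host) pair, that only illustrate removing the binding together with the port so that no stale binding is left behind.

```python
def delete_gateway_ports(ports, bindings, network_id):
    """Remove every FIP agent gateway port of a network and its binding.

    ports:    dict mapping (network_id, host) -> port dict (hypothetical store)
    bindings: set of (network_id, host) pairs (hypothetical store)
    """
    removed = [key for key in ports if key[0] == network_id]
    for key in removed:
        del ports[key]
        # Step described in the commit message: also drop the binding,
        # otherwise it outlives the port it refers to.
        bindings.discard(key)
    return removed


ports = {
    ("ext-net", "host-compute-1"): {"id": "port-1"},
    ("ext-net", "host-compute-2"): {"id": "port-2"},
    ("other-net", "host-compute-1"): {"id": "port-3"},
}
bindings = set(ports)

removed = delete_gateway_ports(ports, bindings, "ext-net")
print(sorted(removed))
print(sorted(bindings))
```

After the call, only the binding for the untouched network remains, so a later gateway port lookup for "ext-net" starts from a clean state.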

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: LIU Yulong (dragon889) → nobody
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ussuri)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/779613
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/779613
Committed: https://opendev.org/openstack/neutron/commit/4093727ae9c751cce8c39dd0eb022b2c9b90d624
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 4093727ae9c751cce8c39dd0eb022b2c9b90d624
Author: LIU Yulong <email address hidden>
Date: Sat Jun 13 23:14:47 2020 +0800

    [L3] Check agent gateway port robustly

    In patch [1] it introduced a binding of DB uniq constraint for L3
    agent gateway. In some extreme case the DvrFipGatewayPortAgentBinding
    is in DB while the gateway port not. The current code path only checks
    the binding existence which will pass a "None" port to the following
    code path that results an AttributeError. This patch adds a simple check
    for that gateway port, if it is not created, new one.

    [1] https://review.opendev.org/#/c/702547/

    Closes-Bug: #1883089
    Change-Id: Ia90f2ee435b0a3476dbea028d3200cefe11e35e4
    (cherry picked from commit 5fdfd4cbfc56425c56b4a5702d92a97da56ab6e8)

tags: added: in-stable-ussuri
Changed in neutron (Ubuntu Impish):
status: New → Fix Released
Changed in neutron (Ubuntu Hirsute):
status: New → Fix Released
Changed in neutron (Ubuntu Groovy):
status: New → Fix Released
tags: added: sts
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :
description: updated
tags: added: sts-sru-needed
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

SRU Team,

The fix consists of 2 commits (the stable/ussuri reviews are referenced below):
https://review.opendev.org/c/openstack/neutron/+/779614
https://review.opendev.org/c/openstack/neutron/+/779613

779614 is already part of focal (it was included in the latest ussuri stable point release on Apr 12).
I have uploaded a debdiff with the changes from 779613.

Changed in neutron (Ubuntu Focal):
assignee: nobody → Hemanth Nakkina (hemanth-n)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.3.2

This issue was fixed in the openstack/neutron 16.3.2 release.

Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello LIU, or anyone else affected,

Accepted neutron into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:16.3.1-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Focal):
status: New → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Tested on focal with neutron package 2:16.3.1-0ubuntu1.1 and the test case is successful

* Deployed environment with dvr l3ha and centralised neutron snat gateways.

* Floating IP agent gateway exists on all nova-compute nodes (4) and neutron-gateway nodes (3) after launching VMs on all compute nodes
$ openstack port list --network ext_net -c id -c device_id -c binding_host_id -c device_owner -c fixed_ips | grep floatingip_agent_gateway | wc -l
7

* Deleted one of the Floating IP agent gateway ports
$ openstack port list --network ext_net -c id -c device_id -c binding_host_id -c device_owner -c fixed_ips | grep floatingip_agent_gateway | wc -l
6

* Launched a VM on the node where the gateway port was deleted. The Floating IP agent gateway came back on the node
$ openstack port list --network ext_net -c id -c device_id -c binding_host_id -c device_owner -c fixed_ips | grep floatingip_agent_gateway | wc -l
7

* Ping to the floating IP is successful
$ ping -c 1 10.5.151.84
PING 10.5.151.84 (10.5.151.84) 56(84) bytes of data.
64 bytes from 10.5.151.84: icmp_seq=1 ttl=62 time=293 ms

--- 10.5.151.84 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 292.825/292.825/292.825/0.000 ms
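
The 7 / 6 / 7 port counts in this verification can also be checked programmatically. The sketch below is a hypothetical Python equivalent of the `grep floatingip_agent_gateway | wc -l` pipeline, applied to captured `openstack port list` output; the function names and the sample rows are made up for illustration.

```python
DEVICE_OWNER = "floatingip_agent_gateway"


def count_fip_gateway_ports(port_list_output):
    """Count output lines that mention the FIP agent gateway device owner."""
    return sum(
        1 for line in port_list_output.splitlines() if DEVICE_OWNER in line
    )


def gateway_port_restored(before, after_delete, after_launch):
    """True when one port disappeared after the delete and came back after the launch."""
    return after_delete == before - 1 and after_launch == before


# Made-up rows in the shape of the CLI table output.
sample = "\n".join([
    "| port-a | dev-x | host-compute-1 | network:floatingip_agent_gateway | ... |",
    "| port-b | dev-y | host-compute-1 | network:router_gateway | ... |",
    "| port-c | dev-z | host-compute-2 | network:floatingip_agent_gateway | ... |",
])
print(count_fip_gateway_ports(sample))   # 2
print(gateway_port_restored(7, 6, 7))    # True
```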

tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:16.3.1-0ubuntu1.1

---------------
neutron (2:16.3.1-0ubuntu1.1) focal; urgency=medium

  [ Hemanth Nakkina ]
  * Fix to check L3 agent gateway port robustly (LP: #1883089)
    - d/p/0001-L3-Check-agent-gateway-port-robustly.patch.

 -- Chris MacNaughton <email address hidden> Thu, 06 May 2021 10:39:19 +0000

Changed in neutron (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello LIU, or anyone else affected,

Accepted neutron into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Verified the test case on bionic-ussuri; it passes with the package in cloud-archive:ussuri-proposed

Deleted the floating IP agent gateway port on one of the compute nodes, launched a new VM on that compute node and assigned a FIP. Able to ping the floating IP.

$ ping -c 1 10.5.153.114
PING 10.5.153.114 (10.5.153.114) 56(84) bytes of data.
64 bytes from 10.5.153.114: icmp_seq=1 ttl=62 time=3.66 ms

--- 10.5.153.114 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.662/3.662/3.662/0.000 ms

tags: added: verification-done verification-ussuri-done
removed: verification-needed verification-ussuri-needed
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

UCA Ussuri is released to ussuri-updates in package 2:16.3.2-0ubuntu3~cloud0, so marking the status as Fix released for UCA Ussuri

Changed in cloud-archive:
status: New → Fix Released