neutron-l3-agents won't become active

Bug #1917409 reported by Brad Marshall
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | High | Unassigned |
neutron (Ubuntu) | New | Undecided | Unassigned |

Bug Description

We have an OpenStack Ussuri cloud deployed on Ubuntu 20.04 using the Juju charms from the 20.08 bundle (we are planning to upgrade soon).

The problem is that all L3 agents for routers using a particular external network show their ha_state as standby. I've tried removing and re-adding them, and the state never goes to active.

$ neutron l3-agent-list-hosting-router bradm-router
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| 09ae92c9-ae8f-4209-b1a8-d593cc6d6602 | oschv1.maas | True           | :-)   | standby  |
| 4d9fe934-b1f8-4c2b-83ea-04971f827209 | oschv2.maas | True           | :-)   | standby  |
| 70b8b60e-7fbd-4b3a-80a3-90875ca72ce6 | oschv4.maas | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+
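
For anyone who wants to check this condition across many routers, here is a minimal Python sketch that parses table output like the above and reports whether any hosting agent is active. The helper names (parse_cli_table, has_active_agent) are made up for illustration and are not part of any OpenStack tooling; the sample text is the output from this report.

```python
def parse_cli_table(text):
    """Parse a '+---+' / '| a | b |' style CLI table into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines, warnings and the command itself
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def has_active_agent(rows):
    """Return True if at least one hosting agent reports ha_state 'active'."""
    return any(row.get("ha_state") == "active" for row in rows)


SAMPLE = """
+--------------------------------------+-------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| 09ae92c9-ae8f-4209-b1a8-d593cc6d6602 | oschv1.maas | True | :-) | standby |
| 4d9fe934-b1f8-4c2b-83ea-04971f827209 | oschv2.maas | True | :-) | standby |
| 70b8b60e-7fbd-4b3a-80a3-90875ca72ce6 | oschv4.maas | True | :-) | standby |
+--------------------------------------+-------------+----------------+-------+----------+
"""

rows = parse_cli_table(SAMPLE)
for row in rows:
    print(row["host"], row["ha_state"])
print("active agent present:", has_active_agent(rows))
```

With the output from this report, every agent is listed as standby, so the check reports that no active agent is present.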

This generates a stack trace:

2021-03-01 02:59:47.344 3675486 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'get'
Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)

  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 276, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)

  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 196, in _do_dispatch
    result = func(ctxt, **new_args)

  File "/usr/lib/python3/dist-packages/neutron_lib/db/api.py", line 139, in wrapped
    setattr(e, '_RETRY_EXCEEDED', True)

  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()

  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)

  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value

  File "/usr/lib/python3/dist-packages/neutron_lib/db/api.py", line 135, in wrapped
    return f(*args, **kwargs)

  File "/usr/lib/python3/dist-packages/oslo_db/api.py", line 154, in wrapper
    ectxt.value = e.inner_exc

  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()

  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)

  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value

  File "/usr/lib/python3/dist-packages/oslo_db/api.py", line 142, in wrapper
    return f(*args, **kwargs)

  File "/usr/lib/python3/dist-packages/neutron_lib/db/api.py", line 183, in wrapped
    LOG.debug("Retry wrapper got retriable exception: %s", e)

  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()

  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)

  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value

  File "/usr/lib/python3/dist-packages/neutron_lib/db/api.py", line 179, in wrapped
    return f(*dup_args, **dup_kwargs)

  File "/usr/lib/python3/dist-packages/neutron/api/rpc/handlers/l3_rpc.py", line 306, in get_agent_gateway_port
    agent_port = self.l3plugin.create_fip_agent_gw_port_if_not_exists(

  File "/usr/lib/python3/dist-packages/neutron/db/l3_dvr_db.py", line 1101, in create_fip_agent_gw_port_if_not_exists
    self._populate_mtu_and_subnets_for_ports(context, [agent_port])

  File "/usr/lib/python3/dist-packages/neutron/db/l3_db.py", line 1772, in _populate_mtu_and_subnets_for_ports
    network_ids = [p['network_id']

  File "/usr/lib/python3/dist-packages/neutron/db/l3_db.py", line 1772, in <listcomp>
    network_ids = [p['network_id']

  File "/usr/lib/python3/dist-packages/neutron/db/l3_db.py", line 1720, in _each_port_having_fixed_ips
    fixed_ips = port.get('fixed_ips', [])
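
Reading the last frames, it looks as though agent_port was None when it was wrapped in a list and handed to _populate_mtu_and_subnets_for_ports, so port.get() was called on None. The Python sketch below reproduces that failure mode with simplified stand-in functions. These are not the neutron implementation, and the guarded variant is only one possible defensive check, not necessarily what any proposed patch does.

```python
def each_port_having_fixed_ips(ports):
    # Simplified stand-in: mirrors the port.get('fixed_ips', []) call
    # shown in the final traceback frame.
    for port in ports or []:
        fixed_ips = port.get("fixed_ips", [])
        if fixed_ips:
            yield port


def network_ids_unguarded(ports):
    # Fails with AttributeError if any entry in ports is None.
    return [p["network_id"] for p in each_port_having_fixed_ips(ports)]


def network_ids_guarded(ports):
    # Hypothetical defensive variant: drop missing ports before iterating.
    present = [p for p in ports if p is not None]
    return [p["network_id"] for p in each_port_having_fixed_ips(present)]


agent_port = None  # assumption: the gateway port lookup/creation returned nothing

error_message = None
try:
    network_ids_unguarded([agent_port])
except AttributeError as exc:
    error_message = str(exc)
    print("unguarded:", error_message)

print("guarded:", network_ids_guarded([agent_port]))
```

A guard like this only stops the exception. It does not explain why the port was missing in the first place, which is the part that would still need investigating.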

This system ran successfully after deployment and was then left running for a while; when we revisited it, it was in this state. I have been unable to work out what caused it.

Versions:
Ubuntu 20.04
Juju charms 20.08
OpenStack Ussuri
Environment: Clustered services using containers on converged hypervisors

$ dpkg-query -W neutron-common
neutron-common 2:16.2.0-0ubuntu2

Please let me know if there is any further information that could be used to see what is happening here.

Slawek Kaplonski (slaweq) wrote :

Can you share the full logs from the neutron-server (where I assume this issue happened) and the l3-agent?
Maybe you can also tell us which extension drivers you are using in your neutron-l3-agent and which service plugins you have loaded in neutron-server.

Is that issue happening only for one router or for all routers which You have there?

Changed in neutron:
importance: Undecided → High
Brad Marshall (brad-marshall) wrote :

The traceback is happening in the neutron-l3-agent logs, I'll attach the file with the error in it.

neutron-l3-agent has the fwaas_v2 extension enabled.

neutron-server has the following service plugin line:

 service_plugins = router,firewall_v2,metering,segments,neutron_dynamic_routing.services.bgp.bgp_plugin.BgpPlugin

I can confirm it is occurring for all routers that have an external network. I'll do some investigating on the rest.

Brad Marshall (brad-marshall) wrote :

Attached is the neutron-l3-agent log for the time when I was adding and removing the routers from the l3-agents, which includes the stack trace. I've redacted the public IP addresses, as I don't think they are required.

Slawek Kaplonski (slaweq) wrote :

Thx for the logs, we will try to investigate that.
Can You also check if the same issue happens in current master branch?

tags: added: l3-ha
Brad Marshall (brad-marshall) wrote :

Unfortunately we don't currently have an easy way to test the master branch, as we've deployed this system using Juju charms.

Michael D'Silva (madsi1m) wrote :

Made this change to stop the Python code from crashing: https://review.opendev.org/c/openstack/neutron/+/779406
Not sure if this fixes the underlying problem or only a symptom, but it does stop Python from crashing.

Brad Marshall (brad-marshall) wrote :

I see this is marked as a duplicate of LP#1883089. Is there any chance this fix will be backported to Ubuntu Focal + Ussuri?

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Brad:

You can propose any patch in gerrit whenever you want. This is the backport list for this patch: https://review.opendev.org/q/8dee0d9a4eb4282b989f2c77a79e55aa89554788

Regards.

Seth Arnold (seth-arnold) wrote :

Argh, perhaps I've made things worse, I added an ubuntu source neutron task for this, unclicked the 'duplicate' bug, but that sets the wrong state for the upstream neutron, which was handled in https://bugs.launchpad.net/neutron/+bug/1883089 -- I'm not sure how to undo the mess I've made. Anyway, Brad mentions this affects the ubuntu package.

Changed in neutron:
status: New → Fix Released
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/779406
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Brad Marshall (brad-marshall) wrote :

As the fix this was a duplicate of has been abandoned, can this still be addressed somehow? Is there anything I can do to move this forward?
