Neutron linuxbridge agent does not create VXLAN interface

Bug #1952611 reported by masterpe
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Mohammed Naser

Bug Description

High level description:
After a reboot of the network controller, the neutron linuxbridge agent does not create all network links of the tenant networks correctly.
This behavior is not limited to nodes that were rebooted; it also appears on nodes that have been running for a longer time, where it is no longer possible to spawn new networks, DHCP agents or L3 routers.

* Pre-conditions:
I was able to reproduce the issue by deleting the vxlan interface of a tenant network and then rebooting the node.
After the reboot, for the interface of a namespace (for example the dhcp namespace):
- the interface does not get attached to the bridge
- the bridge does not get created
- the vxlan interface does not get created

* Step-by-step reproduction steps
1. Create tenant networks until you have about 115
2. Reboot the network controller

* Expected output:
The expected behavior is that after the reboot the vxlan interfaces get recreated and that we see the following chain:
interface ns-41a5e6e6-b3@if105 (inside the dhcp namespace) -> tap41a5e6e6-b3@if2 (outside the namespace) -> brq8c9de29c-e2 -> vxlan-149
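
The chain above can be derived from the network and port IDs. The following is a minimal sketch, assuming the usual linuxbridge naming convention (a fixed prefix plus the first 11 characters of the ID, and `vxlan-` plus the VNI). The function name and the full UUIDs are hypothetical; only their first 11 characters come from this report.

```python
# Sketch only: the prefixes and the 11-character truncation are assumptions
# about the linuxbridge naming convention, not taken from this report.
RESOURCE_ID_LENGTH = 11


def expected_devices(network_id, port_id, vni):
    """Return the device names expected for one DHCP port, namespace side first."""
    return {
        "namespace_device": "ns-" + port_id[:RESOURCE_ID_LENGTH],
        "tap": "tap" + port_id[:RESOURCE_ID_LENGTH],
        "bridge": "brq" + network_id[:RESOURCE_ID_LENGTH],
        "vxlan": "vxlan-" + str(vni),
    }


# Hypothetical full UUIDs; only the first 11 characters match the report.
names = expected_devices(
    network_id="8c9de29c-e2aa-4000-8000-000000000000",
    port_id="41a5e6e6-b3bb-4000-8000-000000000000",
    vni=149,
)
print(" -> ".join(names.values()))
```

Comparing these names against `ip link` and `brctl show` on the node shows which link of the chain is missing.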

* Actual output:
after the reboot:
  * interface ns-41a5e6e6-b3@if105 exists
  * interface tap41a5e6e6-b3@if2 exists
  * interface tap41a5e6e6-b3@if2 not attached to any bridge
  * bridge brq8c9de29c-e2 does not exist
  * vxlan interface does not exist

* Version:
  ** OpenStack Ussuri
  ** Installed with OpenStack-Ansible version 21.2.6
  ** neutron version 7.1.1
  ** Ubuntu Bionic

Perceived severity: The cluster is in a degraded state in which it is only possible to create new
networks, DHCP agents and L3 routers on one of the three network nodes.

Tags: linuxbridge
masterpe (michiel-y)
tags: added: linu neurton
tags: added: linux
removed: linu neurton
tags: added: linuxbridge neutron
removed: linux
description: updated
James Denton (james-denton) wrote :

Hi - thanks for the report. Can you confirm the linuxbridge agent has started properly, and if there's anything useful in the journal log? Same for the dhcp and l3 agents.

Also, are you running on metal or in lxc?

masterpe (michiel-y) wrote :

Yes, the linuxbridge has started properly. Sadly I could not find anything in the debug logs.

We are running the services l3, dhcp, metadata and linuxbridge agents on metal.

description: updated
tags: removed: neutron
Lajos Katona (lajos-katona) wrote :

Hi, could you please upload the logs somewhere?

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Can you provide the DHCP and the linux bridge agent logs? In debug mode if possible.

Regards.

masterpe (michiel-y) wrote :

You can find all the logs of the services at: https://raw.githubusercontent.com/mpiscaer/gitst/master/2021-11-30/journalctl.log

On tctrko1 the following interfaces are missing a bridge: https://raw.githubusercontent.com/mpiscaer/gitst/master/2021-11-30/tctrko1-missing-bridge

You can find the output of the following commands on https://github.com/mpiscaer/gitst/tree/master/2021-11-30 :

# brctl show
# ip link
# ip netns
# for foo in $(ip netns|awk '{print $1}'); do echo "$foo"; ip netns exec "$foo" ip addr; done # get all the interfaces inside of the namespaces
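
To pick out the tap devices that are not attached to any bridge, the one-line-per-device output of `ip -o link show` can be filtered for lines without a `master` field. This is a minimal sketch; the function name and the sample lines are illustrative and are not taken from the uploaded logs.

```python
import re

# Matches the interface name at the start of an `ip -o link` line,
# e.g. "105: tap41a5e6e6-b3@if2: <...>" gives "tap41a5e6e6-b3".
LINK_RE = re.compile(r"^\d+:\s+([^:@\s]+)(?:@\S+)?:")


def unattached_taps(ip_link_output):
    """Return tap devices whose line carries no "master <bridge>" field."""
    missing = []
    for line in ip_link_output.splitlines():
        match = LINK_RE.match(line)
        if not match:
            continue
        name = match.group(1)
        if name.startswith("tap") and " master " not in line:
            missing.append(name)
    return missing


# Illustrative sample, not taken from the uploaded logs.
SAMPLE = """\
104: tap11111111-aa@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master brq22222222-bb state UP mode DEFAULT group default qlen 1000
105: tap41a5e6e6-b3@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
"""
print(unattached_taps(SAMPLE))
```

On a live node the input could come from `subprocess.run(["ip", "-o", "link", "show"], capture_output=True, text=True).stdout`.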

Mohammed Naser (mnaser) wrote :

Adding some troubleshooting notes:

https://github.com/openstack/neutron/blob/0bdf3b56e0d4ede2d46eed09a4bb07dd3c00807d/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L258-L268

I'm going to assume the failure happens somewhere there, since this function is either not being called or is failing somewhere.

EDIT1: OK, so that file hasn't changed since Ussuri at least, which tells me there's nothing wrong with that part of the code.

Mohammed Naser (mnaser) wrote :

I've managed to grab the following from GMR:

Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: Exception in thread privsep_reader:
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: Traceback (most recent call last):
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: self.run()
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: File "/usr/lib/python3.6/threading.py", line 864, in run
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: self._target(*self._args, **self._kwargs)
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: File "/openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/comm.py", line 134, in _reader_main
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: for msg in reader:
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: File "/openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/comm.py", line 78, in __next__
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: return next(self.unpacker)
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: File "msgpack/_unpacker.pyx", line 562, in msgpack._cmsgpack.Unpacker.__next__
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: File "msgpack/_unpacker.pyx", line 493, in msgpack._cmsgpack.Unpacker._unpack
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: ValueError: 1174941 exceeds max_str_len(1048576)

I'm starting to think that in cases where there are a lot of interfaces, the `msgpack` message is too big to pass through privsep, and therefore the privsep reader thread dies.
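
To check whether other agents or nodes hit the same limit, the journal can be scanned for this error signature. The sketch below is a hypothetical helper (its name and regular expression are not part of neutron or oslo.privsep). It extracts the message size and the limit from matching lines and reports by how many bytes the limit was exceeded.

```python
import re

# Signature of the msgpack size-limit error seen in the traceback.
LIMIT_RE = re.compile(r"ValueError: (\d+) exceeds max_str_len\((\d+)\)")


def find_oversized_messages(log_lines):
    """Yield (size, limit, overshoot) for each matching log line."""
    for line in log_lines:
        match = LIMIT_RE.search(line)
        if match:
            size, limit = int(match.group(1)), int(match.group(2))
            yield size, limit, size - limit


line = ("Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: "
        "ValueError: 1174941 exceeds max_str_len(1048576)")
hits = list(find_oversized_messages([line]))
print(hits)
```

For the line from this report the message is 126365 bytes over the 1 MiB (1048576 byte) limit.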

EDIT: found this related change: https://review.opendev.org/c/openstack/neutron/+/777977

Mohammed Naser (mnaser) wrote :
Changed in neutron:
assignee: nobody → Mohammed Naser (mnaser)
Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.privsep 2.8.0

This issue was fixed in the openstack/oslo.privsep 2.8.0 release.
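
A quick way to see whether a given virtualenv already carries that release is to compare the installed `oslo.privsep` version against 2.8.0. This is a minimal sketch assuming Python 3.8 or newer (for `importlib.metadata`); the helper names are hypothetical. It only compares version numbers, so a stable-branch package with a backported fix would still be reported as older.

```python
from importlib import metadata  # Python 3.8+

FIXED_IN = (2, 8, 0)  # release named in the comment above


def version_tuple(version):
    """Turn "2.8.0" into (2, 8, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(version):
    """True if the version string is 2.8.0 or newer."""
    return version_tuple(version) >= FIXED_IN


def installed_privsep_version():
    """Return the installed oslo.privsep version, or None if absent."""
    try:
        return metadata.version("oslo.privsep")
    except metadata.PackageNotFoundError:
        return None


current = installed_privsep_version()
if current is None:
    print("oslo.privsep is not installed in this environment")
else:
    print(current, "has the fix" if has_fix(current) else "predates 2.8.0")
```

Run it with the interpreter of the neutron virtualenv so that the version reported is the one the agent actually loads.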

Changed in neutron:
status: New → Fix Released