neutron-ovn-metadata-agent does not respond on network until restarted after SB disconnects

Bug #1953591 reported by Gabriel Barazer
This bug affects 12 people
Affects                Status      Importance   Assigned to   Milestone
Ubuntu Cloud Archive   New         Undecided    Unassigned
networking-ovn         New         Undecided    Unassigned
neutron (Ubuntu)       Confirmed   Undecided    Unassigned

Bug Description

Hi,

We are running OpenStack Xena in a highly dynamic environment with a few hundred tenant networks and projects, and are using OVN set up as a 3-node cluster for the northbound and southbound databases as well as the northd daemon.

With a few hundred instances, we started to notice that newly started instances get their DHCP information via OVN OpenFlow rules, but cloud-init was not applying the initial configuration (easily recognizable in the console log: the prompt at the end shows that the instance has neither its name configured nor the SSH public key installed).

After some investigation we noticed that on these instances the neutron-ovn-metadata-agent was not responding on IP 169.254.169.254. After restarting the agent on the corresponding hypervisor host and waiting a few seconds, a simple reboot of the instance would fix the issue.

It seems the instance metadata server is unreachable specifically when we start an instance in a new network/subnet.

We then enabled debug logging on the metadata agent and only saw that the agents are disconnected from the SB DB and then reconnected immediately, without any additional relevant log messages.

First we looked at our OVN cluster status and noticed that the cluster was flapping very frequently (changing NB, SB and northd leaders); we fixed that by adjusting the inactivity probe and election timers.
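
For reference, this kind of tuning is done with commands roughly like the following (the socket paths and values here are deployment-specific assumptions, not the exact commands we ran):

    # check the SB cluster status (leader, term, time since last election)
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

    # raise the RAFT election timer for the SB database (in ms; it can only be
    # increased gradually, roughly doubling per invocation)
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 16000

    # raise the inactivity probe on the SB connection row (in ms), assuming a
    # single row in the Connection table
    ovn-sbctl set connection . inactivity_probe=60000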

Since then the OVN cluster has been pretty stable and only changes leader (and increments the term) when the SB leader voluntarily transfers leadership to take a snapshot of the database, every few hours according to the "Last election started" timer.

It seems the neutron-ovn-metadata-agent still breaks when the OVN SB changes leaders (the SB connection is dropped and re-established), and does not respond in new networks until it is restarted.

My understanding is that the metadata agent normally creates a new haproxy instance in a dedicated namespace on the host where the instance is created, but fails to do so once it has been disconnected from the SB DB, even after reconnecting to the new SB leader (almost instantly).
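
A quick way to see whether the agent has actually provisioned the proxy for a given network on the hypervisor (as far as we can tell the agent names these namespaces ovnmeta-<network-uuid>; the UUID below is a placeholder):

    # list the per-network metadata namespaces on this hypervisor
    ip netns list | grep ovnmeta-

    # check that a haproxy is listening on port 80 inside the namespace
    ip netns exec ovnmeta-<network-uuid> ss -ltnp | grep ':80'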

The real issue here is that there are no logs other than the usual disconnect/reconnect messages when this happens.

The "openstack network agent list" reports the agent down as well when this occurs and for now we had no other choice than restarting the metadata agent every 5 minutes to somewhat make this issue invisible to our end users.

Is anyone else already having this issue?
We are running OpenStack Xena deployed with kolla-ansible, using container images (CentOS 8 Stream + OpenStack source) built with kolla. The relevant versions are:
Python neutron 19.0.1.dev8
Python ovs 2.13.3
Python ovsdbapp 1.12.0
haproxy 1.8.27-2.el8
openvswitch2.15-2.15.0-41.el8s

neutron_ovn_metadata_agent.ini:
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 10

[ovn]
ovn_nb_connection = tcp:XXX:6641,tcp:YYY:6641,tcp:ZZZ:6641
ovn_sb_connection = tcp:XXX:6642,tcp:YYY:6642,tcp:ZZZ:6642
ovn_metadata_enabled = true

We have literally no logs other than the OVN disconnection and reconnection lines:
2021-12-08 09:11:21.030 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: waiting 2 seconds before reconnect
2021-12-08 09:11:21.033 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
2021-12-08 09:11:21.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
2021-12-08 09:11:22.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
2021-12-08 09:11:22.035 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected
2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected

Revision history for this message
Nikolay Fedorov (jingvar) wrote :

We have the same issue.

Revision history for this message
Igor (aigor) wrote :

Same here, but in Wallaby

Revision history for this message
Javier Cacheiro (javicacheiro) wrote :

We were experiencing the same issue both in Xena and previously in Wallaby.

In our case, increasing the ovsdb_timeout seems to have fixed it:

    ovsdb_timeout: 30

To add some info that can complement the detailed analysis provided by Gabriel:

Inside the neutron_ovn_metadata_agent container, the metadata agent launches a new haproxy for each network that is required on the host. Each haproxy process is responsible for forwarding the requests from that network to the Python neutron-ovn-metadata-agent process running in the container.

When the haproxy goes down for some network, the machines in this network are not able to contact the metadata service, but you do not see anything in the logs because there is no connection to the Python neutron-ovn-metadata-agent process (which is the one actually generating the logs).

Once you restart the container, it relaunches all the required haproxy processes so that the virtual machines on the affected networks can access the metadata service again.

So the key of the issue is the namespace generated for this network and the haproxy that connects this namespace to the Python neutron-ovn-metadata-agent.

As Gabriel points out, it seems that the ovsdb SB is involved in some way, because after increasing the ovsdb_timeout we have not experienced this issue again in the last few weeks.
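
A quick test that does not rely on the agent logs is to query the proxy from inside the network's namespace on the hypervisor (the ovnmeta- prefix and the 169.254.169.254:80 listener are assumptions about how the agent sets things up; the UUID is a placeholder). Any HTTP status code back means the haproxy is alive; a connection error means it is gone:

    ip netns exec ovnmeta-<network-uuid> \
        curl -s -o /dev/null -w '%{http_code}\n' http://169.254.169.254/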

Revision history for this message
Satish Patel (satish-txt) wrote :

I'm having the same issue with CentOS 8 Stream / Wallaby / kolla-ansible.

I have tried the following: added ovsdb_timeout=30 to the ovn_meta config file, but I am still seeing the issue from time to time. I have also set up a cronjob to hide this issue, but it looks like this is a real issue. Why don't we keep haproxy running all the time instead of spawning it when a VM gets created? I believe openstack-ansible takes that approach, with a dedicated haproxy namespace running for this task.

# cat /etc/neutron/neutron_ovn_metadata_agent.ini
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 30

Revision history for this message
Markus Lindenblatt (0-markus) wrote :

We are also affected in all of the environments we are running. The most recent one is running the Wallaby release. We tried raising ovsdb_timeout to 30, but that is not the solution for us.
Yesterday morning we ran into the problem again after a leader change of ovn-sb-db:

from neutron-ovn-metadata-agent.log:
2022-03-09 07:58:23.803 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.804 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.806 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.806 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: connection closed by client
2022-03-09 07:58:23.807 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: connection closed by client
2022-03-09 07:58:23.808 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: connection closed by client
2022-03-09 07:58:23.808 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connecting...
2022-03-09 07:58:23.809 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connected
2022-03-09 07:58:23.809 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connecting...
2022-03-09 07:58:23.809 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connected
2022-03-09 07:58:23.810 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: connecting...
2022-03-09 07:58:23.810 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: connected
2022-03-09 07:58:23.837 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.839 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: connection closed by client
2022-03-09 07:58:24.840 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connecting...
2022-03-09 07:58:24.841 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connected

After the reconnect to the new leader, the metadata service does not deliver any metadata to new instances and gives no further log entries.

After restarting the metadata agents they work again, until the next leader change of ovn-sb-db.

We are using neutron-ovn-metadata-agent version 18.1.2.dev77

Revision history for this message
Ivan Zhang (sail4dream) wrote :

Is this the same bug described in https://bugs.launchpad.net/neutron/+bug/1952550?

Revision history for this message
Tiago Pires (tiagohp) wrote :

Hi all,

I have a similar issue with Ussuri as well. In my case ovsdb_timeout is 180, and I'm seeing the following logs on the compute node side every time the OVN central cluster changes leader because of the snapshot process. It breaks the SB connections with the metadata agents, and a full sync is done again after the metadata agent finds the new leader:

May 11 21:05:11 <ommited> neutron-ovn-metadata-agent[1599694]: 2022-05-11 21:05:11.316 1599694 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.2X.3X.25:6642: clustered database server is not cluster leader; trying another server
May 11 21:05:11 <ommited> neutron-ovn-metadata-agent[1599699]: 2022-05-11 21:05:11.318 1599699 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.2X.3X.25:6642: clustered database server is not cluster leader; trying another server
May 11 21:05:11 <ommited> neutron-ovn-metadata-agent[1599694]: 2022-05-11 21:05:11.318 1599694 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.2X.3X.25:6642: connection closed by client
<ommited>
May 11 21:05:21 <ommited> neutron-ovn-metadata-agent[1599679]: 2022-05-11 21:05:21.332 1599679 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python3/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:124
May 11 21:05:21 <ommited> neutron-ovn-metadata-agent[1599679]: 2022-05-11 21:05:21.339 1599679 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync
May 11 21:05:21 <ommited> neutron-ovn-metadata-agent[1599679]: 2022-05-11 21:05:21.354 1599679 DEBUG neutron.agent.ovn.metadata.agent [-] Provisioning metadata for network 46cc279b-a6fc-41d8-b4f9-d161bf7f9ef4 provision_datapath /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:392

Does someone have a workaround?

More about the metadata agent process here:
https://docs.openstack.org/networking-ovn/latest/contributor/design/metadata_api.html#:~:text=neutron%2Dovn%2Dmetadata%2Dagent%20is%20the%20process%20that%20will,reach%20the%20appropriate%20host%20network.

Tiago Pires

Revision history for this message
Nikolay Fedorov (jingvar) wrote :

We use Xena with
ovs 2.17.1.post1
ovsdbapp 1.16.0
and have a new problem:
when a RAFT leader change happens, neutron-server (the ML2 plugin) gets an ovsdb timeout (180 s by default).
It seems it doesn't catch the leader change and doesn't reconnect.

Revision history for this message
Satish Patel (satish-txt) wrote :

Why is this bug still undecided even after so many folks have shared the same story?

Revision history for this message
Alan Baghumian (alanbach) wrote :

I have personally experienced this with my OpenStack Focal/Xena based cluster and have another client reporting the same issue.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed