neutron-ovn-metadata-agent does not respond on network until restarted after SB disconnects

Bug #1953591 reported by Gabriel Barazer
This bug affects 12 people
Affects                Status      Importance   Assigned to   Milestone
Ubuntu Cloud Archive   New         Undecided    Unassigned
networking-ovn         New         Undecided    Unassigned
neutron (Ubuntu)       Confirmed   Undecided    Unassigned

Bug Description

Hi,

We are running OpenStack Xena in a highly dynamic environment with a few hundred tenant networks and projects, and are using OVN set up as a 3-node cluster for the northbound and southbound databases as well as the northd daemon.

With a few hundred instances, we started to notice that newly started instances get their DHCP information via OVN OpenFlow rules, but cloud-init was not applying the initial configuration (easily recognizable in the console log: the prompt at the end shows that the instance has neither its name configured nor the SSH public key installed).

After some investigation we noticed that on these instances the neutron-ovn-metadata-agent was not responding on IP 169.254.169.254. After restarting the agent on the corresponding hypervisor host and waiting a few seconds, a simple reboot of the instance would fix the issue.

It seems the instance metadata server is unreachable specifically when we start an instance in a new network/subnet.

We then enabled debug logging on the metadata agent and only saw that the agents are disconnected from the SB DB and then reconnected immediately, without any additional relevant log messages.

First we looked at our OVN cluster status and noticed that the cluster was flapping very frequently (changing NB, SB and northd leaders); we fixed that by adjusting the inactivity probe and election timers.
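
For reference, this kind of tuning is done with commands roughly like the following (the socket paths and values here are deployment-specific assumptions, not the exact commands we ran):

    # check the SB cluster status (leader, term, time since last election)
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

    # raise the RAFT election timer for the SB database (in ms; it can only be
    # increased gradually, roughly doubling per invocation)
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 16000

    # raise the inactivity probe on the SB connection row (in ms), assuming a
    # single row in the Connection table
    ovn-sbctl set connection . inactivity_probe=60000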

Since then the OVN cluster has been pretty stable and only changes leader (and increments the term) when the SB leader voluntarily transfers leadership to take a snapshot of the database, every few hours according to the "Last election started" timer.

It seems the neutron-ovn-metadata-agent still breaks when the OVN SB changes leaders (the SB connection is dropped and re-established), and does not respond in new networks until it is restarted.

My understanding is that the metadata agent normally creates a new haproxy instance in a dedicated namespace on the host where the instance is created, but fails to do so once it has been disconnected from the SB DB, even after reconnecting to the new SB leader (almost instantly).
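
A quick way to see whether the agent has actually provisioned the proxy for a given network on the hypervisor (as far as we can tell the agent names these namespaces ovnmeta-<network-uuid>; the UUID below is a placeholder):

    # list the per-network metadata namespaces on this hypervisor
    ip netns list | grep ovnmeta-

    # check that a haproxy is listening on port 80 inside the namespace
    ip netns exec ovnmeta-<network-uuid> ss -ltnp | grep ':80'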

The real issue here is that there are no logs other than the usual disconnect/reconnect messages when this happens.

The "openstack network agent list" reports the agent down as well when this occurs and for now we had no other choice than restarting the metadata agent every 5 minutes to somewhat make this issue invisible to our end users.

Is anyone else already having this issue?
We are running OpenStack Xena deployed with kolla-ansible, using container images (CentOS 8 Stream + OpenStack source) built with kolla. The relevant versions are:
Python neutron 19.0.1.dev8
Python ovs 2.13.3
Python ovsdbapp 1.12.0
haproxy 1.8.27-2.el8
openvswitch2.15-2.15.0-41.el8s

neutron_ovn_metadata_agent.ini:
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 10

[ovn]
ovn_nb_connection = tcp:XXX:6641,tcp:YYY:6641,tcp:ZZZ:6641
ovn_sb_connection = tcp:XXX:6642,tcp:YYY:6642,tcp:ZZZ:6642
ovn_metadata_enabled = true

We have literally no logs other than the OVN disconnection and reconnection lines:
2021-12-08 09:11:21.030 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: waiting 2 seconds before reconnect
2021-12-08 09:11:21.033 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
2021-12-08 09:11:21.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
2021-12-08 09:11:22.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
2021-12-08 09:11:22.035 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected
2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected

Revision history for this message
Nikolay Fedorov (jingvar) wrote :

We have the same issue.

Revision history for this message
Igor (aigor) wrote :

Same here, but in Wallaby

Revision history for this message
Javier Cacheiro (javicacheiro) wrote :

We were experiencing the same issue both in Xena and previously in Wallaby.

In our case, increasing the ovsdb_timeout seems to have fixed it:

    ovsdb_timeout: 30

To add some info that can complement the detailed analysis provided by Gabriel:

Inside the neutron_ovn_metadata_agent container, the metadata agent launches a new haproxy for each network that is required on the host. Each haproxy process is responsible for forwarding the requests from that network to the Python neutron-ovn-metadata-agent process running in the container.

When the haproxy goes down for some network, the machines in this network are not able to contact the metadata service, but you do not see anything in the logs because there is no connection to the Python neutron-ovn-metadata-agent process (which is the one actually generating the logs).

Once you restart the container, it relaunches all the required haproxy processes so that the virtual machines on the affected networks can access the metadata service again.

So the key of the issue is the namespace generated for this network and the haproxy that connects this namespace to the Python neutron-ovn-metadata-agent.

As Gabriel points out, it seems that the ovsdb SB is involved in some way, because after increasing the ovsdb_timeout we have not experienced this issue again in the last few weeks.
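
A quick test that does not rely on the agent logs is to query the proxy from inside the network's namespace on the hypervisor (the ovnmeta- prefix and the 169.254.169.254:80 listener are assumptions about how the agent sets things up; the UUID is a placeholder). Any HTTP status code back means the haproxy is alive; a connection error means it is gone:

    ip netns exec ovnmeta-<network-uuid> \
        curl -s -o /dev/null -w '%{http_code}\n' http://169.254.169.254/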

Revision history for this message
Satish Patel (satish-txt) wrote :

I'm having the same issue with CentOS 8 Stream / Wallaby / kolla-ansible.

I have tried the following: added ovsdb_timeout=30 to the ovn_meta config file, but I am still seeing the issue from time to time. I have also set up a cronjob to hide this issue, but it looks like this is a real issue. Why don't we keep haproxy running all the time instead of spawning it when a VM gets created? I believe openstack-ansible takes that approach, with a dedicated haproxy namespace running for this task.

# cat /etc/neutron/neutron_ovn_metadata_agent.ini
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 30

Revision history for this message
Markus Lindenblatt (0-markus) wrote :

We are also affected in all of the environments we are running. The most recent one is running the Wallaby release. We tried raising ovsdb_timeout to 30, but that is not the solution for us.
Yesterday morning we ran into the problem again after a leader change of ovn-sb-db:

from neutron-ovn-metadata-agent.log:
2022-03-09 07:58:23.803 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.804 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.806 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.806 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: connection closed by client
2022-03-09 07:58:23.807 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: connection closed by client
2022-03-09 07:58:23.808 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.72:6642: connection closed by client
2022-03-09 07:58:23.808 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connecting...
2022-03-09 07:58:23.809 25 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connected
2022-03-09 07:58:23.809 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connecting...
2022-03-09 07:58:23.809 24 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connected
2022-03-09 07:58:23.810 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: connecting...
2022-03-09 07:58:23.810 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: connected
2022-03-09 07:58:23.837 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: clustered database server is not cluster leader; trying another server
2022-03-09 07:58:23.839 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.74:6642: connection closed by client
2022-03-09 07:58:24.840 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connecting...
2022-03-09 07:58:24.841 8 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:172.16.4.73:6642: connected

After the reconnect to the new leader, the metadata service does not deliver any metadata to new instances and gives no further log entries.

After restarting the metadata agents they work again, until the next leader change of ovn-sb-db.

We are using neutron-ovn-metadata-agent version 18.1.2.dev77

Revision history for this message
Ivan Zhang (sail4dream) wrote :

Is this the same bug described in https://bugs.launchpad.net/neutron/+bug/1952550?

Revision history for this message
Tiago Pires (tiagohp) wrote :

Hi all,

I have a similar issue with Ussuri as well. In my case ovsdb_timeout is 180, and I'm seeing the following logs on the compute node side every time the OVN central cluster changes leader because of the snapshot process. It breaks the SB connections with the metadata agents, and a full sync is done again after the metadata agent finds the new leader:

May 11 21:05:11 <ommited> neutron-ovn-metadata-agent[1599694]: 2022-05-11 21:05:11.316 1599694 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.2X.3X.25:6642: clustered database server is not cluster leader; trying another server
May 11 21:05:11 <ommited> neutron-ovn-metadata-agent[1599699]: 2022-05-11 21:05:11.318 1599699 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.2X.3X.25:6642: clustered database server is not cluster leader; trying another server
May 11 21:05:11 <ommited> neutron-ovn-metadata-agent[1599694]: 2022-05-11 21:05:11.318 1599694 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.2X.3X.25:6642: connection closed by client
<ommited>
May 11 21:05:21 <ommited> neutron-ovn-metadata-agent[1599679]: 2022-05-11 21:05:21.332 1599679 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python3/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:124
May 11 21:05:21 <ommited> neutron-ovn-metadata-agent[1599679]: 2022-05-11 21:05:21.339 1599679 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync
May 11 21:05:21 <ommited> neutron-ovn-metadata-agent[1599679]: 2022-05-11 21:05:21.354 1599679 DEBUG neutron.agent.ovn.metadata.agent [-] Provisioning metadata for network 46cc279b-a6fc-41d8-b4f9-d161bf7f9ef4 provision_datapath /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:392

Does someone have a workaround?

More about the metadata agent process here:
https://docs.openstack.org/networking-ovn/latest/contributor/design/metadata_api.html#:~:text=neutron%2Dovn%2Dmetadata%2Dagent%20is%20the%20process%20that%20will,reach%20the%20appropriate%20host%20network.

Tiago Pires

Revision history for this message
Nikolay Fedorov (jingvar) wrote :

We use Xena with
ovs 2.17.1.post1
ovsdbapp 1.16.0
and have a new problem:
when a RAFT leader change happens, neutron-server (the ML2 plugin) gets an ovsdb timeout (180 s by default).
It seems it doesn't catch the leader change and doesn't reconnect.

Revision history for this message
Satish Patel (satish-txt) wrote :

Why is this bug still undecided even after so many folks have shared the same story?

Revision history for this message
Alan Baghumian (alanbach) wrote :

I have personally experienced this with my OpenStack Focal/Xena based cluster and have another client reporting the same issue.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed