neutron-ovn-metadata-agent does not respond on network until restarted after SB disconnects
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | New | Undecided | Unassigned |
networking-ovn | New | Undecided | Unassigned |
neutron (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Hi,
We are running OpenStack Xena in a highly dynamic environment with a few hundred tenant networks and projects. OVN is deployed as a 3-node cluster hosting the northbound and southbound databases as well as the northd daemon.
With a few hundred instances running, we started to notice that newly started instances get their DHCP information via the OVN OpenFlow rules, but cloud-init does not apply the initial configuration. This is easy to recognize in the console log: the prompt at the end shows that the instance has neither its name nor its SSH public key configured.
After some investigation we noticed that on these instances the neutron-
It seems the instance metadata server is not reachable specifically when we are starting an instance in a new network/subnet.
We then enabled debug logging on the metadata agent and only noticed that the agents are disconnected from the SB DB and then reconnect immediately, without any additional relevant log messages.
We first looked at our OVN cluster status and noticed that the cluster was flapping very frequently (changing NB, SB and northd leaders). We fixed that by adjusting the inactivity probe and election timers.
Since then the OVN cluster has been pretty stable and only changes leader (and increments the term) when the SB leader voluntarily transfers leadership to take a snapshot of the database, every few hours according to the "Last election started" timer.
It seems the neutron-
My understanding is that the metadata agent normally creates a new haproxy instance in a dedicated namespace on the host where the instance is created, but it fails to do so once it has been disconnected from the SB DB, even after reconnecting (almost instantly) to the new SB leader.
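For anyone trying to confirm the same symptom, here is a minimal Python sketch that checks whether a metadata namespace exists for a given network on a compute host. It assumes the namespaces are named `ovnmeta-<network UUID>` and that the input is the text printed by `ip netns list` on that host; the function names are hypothetical helpers, not part of neutron.

```python
NS_PREFIX = "ovnmeta-"  # assumed namespace prefix used by the OVN metadata agent


def metadata_namespaces(ip_netns_output):
    """Return the set of network IDs that have an ovnmeta- namespace.

    `ip_netns_output` is the raw text of `ip netns list`, whose lines look
    like "ovnmeta-<uuid> (id: 3)" or just "ovnmeta-<uuid>".
    """
    network_ids = set()
    for line in ip_netns_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        if name.startswith(NS_PREFIX):
            network_ids.add(name[len(NS_PREFIX):])
    return network_ids


def missing_namespaces(ip_netns_output, expected_network_ids):
    """Return the expected network IDs that have no metadata namespace."""
    present = metadata_namespaces(ip_netns_output)
    return sorted(set(expected_network_ids) - present)
```

If a network that has a running instance on the host shows up in `missing_namespaces`, the agent has most likely not provisioned the haproxy namespace for it, which matches the behaviour described above.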
The real issue here is that there are no logs other than the usual disconnects/
When this occurs, "openstack network agent list" also reports the agent as down. For now we have had no other choice than to restart the metadata agent every 5 minutes to make this issue somewhat invisible to our end users.
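Rather than restarting every agent on a timer, the agent list could be used to find only the hosts whose metadata agent is reported down. The sketch below is an illustration under assumptions: it expects the JSON printed by `openstack network agent list -f json`, assumes the keys `Host`, `Agent Type` and `Alive`, and assumes the agent type string is `OVN Metadata agent`. The helper names are hypothetical.

```python
import json


def _is_alive(value):
    """Interpret the "Alive" column, which may be a boolean or ":-)" / "XXX"."""
    if isinstance(value, bool):
        return value
    return str(value).strip() == ":-)"


def down_metadata_agents(agent_list_json, agent_type="OVN Metadata agent"):
    """Return the sorted host names whose metadata agent is not alive.

    `agent_list_json` is the text output of
    `openstack network agent list -f json` (key names are assumptions).
    """
    agents = json.loads(agent_list_json)
    return sorted(
        agent["Host"]
        for agent in agents
        if agent.get("Agent Type") == agent_type
        and not _is_alive(agent.get("Alive"))
    )
```

The returned host list can then drive a targeted restart on just those hosts, which also gives a rough measure of how often the problem occurs.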
Is anyone else seeing this issue?
We are running OpenStack Xena deployed with kolla-ansible, using container images (CentOS 8 Stream + OpenStack source) built with kolla. The relevant versions are:
Python neutron 19.0.1.dev8
Python ovs 2.13.3
Python ovsdbapp 1.12.0
haproxy 1.8.27-2.el8
openvswitch2.
neutron_
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 10
[ovn]
ovn_nb_connection = tcp:XXX:
ovn_sb_connection = tcp:XXX:
ovn_metadata_
We literally have no logs other than the OVN disconnection and reconnection lines:
2021-12-08 09:11:21.030 23 INFO ovsdbapp.
2021-12-08 09:11:21.032 23 INFO ovsdbapp.
2021-12-08 09:11:21.032 23 INFO ovsdbapp.
2021-12-08 09:11:21.033 7 INFO ovsdbapp.
2021-12-08 09:11:21.034 7 INFO ovsdbapp.
2021-12-08 09:11:22.034 7 INFO ovsdbapp.
2021-12-08 09:11:22.035 7 INFO ovsdbapp.
2021-12-08 09:11:23.034 23 INFO ovsdbapp.
2021-12-08 09:11:23.034 23 INFO ovsdbapp.
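The excerpt above shows two process IDs (23 and 7) logging ovsdbapp events within a couple of seconds. To correlate these bursts with SB leadership transfers over a longer log, a small Python sketch like the following could group the ovsdbapp lines by process. It assumes only the line prefix visible above (timestamp, PID, level, logger name); the function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the visible prefix: "<date> <time.ms> <pid> <LEVEL> <logger>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\d+) (\w+) (\S+)"
)


def ovsdbapp_events_by_pid(lines):
    """Map each PID to the timestamps of its ovsdbapp log lines."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a log line in the expected format
        stamp, pid, _level, logger = match.groups()
        if logger.startswith("ovsdbapp"):
            events[int(pid)].append(
                datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
            )
    return dict(events)


def flap_window_seconds(timestamps):
    """Seconds between the first and last event in a burst."""
    return (max(timestamps) - min(timestamps)).total_seconds()
```

Comparing the start of each burst with the "Last election started" time on the SB cluster would show whether every leadership transfer is followed by an agent that stops serving metadata.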
we have the same.