This indeed seems to be related to agent_down_time being left at its default value. What I tested was creating new ports, which ends up calling ping_all_chassis() with the following stack trace:

2021-08-24 19:12:09.015 30 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-f7fed993-0b19-4a89-ae9a-eda1c6d1f4e4 b8b76d39176a400d8652a013a5da1b05 78368d2238ea45b28e818918b1da7efd - - -]
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/green/thread.py", line 42, in __thread_body
  File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/green/thread.py", line 63, in wrap_bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/futurist/_thread.py", line 122, in run
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/futurist/_utils.py", line 52, in run
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 359, in inner
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/ovo_rpc.py", line 120, in dispatch_events
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/handlers/resources_rpc.py", line 245, in push
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/handlers/resources_rpc.py", line 251, in _push
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 250, in get_resource_versions
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 226, in get_resource_versions
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 222, in _check_expiration
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 211, in _update_consumer_versions
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agents_db.py", line 466, in get_agents_resource_versions
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agents_db.py", line 453, in _get_agents_considered_for_versions
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1007, in fn
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1072, in get_agents
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1040, in ping_all_chassis

So that's not every API call, but because the Chassis_Private updates themselves increase the load on neutron, it quickly turned into an update on almost every port creation.
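If I read the driver code correctly, the reason this fires only occasionally rather than on every single API call is that ping_all_chassis() throttles itself based on agent_down_time. A minimal sketch of that throttling idea (not the actual neutron code; do_ping and the module-level _last_ping are placeholders for state the real driver keeps in the NB database):

import datetime

AGENT_DOWN_TIME = 75  # neutron's default agent_down_time, in seconds

_last_ping = None  # placeholder: the real driver keeps this in the NB database


def ping_all_chassis_throttled(do_ping):
    """Run do_ping() at most once per AGENT_DOWN_TIME / 2 seconds."""
    global _last_ping
    now = datetime.datetime.utcnow()
    interval = datetime.timedelta(seconds=max(AGENT_DOWN_TIME // 2, 1))
    if _last_ping is not None and now - _last_ping < interval:
        return False  # throttled, no ping this time
    _last_ping = now
    do_ping()  # in the real driver this bumps nb_cfg, which every chassis then acks
    return True

With the 75 second default that is a ping roughly every ~37 seconds, and, as far as I understand, each ping makes every chassis update its Chassis_Private row, which is where the extra load on neutron comes from.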
Now for the second part, I've added the following patch to see how long the calls to self.notify_handler.notify() take:

diff --git a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
index 293eba7605..335e6ea696 100644
--- a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
+++ b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
@@ -14,6 +14,7 @@
 import abc
 import datetime
+import timeit

 from neutron_lib import context as neutron_context
 from neutron_lib.plugins import constants
@@ -571,7 +572,10 @@ class OvnIdlDistributedLock(BaseOvnIdl):
                   '(table: %(table)s)',
                   {'node': self._node_uuid, 'hostname': CONF.host,
                    'event': event, 'row': row.uuid, 'table': row._table.name})
+        start = timeit.default_timer()
         self.notify_handler.notify(event, row, updates)
+        LOG.info("KK: processing time for table %(table)s: %(seconds)s",
+                 {"table": row._table.name, "seconds": timeit.default_timer() - start})

     @abc.abstractmethod
     def post_connect(self):

This is a snapshot from one of the servers:

KK: processing time for table Chassis_Private: 4.142499528825283e-05
KK: processing time for table Chassis_Private: 4.569196607917547e-05
KK: processing time for table Chassis_Private: 5.8398989494889975e-05
KK: processing time for table Chassis_Private: 4.1155959479510784e-05
KK: processing time for table Chassis_Private: 6.496300920844078e-05
KK: processing time for table Chassis_Private: 4.8479996621608734e-05
KK: processing time for table Chassis_Private: 5.863799015060067e-05
KK: processing time for table Chassis_Private: 3.99000127799809e-05
KK: processing time for table Chassis_Private: 6.0677994042634964e-05
KK: processing time for table Chassis_Private: 3.629102138802409e-05
KK: processing time for table Chassis_Private: 4.8089015763252974e-05
KK: processing time for table Chassis_Private: 3.909901715815067e-05
KK: processing time for table Chassis_Private: 8.133298251777887e-05
KK: processing time for table Chassis_Private: 2.78400257229805e-05
KK: processing time for table Chassis_Private: 5.877105286344886e-05
KK: processing time for table Chassis_Private: 4.8115034587681293e-05
KK: processing time for table Chassis_Private: 5.627400241792202e-05
KK: processing time for table Chassis_Private: 2.924696309491992e-05
KK: processing time for table Chassis_Private: 8.873397018760443e-05
KK: processing time for table Port_Binding: 5.023000994697213e-05
KK: processing time for table Chassis_Private: 3.7371995858848095e-05
KK: processing time for table Chassis_Private: 0.00013784202747046947
KK: processing time for table Chassis_Private: 5.2426010370254517e-05
KK: processing time for table Port_Group: 2.144702011719346e-05
KK: processing time for table Logical_Switch: 2.6123016141355038e-05
KK: processing time for table Logical_Switch_Port: 1.9197002984583378e-05
KK: processing time for table Chassis_Private: 0.00011140096466988325
KK: processing time for table Chassis_Private: 6.18079793639481e-05
KK: processing time for table Chassis_Private: 5.5163982324302197e-05
KK: processing time for table Chassis_Private: 2.8703012503683567e-05
KK: processing time for table Chassis_Private: 3.352697240188718e-05
KK: processing time for table Chassis_Private: 4.7340989112854004e-05

So somehow the hash-ring-related code running in the API worker takes a fairly long time sending notifications on table updates. Is this an asynchronous call, or does it wait for the receiver to process the notification?
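If the notify handler follows the usual queue-plus-worker-thread pattern (which is how I understand ovsdbapp's RowEventHandler to behave, though I haven't verified it here), then notify() only enqueues the matched events and returns immediately, and the numbers above only measure the cost of the enqueue, not the actual event processing. A rough sketch of that pattern, with made-up class and method names:

import queue
import threading


class AsyncNotifyHandler:
    """Illustration only: notify() just enqueues, a worker thread does the work."""

    def __init__(self):
        self._notifications = queue.Queue()
        worker = threading.Thread(target=self._notify_loop, daemon=True)
        worker.start()

    def notify(self, event, row, updates=None):
        # Returns almost immediately; the caller only pays for a queue.put(),
        # which would match the ~5e-05 s figures above.
        self._notifications.put((event, row, updates))

    def _notify_loop(self):
        # The heavy lifting (e.g. updating the AgentCache) would happen here,
        # in a separate thread, invisible to the timing around notify().
        while True:
            event, row, updates = self._notifications.get()
            self._process(event, row, updates)

    def _process(self, event, row, updates):
        pass  # placeholder for the per-event handling

If that is the case, the measurement above mostly proves that enqueueing is cheap, and the interesting time would be spent in whatever consumes the queue.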
Even if that were the case, AgentCache processing should not take that much time in general, unless the entire neutron-server is heavily overloaded?
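One way to check would be to put the same kind of timing on the consumer side, around the event handler itself, rather than around notify(). A minimal sketch, assuming a run(event, row, updates)-style interface on the event objects; timed_run is a made-up helper, not existing code:

import logging
import timeit

LOG = logging.getLogger(__name__)


def timed_run(event_obj, event, row, updates):
    """Call event_obj.run() and log how long the handler itself took."""
    start = timeit.default_timer()
    try:
        return event_obj.run(event, row, updates)
    finally:
        LOG.info("KK: consumer-side time for %s on table %s: %s",
                 event_obj.__class__.__name__, row._table.name,
                 timeit.default_timer() - start)

That would show whether the AgentCache updates themselves are slow or whether the events are simply queuing up behind an overloaded worker.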