Comment 2 for bug 1940950

Krzysztof Klimonda (kklimonda) wrote :

This indeed seems to be related to agent_down_time being left at its default value.

What I tested was creating new ports, which calls ping_all_chassis() with the following stack trace:

2021-08-24 19:12:09.015 30 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-f7fed993-0b19-4a89-ae9a-eda1c6d1f4e4 b8b76d39176a400d8652a013a5da1b05 78368d2238ea45b28e818918b1da7efd - - -] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/green/thread.py", line 42, in __thread_body
File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/green/thread.py", line 63, in wrap_bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
File "/var/lib/kolla/venv/lib/python3.6/site-packages/futurist/_thread.py", line 122, in run
File "/var/lib/kolla/venv/lib/python3.6/site-packages/futurist/_utils.py", line 52, in run
File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 359, in inner
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/ovo_rpc.py", line 120, in dispatch_events
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/handlers/resources_rpc.py", line 245, in push
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/handlers/resources_rpc.py", line 251, in _push
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 250, in get_resource_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 226, in get_resource_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 222, in _check_expiration
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 211, in _update_consumer_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agents_db.py", line 466, in get_agents_resource_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agents_db.py", line 453, in _get_agents_considered_for_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1007, in fn
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1072, in get_agents
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1040, in ping_all_chassis

So that's not every API call, but with neutron load increasing due to the Chassis_Private updates, it quickly turned into an update on almost every port creation.
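
If I understand the throttling correctly (an assumption on my part, I haven't traced the whole path), ping_all_chassis() only bumps nb_cfg when the previous liveness check is older than roughly agent_down_time / 2, so with the default value and a steady stream of API calls the bump keeps happening. A rough sketch of that logic, with illustrative names only (not the actual mech_driver code):

# Simplified sketch of the throttling I assume is at play (not the actual
# mech_driver code): the chassis ping is skipped while the previous liveness
# check is newer than ~agent_down_time / 2.
import datetime

AGENT_DOWN_TIME = 75  # neutron default, in seconds


def should_ping_all_chassis(last_ping, now=None):
    """Return True if a new liveness check (nb_cfg bump) should be issued."""
    now = now or datetime.datetime.utcnow()
    if last_ping is None:
        return True
    interval = max(AGENT_DOWN_TIME // 2, 1)
    return now >= last_ping + datetime.timedelta(seconds=interval)


if __name__ == "__main__":
    last = datetime.datetime.utcnow() - datetime.timedelta(seconds=40)
    print(should_ping_all_chassis(last))  # True: 40s > 75 // 2 seconds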

Now for the second part, I've added the following patch to see how long calls to self.notify_handler.notify() take:

diff --git a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
index 293eba7605..335e6ea696 100644
--- a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
+++ b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
@@ -14,6 +14,7 @@

 import abc
 import datetime
+import timeit

 from neutron_lib import context as neutron_context
 from neutron_lib.plugins import constants
@@ -571,7 +572,10 @@ class OvnIdlDistributedLock(BaseOvnIdl):
                   '(table: %(table)s)',
                   {'node': self._node_uuid, 'hostname': CONF.host,
                    'event': event, 'row': row.uuid, 'table': row._table.name})
+        start = timeit.default_timer()
         self.notify_handler.notify(event, row, updates)
+        LOG.info("KK: processing time for table %(table)s: %(seconds)s",
+                 {"table": row._table.name, "seconds": timeit.default_timer() - start})

     @abc.abstractmethod
     def post_connect(self):

This is a snapshot from one of the servers:

KK: processing time for table Chassis_Private: 4.142499528825283e-05
KK: processing time for table Chassis_Private: 4.569196607917547e-05
KK: processing time for table Chassis_Private: 5.8398989494889975e-05
KK: processing time for table Chassis_Private: 4.1155959479510784e-05
KK: processing time for table Chassis_Private: 6.496300920844078e-05
KK: processing time for table Chassis_Private: 4.8479996621608734e-05
KK: processing time for table Chassis_Private: 5.863799015060067e-05
KK: processing time for table Chassis_Private: 3.99000127799809e-05
KK: processing time for table Chassis_Private: 6.0677994042634964e-05
KK: processing time for table Chassis_Private: 3.629102138802409e-05
KK: processing time for table Chassis_Private: 4.8089015763252974e-05
KK: processing time for table Chassis_Private: 3.909901715815067e-05
KK: processing time for table Chassis_Private: 8.133298251777887e-05
KK: processing time for table Chassis_Private: 2.78400257229805e-05
KK: processing time for table Chassis_Private: 5.877105286344886e-05
KK: processing time for table Chassis_Private: 4.8115034587681293e-05
KK: processing time for table Chassis_Private: 5.627400241792202e-05
KK: processing time for table Chassis_Private: 2.924696309491992e-05
KK: processing time for table Chassis_Private: 8.873397018760443e-05
KK: processing time for table Port_Binding: 5.023000994697213e-05
KK: processing time for table Chassis_Private: 3.7371995858848095e-05
KK: processing time for table Chassis_Private: 0.00013784202747046947
KK: processing time for table Chassis_Private: 5.2426010370254517e-05
KK: processing time for table Port_Group: 2.144702011719346e-05
KK: processing time for table Logical_Switch: 2.6123016141355038e-05
KK: processing time for table Logical_Switch_Port: 1.9197002984583378e-05
KK: processing time for table Chassis_Private: 0.00011140096466988325
KK: processing time for table Chassis_Private: 6.18079793639481e-05
KK: processing time for table Chassis_Private: 5.5163982324302197e-05
KK: processing time for table Chassis_Private: 2.8703012503683567e-05
KK: processing time for table Chassis_Private: 3.352697240188718e-05
KK: processing time for table Chassis_Private: 4.7340989112854004e-05

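For reference, a throwaway script to aggregate those lines per table could look like this (not part of the patch; it just parses the log format above, assuming the "KK: processing time" lines have been grepped into a file):

# Quick helper to summarize the "KK: processing time" lines, e.g. after
#   grep 'KK: processing time' neutron-server.log > timings.txt
# Usage: python summarize_timings.py timings.txt
import re
import statistics
import sys
from collections import defaultdict

LINE_RE = re.compile(r'KK: processing time for table (\S+): (\S+)')

per_table = defaultdict(list)
with open(sys.argv[1]) as fh:
    for line in fh:
        match = LINE_RE.search(line)
        if match:
            per_table[match.group(1)].append(float(match.group(2)))

for table, times in sorted(per_table.items()):
    print("%s: count=%d mean=%.6fs max=%.6fs"
          % (table, len(times), statistics.mean(times), max(times)))
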
So somehow the hash-ring related code running in the API worker spends quite a lot of time sending notifications on table updates. Is this an asynchronous call, or does it wait for the receiver to process the notification? Even if it did wait, AgentCache processing should not take that much time in general unless the entire neutron server is heavily overloaded, should it?
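
To make the question concrete: my assumption (not verified against ovsdbapp) is that notify() only matches the row against the registered events and puts it on a queue that a separate worker thread drains, roughly like the sketch below. If that is the case, the numbers above only cover match-and-enqueue, not the actual AgentCache processing.

# Rough sketch of the queue-based dispatch I assume notify() does (hypothetical,
# not the real ovsdbapp RowEventHandler): the caller only pays for matching and
# enqueueing; the event handlers run later in a consumer thread.
import queue
import threading


class SketchNotifyHandler:
    def __init__(self):
        self._events = []                     # registered RowEvent-like objects
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def watch_event(self, event):
        self._events.append(event)

    def notify(self, event, row, updates=None):
        # Synchronous part: this is all the timeit patch above would measure.
        for matcher in self._events:
            if matcher.matches(event, row, updates):
                self._queue.put((matcher, event, row, updates))

    def _run(self):
        # Asynchronous part: AgentCache updates and the like would happen here.
        while True:
            matcher, event, row, updates = self._queue.get()
            matcher.run(event, row, updates)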