[ovn] neutron api worker gets overloaded processing chassis_private updates
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Undecided | Unassigned |
Bug Description
This was tested with the stable/ussuri branch with https:/
The test setup was 3 controllers, each with 10 API workers and RPC workers, and 250 chassis running ovn-controller. There are 1k networks and 10k ports in total (4k VM ports, 2k FIP ports, 4k router ports), 1k routers connected to the same external network, and 2k VMs (2 VMs per network, with all VMs additionally connected to a single shared network between them). The Northbound DB is 15 MB and the Southbound DB is 100 MB.
When a change is made in Neutron, an update in OVN is created and the NB_Global.nb_cfg field is incremented. This translates into an SB_Global.nb_cfg change, which is picked up by all ovn-controllers, which in turn update their own entry in Chassis_Private, incrementing Chassis_Private.nb_cfg.
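To illustrate the fan-out, here is a minimal, self-contained sketch (illustrative names only, not the Neutron or OVN code) of the chain described above: one NB_Global.nb_cfg bump ends up producing one Chassis_Private update per chassis.

```python
# Illustrative sketch of the nb_cfg propagation chain; all names here are
# hypothetical, this is not the Neutron/OVN implementation.
from dataclasses import dataclass


@dataclass
class ChassisPrivate:
    name: str
    nb_cfg: int = 0


class FakeOvn:
    """Models NB_Global.nb_cfg -> SB_Global.nb_cfg -> Chassis_Private.nb_cfg."""

    def __init__(self, num_chassis: int):
        self.nb_global_nb_cfg = 0
        self.sb_global_nb_cfg = 0
        self.chassis = [ChassisPrivate(f"chassis-{i}") for i in range(num_chassis)]

    def bump_nb_cfg(self) -> int:
        # Neutron increments NB_Global.nb_cfg; ovn-northd copies it to
        # SB_Global.nb_cfg; every ovn-controller then writes its own
        # Chassis_Private.nb_cfg, producing one SB update per chassis.
        self.nb_global_nb_cfg += 1
        self.sb_global_nb_cfg = self.nb_global_nb_cfg
        for row in self.chassis:
            row.nb_cfg = self.sb_global_nb_cfg
        return len(self.chassis)  # number of Chassis_Private updates emitted


ovn = FakeOvn(num_chassis=250)
print(ovn.bump_nb_cfg())  # -> 250 updates for a single nb_cfg bump
```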
After that, the southbound ovsdb sends an update to Neutron either due to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to send the update to the worker.
In my testing, when that happened, all Neutron API workers stopped processing API requests until all Chassis_Private events were handled, which took around 30 seconds on each nb_cfg update. This could be because the controller nodes in the test environment were not scaled up properly, but it seems to be a potential scaling issue.
tags: added: ovn
> When a change is made in Neutron, an update in OVN is created and the NB_Global.nb_cfg field is incremented.
This *shouldn't* be the case. We should only be updating nb_cfg when we call ping_all_chassis(), which is called from get_agents(). By default bump_nb_cfg is set to False: https://github.com/openstack/neutron/blob/19372a3cd8b4e6e45f707753b914e133857dd629/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py#L67
Is it possible that you have a low agent_down_time setting and we are pinging very frequently?
If we are bumping nb_cfg in other places (or somehow calling ping_all_chassis more frequently than agent_down_time / 2), then that is definitely an additional bug.
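As a rough sketch of the gating I would expect around those pings (illustrative only, this is not the actual ping_all_chassis implementation; the names maybe_ping_all_chassis and _last_ping are made up):

```python
# Hedged sketch of the expected rate limiting around nb_cfg bumps; the
# function and variable names are illustrative, not the Neutron code.
import time

AGENT_DOWN_TIME = 75  # seconds; Neutron's default agent_down_time

_last_ping = 0.0


def maybe_ping_all_chassis(bump_nb_cfg_fn) -> bool:
    """Bump nb_cfg (pinging all chassis) at most once per agent_down_time / 2.

    get_agents() may be called very frequently (e.g. by the segments or
    trunk validation code), so this timestamp check is what should keep the
    southbound DB from being flooded with Chassis_Private updates.
    """
    global _last_ping
    now = time.monotonic()
    if now - _last_ping < AGENT_DOWN_TIME / 2:
        return False  # pinged recently enough, do not bump nb_cfg
    _last_ping = now
    bump_nb_cfg_fn()
    return True
```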
> After that, the southbound ovsdb sends an update to Neutron either due to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to send the update to the worker.
The events will be sent to us regardless of the Neutron code, as long as we have the table/column registered. Registering those events locally is basically just a filter that says "when you get this event, do something with it"; every incoming update is tested against all of the registered events regardless.
The ChassisAgentDeleteEvent, which matches the SB_Global table, will look at the updated row, check whether it is a specific external_ids update and, if not, return immediately. Also, we should *only* be sent SB_Global events that are external_ids changes, because of https://github.com/openstack/neutron/blob/19372a3cd8b4e6e45f707753b914e133857dd629/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L664, so I wouldn't expect this to generate much load.
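A hedged sketch of that kind of early-return filter (hypothetical function name, not the actual event class in ovsdb_monitor.py):

```python
# Illustrative match filter for SB_Global updates: anything that is not an
# external_ids change is rejected immediately, so these events should be
# cheap. This is a hypothetical stand-in, not ChassisAgentDeleteEvent itself.
def matches_agent_delete(row, old) -> bool:
    # Only external_ids changes are interesting; any other SB_Global update
    # is discarded right here without doing further work.
    if not hasattr(old, 'external_ids'):
        return False
    # A real event would also check for a specific key change inside
    # external_ids before doing anything expensive.
    return old.external_ids != row.external_ids
```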
ChassisAgentWriteEvent will match on the Chassis_Private table and will update the agent for each row that is created or has an updated nb_cfg, which, if there are 250 agents, could take a bit, but 30 seconds would be weird. These events are actually not processed by the HashRing because they are "global" events, so they are processed on each server (because they have to update an internally cached agent object).
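To make the dispatch difference concrete, here is a simplified, purely illustrative model (not the actual hash-ring code, and the worker pool size is just an example) of why global events multiply the work: a hash-ring event is handled by exactly one worker, while a global event is handled by every worker that keeps its own agent cache:

```python
# Simplified, illustrative dispatch model; not the real hash-ring or event
# handling code from the OVN mech driver.
import hashlib


def dispatch(event_id: str, is_global: bool, workers: list[str]) -> list[str]:
    """Return the workers that will process this event."""
    if is_global:
        # Global events (e.g. Chassis_Private/agent updates) are handled by
        # every worker, since each one maintains its own cached agent objects.
        return list(workers)
    # Hash-ring events are handled by exactly one worker.
    idx = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % len(workers)
    return [workers[idx]]


workers = [f"api-worker-{i}" for i in range(10)]
# 250 chassis bumping nb_cfg -> 250 global events, each handled by all
# 10 workers in this illustrative pool:
total = sum(len(dispatch(f"chassis-{i}", True, workers)) for i in range(250))
print(total)  # -> 2500 event handler runs for a single nb_cfg bump
```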
So we need to find out if each update really is somehow bumping nb_cfg, because that *shouldn't* be happening. It looks like the segments code and the trunk port validation code can both call get_agents() a lot, which is a bit worrying. But even so, the ping_all_chassis code *should* be looking at the timestamp of the last ping and not pinging if it was within the agent_down_time window.
In any case, 30 seconds of processing, even if it only happened on pings (which should be every agent_down_time / 2 seconds, ~37s by default), isn't going to be acceptable. I'm happy to look into it, I just don't happen to have 250 machines sitting nearby...