[ovn] neutron api worker gets overloaded processing chassis_private updates

Bug #1940950 reported by Krzysztof Klimonda
This bug affects 1 person
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

This was tested with stable/ussuri branch with https://review.opendev.org/c/openstack/neutron/+/752795/ backported.

The test setup was 3 controllers, each with 10 API workers and RPC workers, plus 250 chassis running ovn-controller. There are 1k networks and 10k ports in total (4k VM ports, 2k ports for FIPs, 4k ports for routers), 1k routers connected to the same external network, and 2k VMs (2 VMs per network, with all VMs additionally connected to a single shared network between them). The northbound DB is 15 MB and the southbound DB is 100 MB.

When a change is made in neutron, an update is made in OVN and the NB_Global.nb_cfg field is incremented. This translates into an SB_Global.nb_cfg change, which is picked up by all ovn-controllers, which in turn update their entries in Chassis_Private, incrementing Chassis_Private.nb_cfg.
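For illustration, a rough sketch of what a single nb_cfg bump amounts to on the neutron side, assuming an ovsdbapp-style NB API handle (this is illustrative, not the actual neutron code path):

def bump_nb_cfg(nb_idl):
    """Illustrative only: bump NB_Global.nb_cfg via generic ovsdbapp commands."""
    nb_global = nb_idl.db_list_rows('NB_Global').execute(check_error=True)[0]
    nb_idl.db_set('NB_Global', nb_global.uuid,
                  ('nb_cfg', nb_global.nb_cfg + 1)).execute(check_error=True)
    # ovn-northd copies nb_cfg into SB_Global, each of the 250 ovn-controllers
    # then writes its own Chassis_Private.nb_cfg, and every one of those rows
    # comes back to neutron as an ovsdb update notification.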

After that, the southbound ovsdb sends an update to neutron, due either to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to dispatch the update to a worker.
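As a simplified illustration of that dispatch step (not the actual neutron hash-ring code, just the general idea): the UUID of the updated row is hashed onto a ring of registered worker nodes, and only the worker that owns that slice of the ring processes the event.

import bisect
import hashlib


class SimpleHashRing:
    """Toy consistent-hash ring; illustrative, not neutron's implementation."""

    def __init__(self, nodes):
        # Place each worker node on the ring at the hash of its name.
        self._ring = sorted(
            (int(hashlib.md5(n.encode()).hexdigest(), 16), n) for n in nodes)
        self._keys = [k for k, _ in self._ring]

    def get_node(self, key):
        # The first node clockwise from the key's hash owns the event.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self._keys, h) % len(self._ring)
        return self._ring[idx][1]


ring = SimpleHashRing(['api-worker-1', 'api-worker-2', 'api-worker-3'])
ring.get_node('a1b2c3d4-row-uuid')  # -> the single worker that handles this event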

In my testing, when that happened, all neutron API workers stopped processing API requests until all Chassis_Private events were handled, which took around 30 seconds for each nb_cfg update. This could be because the controller nodes in the test environment were not scaled up properly, but it seems to be a potential scaling issue.

Tags: ovn
Terry Wilson (otherwiseguy) wrote:

> When a change is made in neutron, an update is made in OVN and the NB_Global.nb_cfg field is incremented.

This *shouldn't* be the case. We should only be updating nb_cfg when we call ping_all_chassis() which is called from get_agents(). By default bump_nb_cfg is set to False: https://github.com/openstack/neutron/blob/19372a3cd8b4e6e45f707753b914e133857dd629/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py#L67

Is it possible that you have a low agent_down_time setting and we are pinging very frequently?

If we are bumping nb_cfg other places (or somehow calling ping_all_chassis more frequently than agent_down_time / 2), then that is definitely an additional bug.

> After that, the southbound ovsdb sends an update to neutron, due either to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to dispatch the update to a worker.

The events will be sent to us regardless of the neutron code, as long as we have the table/column registered. When we register those events locally, it is basically just a filter that says "when you get this event, do something with it". Every incoming update is tested against all registered events for a match, regardless of these specific events.
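In other words, the local registration is just a per-handler filter, something along these lines (simplified, not the actual ovsdbapp/neutron classes):

class RowEventFilter:
    """Simplified stand-in for an ovsdbapp-style row event."""

    def __init__(self, events, table):
        self.events = events          # e.g. ('create', 'update')
        self.table = table            # e.g. 'Chassis_Private'

    def matches(self, event, table, row, old):
        return event in self.events and table == self.table

    def run(self, event, row, old):
        pass                          # the actual work happens here


def notify(handlers, event, table, row, old):
    # Every update that arrives from the SB ovsdb is tested against every
    # registered handler, whether or not any of them ends up matching.
    for handler in handlers:
        if handler.matches(event, table, row, old):
            handler.run(event, row, old)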

The ChassisAgentDeleteEvent, which matches the SB_Global table, will look at the updated row, check whether it is a specific external_ids update, and if not, return immediately. Also, we should *only* be sent SB_Global events that are external_ids changes, because of https://github.com/openstack/neutron/blob/19372a3cd8b4e6e45f707753b914e133857dd629/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L664, so I wouldn't expect this to generate much load.
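That early-return check has roughly this shape (the external_ids key below is a placeholder, not the key the real event looks at):

def match_fn(event, row, old):
    # Only act when the external_ids delta touches the key we care about;
    # 'some-agent-key' is hypothetical, not the actual key name.
    if not hasattr(old, 'external_ids'):
        return False  # external_ids was not part of this update
    return (row.external_ids.get('some-agent-key') !=
            old.external_ids.get('some-agent-key'))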

ChassisAgentWriteEvent will match on Chassis_Private and will update the agent for each row created or with an updated nb_cfg--which, with 250 agents, could take a bit, but 30 seconds would be weird. These events are actually not processed by the Hash Ring because they are "global" events, so they are processed on each server (because they have to update an internally cached agent object).

So we need to find out whether each update really is somehow bumping nb_cfg--because that *shouldn't* be happening. It looks like the segments code and the trunk port validation code can both call get_agents() a lot, which is a bit worrying. But even so, the ping_all_chassis code *should* look at the timestamp of the last ping and not ping again if it was within the agent_down_time window.
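The throttle being described is essentially this (a sketch only, not the actual ping_all_chassis code):

import time

_last_ping = 0.0


def maybe_ping_all_chassis(bump_nb_cfg, interval):
    """Sketch: skip the nb_cfg bump if the previous ping is recent enough.

    'interval' would be derived from agent_down_time, and bump_nb_cfg
    stands in for the real NB_Global.nb_cfg increment.
    """
    global _last_ping
    now = time.monotonic()
    if now - _last_ping < interval:
        return False  # last ping still within the window; do nothing
    _last_ping = now
    bump_nb_cfg()  # triggers the SB_Global/Chassis_Private fan-out
    return True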

In any case, 30 seconds of processing, even if it only happened on pings (which should be every agent_down_time / 2 seconds, ~37s by default), isn't going to be acceptable. I'm happy to look into it, I just don't happen to have 250 machines sitting nearby...

Krzysztof Klimonda (kklimonda) wrote:

This indeed seems to be related to agent_down_time being at its default value.

What I tested was creating new ports, which calls ping_all_chassis() with the following stacktrace:

2021-08-24 19:12:09.015 30 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-f7fed993-0b19-4a89-ae9a-eda1c6d1f4e4 b8b76d39176a400d8652a013a5da1b05 78368d2238ea45b28e818918b1da7efd - - -] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/green/thread.py", line 42, in __thread_body
File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/green/thread.py", line 63, in wrap_bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
File "/var/lib/kolla/venv/lib/python3.6/site-packages/futurist/_thread.py", line 122, in run
File "/var/lib/kolla/venv/lib/python3.6/site-packages/futurist/_utils.py", line 52, in run
File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 359, in inner
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/ovo_rpc.py", line 120, in dispatch_events
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/handlers/resources_rpc.py", line 245, in push
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/handlers/resources_rpc.py", line 251, in _push
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 250, in get_resource_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 226, in get_resource_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 222, in _check_expiration
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/rpc/callbacks/version_manager.py", line 211, in _update_consumer_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agents_db.py", line 466, in get_agents_resource_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agents_db.py", line 453, in _get_agents_considered_for_versions
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1007, in fn
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1072, in get_agents
File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1040, in ping_all_chassis

So that's not every API call, but because the Chassis_Private updates were increasing the load on neutron, it quickly turned into an update on almost every port creation.

Now for the second part, I've added the following patch to see how long calls to self.notify_handler.notify() take:

diff --git a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
index 293eba7605..335e6ea696 100644
--- a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
+++ b/neutron/plugins/ml2/drivers/ovn/mech_driver/ov...

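The patch itself is cut off above; purely as an illustration of that kind of instrumentation (not the actual patch), timing a notify() call could look something like this:

import time

from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def timed_notify(notify_handler, event, row, updates=None):
    """Hypothetical wrapper: log how long one notify() call takes."""
    start = time.monotonic()
    notify_handler.notify(event, row, updates)
    LOG.debug('notify(%s, %s) took %.3fs',
              event, row._table.name, time.monotonic() - start)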
