[ovn] neutron api worker gets overloaded processing chassis_private updates
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Undecided | Unassigned |
Bug Description
This was tested with the stable/ussuri branch with https:/
The test setup was 3 controllers, each with 10 API workers and RPC workers, and 250 chassis running ovn-controller. There are 1k networks and 10k ports in total (4k VM ports, 2k FIP ports, 4k router ports), 1k routers connected to the same external network, and 2k VMs (2 VMs per network, with all VMs additionally connected to a single shared network between them). The Northbound DB is 15 MB and the Southbound DB is 100 MB.
When a change is made in Neutron, an update in OVN is created and the NB_Global.nb_cfg field is incremented. This translates into an SB_Global.nb_cfg change, which is picked up by all ovn-controllers, which in turn update their own entry in Chassis_Private, incrementing Chassis_Private.nb_cfg.
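To illustrate the fan-out, here is a minimal, self-contained sketch (illustrative names only, not the Neutron or OVN code) of the chain described above: one NB_Global.nb_cfg bump ends up producing one Chassis_Private update per chassis.

```python
# Illustrative sketch of the nb_cfg propagation chain; all names here are
# hypothetical, this is not the Neutron/OVN implementation.
from dataclasses import dataclass


@dataclass
class ChassisPrivate:
    name: str
    nb_cfg: int = 0


class FakeOvn:
    """Models NB_Global.nb_cfg -> SB_Global.nb_cfg -> Chassis_Private.nb_cfg."""

    def __init__(self, num_chassis: int):
        self.nb_global_nb_cfg = 0
        self.sb_global_nb_cfg = 0
        self.chassis = [ChassisPrivate(f"chassis-{i}") for i in range(num_chassis)]

    def bump_nb_cfg(self) -> int:
        # Neutron increments NB_Global.nb_cfg; ovn-northd copies it to
        # SB_Global.nb_cfg; every ovn-controller then writes its own
        # Chassis_Private.nb_cfg, producing one SB update per chassis.
        self.nb_global_nb_cfg += 1
        self.sb_global_nb_cfg = self.nb_global_nb_cfg
        for row in self.chassis:
            row.nb_cfg = self.sb_global_nb_cfg
        return len(self.chassis)  # number of Chassis_Private updates emitted


ovn = FakeOvn(num_chassis=250)
print(ovn.bump_nb_cfg())  # -> 250 updates for a single nb_cfg bump
```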
After that, the southbound ovsdb sends an update to Neutron either due to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to send the update to the worker.
In my testing, when that happened, all Neutron API workers stopped processing API requests until all Chassis_Private events were handled, which took around 30 seconds on each nb_cfg update. This could be because the controller nodes in the test environment were not scaled up properly, but it seems to be a potential scaling issue.
tags: added: ovn
> When a change is made in Neutron, an update in OVN is created and the NB_Global.nb_cfg field is incremented.
This *shouldn't* be the case. We should only be updating nb_cfg when we call ping_all_chassis(), which is called from get_agents(). By default bump_nb_cfg is set to False: https://github.com/openstack/neutron/blob/19372a3cd8b4e6e45f707753b914e133857dd629/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py#L67
Is it possible that you have a low agent_down_time setting and we are pinging very frequently?
If we are bumping nb_cfg in other places (or somehow calling ping_all_chassis more frequently than agent_down_time / 2), then that is definitely an additional bug.
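As a rough sketch of the gating I would expect around those pings (illustrative only, this is not the actual ping_all_chassis implementation; the names maybe_ping_all_chassis and _last_ping are made up):

```python
# Hedged sketch of the expected rate limiting around nb_cfg bumps; the
# function and variable names are illustrative, not the Neutron code.
import time

AGENT_DOWN_TIME = 75  # seconds; Neutron's default agent_down_time

_last_ping = 0.0


def maybe_ping_all_chassis(bump_nb_cfg_fn) -> bool:
    """Bump nb_cfg (pinging all chassis) at most once per agent_down_time / 2.

    get_agents() may be called very frequently (e.g. by the segments or
    trunk validation code), so this timestamp check is what should keep the
    southbound DB from being flooded with Chassis_Private updates.
    """
    global _last_ping
    now = time.monotonic()
    if now - _last_ping < AGENT_DOWN_TIME / 2:
        return False  # pinged recently enough, do not bump nb_cfg
    _last_ping = now
    bump_nb_cfg_fn()
    return True
```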
> After that, the southbound ovsdb sends an update to Neutron either due to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to send the update to the worker.
The events will be sent to us regardless of the Neutron code, as long as we have the table/column registered. Registering those events locally is basically just a filter that says "when you get this event, do something with it"; every incoming update is tested against all of the registered events regardless.
The ChassisAgentDeleteEvent, which matches the SB_Global table, will look at the updated row, check whether it is a specific external_ids update and, if not, return immediately. Also, we should *only* be sent SB_Global events that are external_ids changes, because of https://github.com/openstack/neutron/blob/19372a3cd8b4e6e45f707753b914e133857dd629/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L664, so I wouldn't expect this to generate much load.
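A hedged sketch of that kind of early-return filter (hypothetical function name, not the actual event class in ovsdb_monitor.py):

```python
# Illustrative match filter for SB_Global updates: anything that is not an
# external_ids change is rejected immediately, so these events should be
# cheap. This is a hypothetical stand-in, not ChassisAgentDeleteEvent itself.
def matches_agent_delete(row, old) -> bool:
    # Only external_ids changes are interesting; any other SB_Global update
    # is discarded right here without doing further work.
    if not hasattr(old, 'external_ids'):
        return False
    # A real event would also check for a specific key change inside
    # external_ids before doing anything expensive.
    return old.external_ids != row.external_ids
```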
ChassisAgentWriteEvent will match on the Chassis_Private table and will update the agent for each row that is created or has an updated nb_cfg, which, if there are 250 agents, could take a bit, but 30 seconds would be weird. These events are actually not processed by the HashRing because they are "global" events, so they are processed on each server (because they have to update an internally cached agent object).
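To make the dispatch difference concrete, here is a simplified, purely illustrative model (not the actual hash-ring code, and the worker pool size is just an example) of why global events multiply the work: a hash-ring event is handled by exactly one worker, while a global event is handled by every worker that keeps its own agent cache:

```python
# Simplified, illustrative dispatch model; not the real hash-ring or event
# handling code from the OVN mech driver.
import hashlib


def dispatch(event_id: str, is_global: bool, workers: list[str]) -> list[str]:
    """Return the workers that will process this event."""
    if is_global:
        # Global events (e.g. Chassis_Private/agent updates) are handled by
        # every worker, since each one maintains its own cached agent objects.
        return list(workers)
    # Hash-ring events are handled by exactly one worker.
    idx = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % len(workers)
    return [workers[idx]]


workers = [f"api-worker-{i}" for i in range(10)]
# 250 chassis bumping nb_cfg -> 250 global events, each handled by all
# 10 workers in this illustrative pool:
total = sum(len(dispatch(f"chassis-{i}", True, workers)) for i in range(250))
print(total)  # -> 2500 event handler runs for a single nb_cfg bump
```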
So we need to find out if each update really is somehow bumping nb_cfg, because that *shouldn't* be happening. It looks like the segments code and the trunk port validation code can both call get_agents() a lot, which is a bit worrying. But even so, the ping_all_chassis code *should* be looking at the timestamp of the last ping and not pinging if it was within the agent_down_time window.
In any case, 30 seconds of processing, even if it only happened on pings (which should be every agent_down_time / 2 seconds, ~37s by default), isn't going to be acceptable. I'm happy to look into it, I just don't happen to have 250 machines sitting nearby...