Restarting Neutron floods Nova with segment aggregates calls

Bug #1951010 reported by Léo Gillot-Lamure
This bug affects 1 person
Affects: neutron
Status: Confirmed
Importance: High
Assigned to: Miguel Lavalle
Milestone: (none)

Bug Description

* High level description:

Whenever we restart neutron-server, we see a huge number of requests (hundreds of thousands, overwhelming our control plane over the course of ~8 hours) going from neutron-server to nova-api. These requests are related to the segment aggregates: more specifically, for each of our hypervisors and for each segment, we see a GET to [URL] to fetch the aggregate id for the segment, and a POST to [URL] to (try to) add the hypervisor to the aggregate, which fails because the host already exists there. These calls originate from the "_add_host_to_aggregate" method: https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/plugin.py#L403 and we see the 'Host %(host)s already exists in aggregate for routed network segment %(segment_id)s' message in our logs.

To be more exact, we see more than one such log line per (host id, segment id) pair: it appears several times, more often at the beginning and then more rarely, until it does not appear anymore. We determined that this is because each of the RPC worker processes goes through the same procedure independently. The procedure is the following:
- When a process starts, the "reported_hosts" variable here https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L252 gets initialized to the empty set.
- Each time a neutron agent on a hypervisor (in our case the openvswitch_agent) does its periodic state report, it sends a message over RabbitMQ.
- This message gets randomly picked up by one of the neutron-server worker processes. We enter the method "_update_segment_host_mapping_for_agent" https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L283. After a few checks, we arrive at this line: "if host in reported_hosts and not start_flag: return" (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L298). Let's assume that start_flag is false (we verified it to be the case). Because reported_hosts is initially empty, we do not return: we go ahead, add the host to the "reported_hosts" set (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L300) and eventually reach the "_add_host_to_aggregate" method in plugin.py, triggering one series of calls for the agent (hence hypervisor) originating the message, targeting each segment aggregate.
- Then, the next time the agent for this same hypervisor sends a state report message, there are two possibilities:
  - case 1: the message is picked up by the same RPC worker as before. We arrive at the line "if host in reported_hosts and not start_flag: return", but this time the hypervisor is already in the reported_hosts set, so we return and no request is sent.
  - case 2: the message is picked up by a different RPC worker. We then end up in the same scenario as for the previous message.
  At the beginning, just after neutron-server has been restarted, case 1 is highly improbable (say we have 50 RPC workers across our whole control plane: for a given hypervisor there is a 1/50 chance of hitting the same RPC worker, then a 2/50 chance on the next message, etc.). On the other hand, case 2 is highly probable, for the same reason (nearly all RPC workers have a "reported_hosts" set that does not yet contain this hypervisor, since it is initially empty). This is why we see an enormous number of calls at the beginning and only a trickle at the end.
- Ultimately, this hypervisor is present in the "reported_hosts" variable of all RPC workers and we stop getting calls. This is the stable state of the system, and the reason why the calls only start when we restart neutron-server (a minimal sketch of this per-worker logic is shown after this list).
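
To make the per-worker behaviour concrete, here is a minimal, self-contained Python sketch of the deduplication described above (the names reported_hosts, start_flag and _update_segment_host_mapping_for_agent follow the Neutron code; everything else is simplified for illustration):

    # Minimal sketch of the per-worker deduplication described above.
    # Each RPC worker process holds its own copy of this module-level set,
    # so the "already reported" check only helps within a single process.
    reported_hosts = set()

    def _update_segment_host_mapping_for_agent(host, start_flag=False):
        """Simplified stand-in for the Neutron method of the same name."""
        if host in reported_hosts and not start_flag:
            return  # this worker already reported this host: no Nova calls
        reported_hosts.add(host)
        # In Neutron this is where _add_host_to_aggregate() eventually runs,
        # issuing one GET and one POST to Nova per segment on the host.
        print("would call Nova for host %s" % host)

    _update_segment_host_mapping_for_agent("hv-001")  # first report seen: Nova is called
    _update_segment_host_mapping_for_agent("hv-001")  # same worker: skipped
    # A different worker process has its own (empty) set and calls Nova again.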

Attached is an excerpt of logs from a debug line that we added to our code in our test environment; it shows the "reported_hosts" set growing in parallel in each RPC worker process.

This whole procedure happens in parallel for all hosts. The large number of times the same series of calls is repeated for the same hypervisor comes from the fact that the "reported_hosts" set is a plain Python variable, purely local to each RPC worker process. In our case we have 4 neutron-server instances that each run 9 regular RPC workers and 9 state report RPC workers, hence 4×18=72 processes, i.e. 72 copies of the variable.

We calculated that a rolling restart of our neutron-server instances (say, for a deployment update) will ultimately generate 300 hypervisors × 100 networks × 72 processes × 2 (one GET then one POST) = 4 320 000 calls to Nova.
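
For reference, this figure can be reproduced with a trivial calculation (numbers taken from this report):

    # Rough estimate of the Nova calls generated by a full rolling restart,
    # using the figures from this report.
    hypervisors = 300
    routed_networks = 100        # one segment aggregate each
    worker_processes = 4 * 18    # 4 neutron-server instances x 18 RPC workers each
    calls_per_registration = 2   # one GET then one POST

    total = hypervisors * routed_networks * worker_processes * calls_per_registration
    print(total)  # 4320000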

* Workaround

Attached is a minimal patch that we are considering applying to our internal branch. It disables the whole "re-register to Nova at each restart" logic, while still making it possible to trigger the re-registration by restarting the neutron openvswitch agent on the hypervisors (thanks to the start_flag).
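
The patch itself is in the attachment and is not reproduced here; as a rough, hypothetical sketch of the idea (not the actual diff), the decision in db.py would only honour the agent's start_flag instead of the per-process set:

    # Hypothetical sketch of the workaround idea (not the attached patch),
    # expressed against the logic of _update_segment_host_mapping_for_agent:
    # only (re-)register a host with Nova when the agent itself reports a
    # fresh start, never merely because neutron-server was restarted.
    def should_report_host(host, reported_hosts, start_flag):
        if not start_flag:
            # No automatic re-registration after a neutron-server restart.
            # Operators can still force it by restarting the agent on the
            # hypervisor, which sets start_flag in its state report.
            return False
        reported_hosts.add(host)
        return True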

* Pre-conditions: what is the initial state of your system? Please consider enumerating resources available in the system, if useful in diagnosing the problem. Who are you? A regular tenant or a super-user? Are you describing service-to-service interaction?

We operate a fairly large deployment. Our biggest region, where this issue is the most impactful, has 300 hypervisors across 2 AZs, 100 provider routed networks (we don't do SDN), and 4 baremetal control plane nodes with 24 hyperthreads and 250 GB RAM each, on which all OpenStack services are deployed (notably Nova, Neutron, Glance, RabbitMQ and HAProxy), except MySQL Galera which runs alone on another 4 baremetal nodes.

* Step-by-step reproduction steps:

$ docker restart neutron_server

* Expected output: nothing

* Actual output: Hundreds of thousands of calls are emitted from neutron-server to nova-api, crippling our control plane for hours.

* Version:
  ** OpenStack version: Ussuri
  ** Linux distro, kernel: CentOS 7 for the host, CentOS 8 for the Kolla containers
  ** DevStack or other _deployment_ mechanism: Kolla-ansible 10.2

* Perceived severity: the call flood basically takes our control plane down due to the load for the first 4 hours, and severely impacts its performance for the next 4 hours, every time we restart neutron-server. As a result we are severely constrained in our ability to restart neutron-server, which should be a non-event.

Revision history for this message
Léo Gillot-Lamure (navaati-sg) wrote :
description: updated
Revision history for this message
Léo Gillot-Lamure (navaati-sg) wrote :
Hongbin Lu (hongbin.lu)
Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

When Neutron server is starting, the in-memory variable 'reported_hosts' is an empty set. Even when the segments reported by agents have already been initialized, the algorithm will execute the process of reporting host/segment mappings to Nova, creating an aggregate for each segment and adding hosts to it.

We can notice some points regarding the current process:

 - It only reports mappings that are added, but does not report mappings that get removed (hosts or segments deleted)
 - It reports mappings that already exist, which leads to a terrible flood in Nova for large deployments using segments.

Even though mappings and aggregates are persistent, it seems reasonable for Neutron to execute such a rebuilding process during a restart.

To fix the issue, one suggestion is to initialize an in-memory data structure, shared by the workers, holding the host/segment mappings. When Neutron starts, the first worker that acquires the lock would retrieve the mappings from the database. Each process could then compare the mappings received from agents against this structure and report changes only if needed.

If there are no changes to a mapping, we can expect the related aggregate to also be in sync.
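
As a standalone illustration of this suggestion (the shared dict, the lock and the helper names are all hypothetical, not Neutron code), the workers would consult one shared cache of host/segment mappings, seeded once from the database, and only report to Nova when an agent's report actually changes a mapping:

    # Standalone sketch of the suggestion above: one host -> segments cache
    # shared by all workers, seeded once from the database, so that a state
    # report only reaches Nova when it actually changes a mapping.
    from multiprocessing import Manager

    manager = Manager()
    shared_mappings = manager.dict()   # host -> frozenset of segment ids
    init_lock = manager.Lock()

    def ensure_cache_seeded(load_mappings_from_db):
        # Whichever worker acquires the lock first fills the cache from the
        # database; the others find it already populated and skip the load.
        with init_lock:
            if len(shared_mappings) == 0:
                shared_mappings.update(load_mappings_from_db())

    def handle_state_report(host, segments, report_to_nova):
        current = shared_mappings.get(host)
        new = frozenset(segments)
        if current == new:
            return                      # nothing changed: no Nova traffic
        shared_mappings[host] = new
        report_to_nova(host, segments)  # only report actual changes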

Miguel, does that sound reasonable to you? I see you are assigned to the ticket; would you share your thinking with us?

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Looks like the initial issue has been resolved with:

  https://bugs.launchpad.net/neutron/+bug/1952730

By adding that extra information to the messages sent by the producer, consumers can discard messages. With such logic, we could ask ourselves why the producer couldn't have applied the condition itself, so as to avoid sending an unnecessary message to the queue.

Seen like this, the patch looks a bit hacky in terms of maintainability, but it seems legitimate, as I guess it resolves the flood.

Revision history for this message
Lajos Katona (lajos-katona) wrote :

Shall I make this a duplicate of https://bugs.launchpad.net/neutron/+bug/1952730 and close?

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

@Lajos, my feeling is that bug/1952730 is a workaround, as it prevents the flood from happening; the real fix that closes this bug is bug/1959750.

Revision history for this message
Lajos Katona (lajos-katona) wrote :
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

I may be missing something, Lajos, but both are related; they follow the same path. One fixes the root cause that creates the flood, the other adds a condition to avoid triggering the flood too often. If you change the network configuration within the segment, the condition will fail and the flood will happen.
