Activity log for bug #1951010

Date Who What changed Old value New value Message
2021-11-15 19:34:32 Léo Gillot-Lamure bug added bug
2021-11-15 19:34:32 Léo Gillot-Lamure attachment added Logs showing the evolution of "reported_hosts" https://bugs.launchpad.net/bugs/1951010/+attachment/5541058/+files/investigation_neutron.txt
2021-11-15 19:38:45 Léo Gillot-Lamure description
* High level description: Whenever we restart neutron-server, we see a huge number of requests (hundreds of thousands, overwhelming our control plane over the course of ~8 hours) going from neutron-server to nova-api. These requests are related to the segment aggregates: for each of our hypervisors and for each segment, we see a GET to [URL] to fetch the aggregate id for the segment and a POST to [URL] to (try to) add the hypervisor to the aggregate, which fails because the host already exists there. These calls originate from the "_add_host_to_aggregate" method (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/plugin.py#L403) and we see the 'Host %(host)s already exists in aggregate for routed network segment %(segment_id)s' message in our logs. To be exact, we see more than one such log per (host id, segment id) pair: it appears several times, frequently at first and then more and more rarely, until it stops appearing altogether. We determined that this is because each RPC worker process goes through the same procedure independently (a simplified sketch of this logic follows the list below):
- When a process starts, the variable "reported_hosts" (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L252) gets initialized to the empty set.
- Each time a neutron agent on a hypervisor (in our case the openvswitch_agent) does its periodic state report, it sends a message over RabbitMQ.
- This message gets randomly picked up by one of the neutron-server worker processes, which enters the method "_update_segment_host_mapping_for_agent" (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L283). After a few checks, we reach the line "if host in reported_hosts and not start_flag: return" (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L298). Assume start_flag is false (we verified that it is): because reported_hosts is initially empty, we do not return; we add the host to the "reported_hosts" set (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L300) and eventually reach the "_add_host_to_aggregate" method in plugin.py, triggering one series of calls for the agent (hence hypervisor) originating the message, targeting each segment aggregate.
- Then, the next time the agent for this same hypervisor sends a state report message, there are two possibilities:
  - case 1: the message is picked up by the same RPC worker as before. We reach the line "if host in reported_hosts and not start_flag: return", but this time the hypervisor is already in the reported_hosts set, so we return and no request is sent.
  - case 2: the message is picked up by a different RPC worker. We then go through the same scenario as for the previous message.
  At the beginning, just after neutron-server has been restarted, case 1 is highly improbable (say we have 50 RPC workers across our whole control plane: for a given hypervisor there is a 1/50 chance of hitting the same RPC worker, then a 2/50 chance on the next message, etc.). Conversely, case 2 is highly probable, for the same reason (nearly all RPC workers have a "reported_hosts" set that does not yet contain this hypervisor, since it starts out empty). This is why we see an enormous number of messages sent at the beginning and only a trickle at the end.
- Ultimately, this hypervisor is present in the "reported_hosts" variable of every RPC worker and we stop seeing calls. This is the stable state of the system, and the reason why the calls only start again when we restart neutron-server.
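For clarity, a simplified, self-contained sketch of the per-process behaviour described above (not the actual Neutron source; the function signature and the nova_calls callback are illustrative, only the reported_hosts set and the quoted check mirror db.py):

    # Simplified sketch (not the actual Neutron code): every RPC worker process
    # holds its own module-level set, so the "already reported?" check only has
    # per-process scope and each worker falls through once per host.
    reported_hosts = set()  # re-created empty in every worker at startup

    def _update_segment_host_mapping_for_agent(host, start_flag, segment_ids, nova_calls):
        # Mirrors the check quoted above (db.py#L298).
        if host in reported_hosts and not start_flag:
            return
        reported_hosts.add(host)  # db.py#L300, local to this process only
        for segment_id in segment_ids:
            # One GET (look up the segment's aggregate) and one POST (add the
            # host to it); the POST fails once the host is already a member.
            nova_calls(host, segment_id)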
Attached is an excerpt of logs produced by a debug line we added to our code in a test environment; it shows the "reported_hosts" set growing in parallel in each RPC worker process. This whole procedure happens in parallel for all hosts. The large number of repetitions of the same series of calls for the same hypervisor comes from the fact that the "reported_hosts" set is a plain Python variable, purely local to each RPC worker process. In our case we have 4 neutron-server instances, each running 9 regular RPC workers and 9 state report RPC workers, hence 4 × 18 = 72 copies of the variable. We calculated that a rolling restart of our neutron-server instances (say for an update deployment) will ultimately generate 300 hypervisors × 100 networks × 72 processes × 2 (one GET then one POST) = 4 320 000 calls to Nova.
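The amplification can be reproduced with a small standalone simulation (illustrative only; the worker, host and segment counts below are simply the figures from this report): each host ends up being reported once per worker, which is where the factor of 72 comes from.

    # Illustrative simulation of the amplification: W workers, each with a
    # purely local "reported_hosts" set, receiving state reports dispatched at
    # random. Every host eventually triggers the registration path once per
    # worker, so the number of GET+POST series converges to hosts * workers.
    import random

    def simulate(workers=72, hosts=300, reports_per_host=2000):
        seen = [set() for _ in range(workers)]  # one local set per worker process
        series = 0                              # GET+POST sequences per segment
        for _ in range(reports_per_host):
            for host in range(hosts):
                w = random.randrange(workers)   # RabbitMQ picks a worker at random
                if host in seen[w]:             # start_flag assumed false
                    continue
                seen[w].add(host)
                series += 1
        return series

    series = simulate()
    print(series)            # converges to 300 * 72 = 21 600 series
    print(series * 100 * 2)  # times 100 segments, GET + POST: ~4 320 000 calls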
* Workaround: attached is a minimal patch that we are considering applying to our internal branch. It disables the whole "re-register to Nova at each restart" logic, while still keeping the possibility to trigger this re-registration by restarting the neutron openvswitch agent on the hypervisors (thanks to the start_flag).
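As a rough illustration of the idea (this is not the attached patch, whose exact hunk may differ; the helper name and callback below are hypothetical), the re-registration would be gated on start_flag alone, so a neutron-server restart becomes a no-op while an agent restart still re-registers its host:

    # Sketch of the workaround's idea (NOT the attached patch): only go through
    # the Nova aggregate registration when the agent reports with start_flag
    # set, i.e. when the agent itself was restarted on the hypervisor.
    reported_hosts = set()

    def _maybe_register_host(host, start_flag, register_host_cb):
        # register_host_cb is a hypothetical callback performing the GET/POST
        # sequence against nova-api for every segment of the host.
        if not start_flag:
            # Was: "if host in reported_hosts and not start_flag: return",
            # which lets every freshly restarted worker fall through once.
            return
        reported_hosts.add(host)
        register_host_cb(host)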
* Pre-conditions: what is the initial state of your system? Please consider enumerating resources available in the system, if useful in diagnosing the problem. Who are you? A regular tenant or a super-user? Are you describing service-to-service interaction?
We operate a fairly large deployment. Our biggest region, where this issue is most impactful, has 300 hypervisors across 2 AZs, 100 provider routed networks (we don't do SDN), and 4 baremetal control plane nodes with 24 hyperthreads and 250 GB of RAM each, on which all OpenStack services are deployed (notably Nova, Neutron, Glance, RabbitMQ and HAProxy) except Galera MySQL, which runs alone on another 4 baremetal nodes.
* Step-by-step reproduction steps:
$ docker restart neutron_server
* Expected output: nothing.
* Actual output: hundreds of thousands of calls are emitted from neutron-server to nova-api, crippling our control plane for hours.
* Version:
** OpenStack version: Ussuri
** Linux distro, kernel: CentOS 7 for the host, CentOS 8 for the Kolla containers
** DevStack or other _deployment_ mechanism: Kolla-ansible 10.2
* Perceived severity: the call flood basically takes our control plane down for the first 4 hours and severely degrades its performance for the next 4 hours, every time we restart neutron-server. As a result we are severely constrained in our ability to restart neutron-server, which should be a non-event.
2021-11-15 19:42:14 Léo Gillot-Lamure attachment added Our minimal workaround https://bugs.launchpad.net/neutron/+bug/1951010/+attachment/5541059/+files/0001-Prevent-flooding-Nova-with-segment-aggregate-updates.patch
2021-11-17 12:31:33 Hongbin Lu neutron: status New Confirmed
2021-11-17 12:31:43 Hongbin Lu neutron: importance Undecided High
2021-11-21 15:58:31 gustavo panizzo bug added subscriber gustavo panizzo
2021-11-23 14:27:43 Miguel Lavalle neutron: assignee Miguel Lavalle (minsel)