rpc response timeout for agent report_state is not possible
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Fix Released | Medium | Tobias Urdin | | |
Bug Description
When hosting a large number of routers and/or networks, the RPC calls from the agents can take a long time, which requires us to increase the rpc_response_
This has the side effect that if rabbitmq or neutron-server is restarted, all agents that are reporting at that moment will hang for a long time until report_state times out. During this time neutron-server receives no reports from them, so it marks those agents as down.
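The timing problem can be illustrated with a rough worst-case model. This is only a sketch under simplifying assumptions: the agent sends one report every `report_interval` seconds, a single report_state call hangs for the full RPC timeout, and the server marks the agent down once the silence exceeds `agent_down_time`. The function names and the value 600 are made up for illustration and are not taken from the neutron code.

```python
def worst_case_report_gap(report_interval: float, rpc_timeout: float) -> float:
    """Longest silence the server may see when one report_state call
    hangs until the RPC timeout expires (simplified model)."""
    return report_interval + rpc_timeout


def is_marked_down(report_interval: float, rpc_timeout: float,
                   agent_down_time: float) -> bool:
    """True if the silence is longer than the server tolerates."""
    return worst_case_report_gap(report_interval, rpc_timeout) > agent_down_time


# Illustrative values only: a 30 s report interval and a 75 s down time.
# A raised RPC timeout of 600 s pushes the gap far past the down time ...
raised = is_marked_down(report_interval=30, rpc_timeout=600, agent_down_time=75)
# ... while a 30 s timeout for report_state keeps the gap within bounds.
lowered = is_marked_down(report_interval=30, rpc_timeout=30, agent_down_time=75)

print(raised, lowered)
```

In this model the agent survives a restart only when the report interval plus the report_state timeout stays below the down time, which is why a separate, shorter timeout for that one call would help.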
When the call times out and is retried, the reporting will succeed, but a full sync will be triggered for every agent that was previously marked dead. This in itself can cause a very high load on the control plane.
Consider the case where a configuration change is deployed using tooling to all neutron-server nodes, which are then restarted: all agents will die, and when they either 1) come back after rpc_response_
We should have a configuration option that applies only to the RPC timeout of the report_state call from agents, because that timeout could be lowered to stay within the window before the agent is seen as down.
The old behavior can be kept by simply falling back to rpc_response_
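The proposed fallback could look roughly like the sketch below. It is not the merged patch: the option name `report_state_timeout` and the helper `resolve_report_state_timeout` are hypothetical, chosen only to show how a dedicated timeout can default to the general RPC response timeout when it is left unset.

```python
from typing import Optional


def resolve_report_state_timeout(report_state_timeout: Optional[int],
                                 rpc_response_timeout: int) -> int:
    """Pick the timeout to use for the report_state RPC call.

    report_state_timeout is a hypothetical dedicated option; None means
    the operator did not set it, so the old behavior is kept.
    """
    if report_state_timeout is None:
        # Unset: fall back to the general RPC response timeout.
        return rpc_response_timeout
    if report_state_timeout <= 0:
        raise ValueError("report_state_timeout must be a positive number")
    return report_state_timeout


# Unset option: report_state uses the general timeout, as before.
print(resolve_report_state_timeout(None, 600))
# Dedicated option set: only report_state gets the shorter timeout.
print(resolve_report_state_timeout(30, 600))
```

Keeping the unset case identical to the old behavior means existing deployments see no change until an operator opts in.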
Changed in neutron: | |
status: | Opinion → In Progress |
Changed in neutron: | |
importance: | Undecided → Medium |
assignee: | nobody → Tobias Urdin (tobias-urdin) |
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/815310