The agent/server communication pattern we use now can lead to cascading failures that make the servers unavailable.
The current pattern of communication between the Neutron server and the agents looks like the following:

1. Server sends a notification: item <item-uuid> changed.
2. Agent receives the event.
3. Agent makes a call back to the server asking for the item details.
The calls the agent makes back to the server can be expensive, and a server under heavy load can take a long time to start processing a request and/or to fulfill it. This can trigger a timeout on the agent side, which leads to a retry or, even worse, a generic fallback that resyncs the entire state. The result is a thundering herd: a server that falls behind on requests is continually stampeded by retries from agents whose calls have already timed out by the time the server can respond.
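A minimal sketch of this failure mode, with made-up helper names (`rpc_call`, `resync_all`) standing in for the real agent plumbing:

```python
RPC_TIMEOUT = 10  # seconds the agent waits for a reply


class RpcTimeout(Exception):
    """Raised when the server does not answer in time."""


def rpc_call(method, *args, timeout=RPC_TIMEOUT):
    # Stand-in for the real RPC client; here we simulate an overloaded
    # server that never answers before the timeout fires.
    raise RpcTimeout(method)


def resync_all():
    print("falling back to a full state resync -- expensive!")


def handle_item_changed(item_uuid):
    # Current pattern: the notification carries only the UUID, so the
    # agent must call back to the server for the actual details.
    try:
        details = rpc_call('get_item_details', item_uuid)
        print("applied update:", details)
    except RpcTimeout:
        # Generic fallback: resync the entire state. When many agents
        # time out at once, they all do this simultaneously, stampeding
        # the already-overloaded server (the thundering herd).
        resync_all()


handle_item_changed("some-item-uuid")
```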
At a minimum, the pattern of agent/server communication needs to be adjusted to assume terrible server response times. Optimally, every notification generated by the server should include all of the information an agent needs to respond to the event, so that the only time an agent actually has to call the server is on startup, to fetch the initial state.
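As a sketch (the field names are invented, not Neutron's actual payload schema), the difference between the two notification styles might look like this:

```python
def apply_update(state):
    print("applying:", state)


def handle_event(event, fetch_details):
    if "state" in event:
        # "Fat" notification: the payload already carries everything
        # the agent needs, so no call back to the server is required.
        apply_update(event["state"])
    else:
        # "Thin" notification: forces a round-trip to a possibly
        # overloaded server just to learn what changed.
        apply_update(fetch_details(event["id"]))


thin_event = {"event": "port.changed", "id": "<item-uuid>"}
fat_event = {
    "event": "port.changed",
    "id": "<item-uuid>",
    "state": {
        "admin_state_up": True,
        "fixed_ips": ["10.0.0.5"],
        "qos_policy_id": None,  # extension data travels with the event too
    },
}

handle_event(fat_event, fetch_details=None)  # no server round-trip needed
```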
Yes, we need a good RPC mechanism to avoid the need to sync back to the Neutron server.
And we should probably implement back-off / circuit-breaker patterns on requests back to Neutron to mitigate cascading failures.
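A rough sketch of what that could look like on the agent side; the thresholds and the `TimeoutError`-raising `call` are illustrative assumptions, not existing Neutron code:

```python
import random
import time


class CircuitBreaker:
    """Stop calling the server entirely after repeated failures,
    then let a probe through again after a cool-down period."""

    def __init__(self, failure_threshold=5, cooldown=60):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: allow one probe once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def call_with_backoff(call, breaker, retries=4, base_delay=1.0):
    """Retry with exponential back-off and jitter instead of
    hammering a server that is already falling behind."""
    for attempt in range(retries):
        if not breaker.allow():
            raise RuntimeError("circuit open: skipping call to neutron")
        try:
            result = call()
            breaker.record_success()
            return result
        except TimeoutError:
            breaker.record_failure()
            # Jittered exponential back-off desynchronizes the herd.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise TimeoutError("giving up after %d attempts" % retries)
```

The jitter matters as much as the back-off: without it, agents that timed out together retry together, recreating the stampede.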
If we had a single source of incremental IDs we could rely on (redis INCR [1], or any abstracted client), we could tag RPC messages with monotonically increasing IDs to avoid the out-of-order issues. Alternatively, we could timestamp resources in the DB to make sure we always keep the latest update of an object, but that gets complicated for composite objects: for example, when we add "qos_policy_id" to a port by extending it.
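For instance, with redis-py (the key name and message shape here are hypothetical), the server could stamp every notification from a shared counter and the agents could drop anything older than what they have already applied:

```python
import redis

r = redis.Redis()  # the shared, single source of incremental IDs


# Server side: stamp each notification with a globally increasing ID.
def publish_update(resource_id, payload):
    seq = r.incr("neutron:update-seq")  # atomic increment, never repeats
    return {"resource_id": resource_id, "seq": seq, "payload": payload}


# Agent side: track the newest sequence applied per resource and
# silently drop messages that were reordered in transit.
last_seen = {}


def handle_update(msg):
    rid, seq = msg["resource_id"], msg["seq"]
    if seq <= last_seen.get(rid, 0):
        return  # stale update arriving out of order; ignore it
    last_seen[rid] = seq
    print("applying:", msg["payload"])
```

Note that the counter only needs to be monotonic, not gap-free: the per-resource comparison on the agent side is what guarantees the latest update wins.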