Comment 10 for bug 1516260

Oleg Bondarev (obondarev) wrote :

I'll try to summarise:
the problem is that once a certain number of routers are assigned to a particular agent, the neutron server fails to serve the get_routers request for that agent within the default rpc timeout.

Two solutions were proposed:
1) just increase timeout for agent waiting for sync_routers and make it 300 seconds https://review.openstack.org/#/c/245432/
 Pros:
 - very simple
 Cons:
 - not a complete fix: with more routers, timeouts may appear again (though the patch could be changed to increase the timeout dynamically)
 - agent is just waiting and doing nothing so this also increases the time needed for agent to sync state
 - load on server
 - giant rpc message (load on Rabbit)

2) paginate sync routers task with dynamic chunk size https://review.openstack.org/#/c/234067/
 Pros:
 - can be called a complete fix (if the server fails to serve get_routers for the min chunk size of 32 then I think it's another problem)
 - agent starts router processing earlier (after receiving the first chunk) so this decreases sync state time
 - load on server is smoothed
 - no giant rpc messages
 Cons:
 - more complex fix
 - more rpc chattiness on resync (not something we should worry about IMO: it is a negligible increase compared to the average chattiness of a loaded cluster, and resync is not something that we expect to happen often)
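
To make option 2 concrete, here is a minimal sketch of a paginated sync with a dynamic chunk size. It is not the code from the review: the names `sync_routers_in_chunks`, `fetch_chunk`, `process` and `RpcTimeout` are hypothetical, and halving the chunk on each timeout is an assumption about how "dynamic" could work. Only the 32 and 256 bounds come from this comment.

```python
class RpcTimeout(Exception):
    """Hypothetical stand-in for an rpc timeout error."""


MIN_CHUNK = 32    # min chunk size mentioned in this comment
MAX_CHUNK = 256   # max chunk size mentioned in this comment


def sync_routers_in_chunks(router_ids, fetch_chunk, process,
                           chunk_size=MAX_CHUNK):
    """Fetch routers chunk by chunk, shrinking the chunk on timeout.

    fetch_chunk(ids) is assumed to return router data for the given ids
    or raise RpcTimeout; process(routers) handles one chunk as soon as
    it arrives. Returns the chunk size that ended up working.
    """
    i = 0
    while i < len(router_ids):
        chunk = router_ids[i:i + chunk_size]
        try:
            routers = fetch_chunk(chunk)
        except RpcTimeout:
            if chunk_size <= MIN_CHUNK:
                # Even the smallest chunk times out: a different problem.
                raise
            # Assumed policy: halve the chunk and retry the same offset.
            chunk_size = max(chunk_size // 2, MIN_CHUNK)
            continue
        # Processing starts after the first chunk, not after the full sync.
        process(routers)
        i += len(chunk)
    return chunk_size
```

For simplicity the sketch only shrinks the chunk and never grows it back; a real implementation might also increase it again after a run of successful calls.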

I don't like the idea of combining the two approaches: a bigger timeout will increase the time the agent needs to adjust the chunk size, which increases the overall resync time. Also, the max chunk size is set to 256, and if the server fails to serve 256 routers in 60 seconds we may have another problem which we'd be masking with a bigger timeout.
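
A rough illustration of the adjustment cost, under the assumption (not confirmed by the patch) that the agent halves the chunk size after each timed-out call; `time_lost_adjusting` is a hypothetical helper:

```python
def time_lost_adjusting(timeout_s, start_chunk=256, working_chunk=32):
    """Seconds spent waiting on timed-out calls before reaching a
    chunk size the server can serve, assuming the chunk is halved
    after every timeout (an assumption made for illustration)."""
    lost = 0
    chunk = start_chunk
    while chunk > working_chunk:
        lost += timeout_s   # one full timeout is wasted per failed attempt
        chunk //= 2
    return lost


print(time_lost_adjusting(60))    # 180
print(time_lost_adjusting(300))   # 900
```

In this worst case (only the min chunk of 32 works), a 60 second timeout costs 3 minutes of waiting before useful work starts, while a 300 second timeout costs 15 minutes.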