[Backport 1516260] L3 agent sync_routers timeouts may cause cluster to fall down

Bug #1536954 reported by Oleg Bondarev
Affects: Mirantis OpenStack
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev
Milestone: 8.0

Bug Description

Upstream bug: https://bugs.launchpad.net/neutron/+bug/1516260

The L3 agent's 'sync_routers' RPC call is sent when the agent starts or when an exception occurs. It uses the default timeout of 60 seconds (an Oslo messaging config option). At scale the server can take longer than that to answer, so the call times out and is sent again, leading to a cascading failure that does not resolve itself. The sync_routers server-side RPC response was optimized to mitigate this; simply increasing the timeout could also help.
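
As noted above, raising the RPC timeout is a complementary mitigation to the server-side optimization. A minimal illustration, assuming the standard oslo.messaging option rpc_response_timeout in neutron.conf (the 180-second value is only an example to tune per environment):

# /etc/neutron/neutron.conf
[DEFAULT]
# Default is 60 seconds; give the server more time to answer sync_routers at scale.
rpc_response_timeout = 180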

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/16367

Changed in mos:
status: New → In Progress
tags: added: scale
Changed in mos:
milestone: none → 8.0
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/16367
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 062d16ac0163f921d4255d9b77d6f903a7c5f110
Author: Oleg Bondarev <email address hidden>
Date: Fri Jan 22 09:00:46 2016

L3 agent: paginate sync routers task

When there are thousands of routers attached to thousands of
networks, the sync_routers request may take a long time and time out
on the agent side, so the agent initiates another resync. This can
lead to an endless loop that overloads the server and leaves the
agent unable to sync its state.

This patch makes the L3 agent first check how many routers are
assigned to it and then fetch them in chunks. The initial chunk size
is 256 but may be decreased dynamically if timeouts occur while
waiting for the server's response.

This approach reduces the load on the server side and speeds up
resync on the agent side, since processing starts as soon as the
first chunk is received.

upstream review: https://review.openstack.org/234067

Closes-Bug: #1536954
Closes-Bug: #1516260
Change-Id: Id675910c2a0b862bfb9e6f4fdaf3cd9fe337e52f
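
A rough Python sketch of the chunking described in the commit message above; it is illustrative only, and names such as fetch_routers_chunked and the client.call() usage are assumptions rather than Neutron's actual internal API:

import oslo_messaging

INITIAL_CHUNK_SIZE = 256  # starting chunk size, as described in the patch
MIN_CHUNK_SIZE = 32       # illustrative lower bound, not taken from the patch

def fetch_routers_chunked(client, context, router_ids):
    """Fetch routers in chunks, shrinking the chunk size on RPC timeouts."""
    chunk_size = INITIAL_CHUNK_SIZE
    routers = []
    index = 0
    while index < len(router_ids):
        chunk = router_ids[index:index + chunk_size]
        try:
            # One bounded sync_routers call per chunk instead of a single huge call.
            routers.extend(
                client.call(context, 'sync_routers', router_ids=chunk))
            index += len(chunk)
        except oslo_messaging.MessagingTimeout:
            if chunk_size <= MIN_CHUNK_SIZE:
                raise  # even the smallest chunk times out; give up
            # Halve the chunk size and retry the same slice.
            chunk_size = max(chunk_size // 2, MIN_CHUNK_SIZE)
    return routers

Processing each chunk as it arrives is what lets the agent start configuring routers before the full list has been transferred.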

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Mikhail Chernik (mchernik) wrote :

Verified on MOS 8.0 RC2 (ISO 570), fixed

Changed in mos:
status: Fix Committed → Fix Released