[Backport 1516260] L3 agent sync_routers timeouts may cause cluster to fall down

Bug #1536954 reported by Oleg Bondarev
Affects: Mirantis OpenStack
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev
Milestone: 8.0

Bug Description

Upstream bug: https://bugs.launchpad.net/neutron/+bug/1516260

The L3 agent's 'sync_routers' RPC call is sent when the agent starts or when an exception occurs. It uses the default timeout of 60 seconds (an Oslo messaging config option). At scale the server can take longer than that to answer, so the call times out and is sent again, leading to a cascading failure that does not resolve itself. The sync_routers server-side RPC response was optimized to mitigate this; simply increasing the timeout could also help.
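
As noted above, raising the RPC timeout is a complementary mitigation to the server-side optimization. A minimal illustration, assuming the standard oslo.messaging option rpc_response_timeout in neutron.conf (the 180-second value is only an example to tune per environment):

# /etc/neutron/neutron.conf
[DEFAULT]
# Default is 60 seconds; give the server more time to answer sync_routers at scale.
rpc_response_timeout = 180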

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/16367

Changed in mos:
status: New → In Progress
tags: added: scale
Changed in mos:
milestone: none → 8.0
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/16367
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 062d16ac0163f921d4255d9b77d6f903a7c5f110
Author: Oleg Bondarev <email address hidden>
Date: Fri Jan 22 09:00:46 2016

L3 agent: paginate sync routers task

When there are thousands of routers attached to thousands of
networks, the sync_routers request may take a long time and time out
on the agent side, so the agent initiates another resync. This can
lead to an endless loop that overloads the server and leaves the
agent unable to sync its state.

This patch makes the L3 agent first check how many routers are
assigned to it and then fetch them in chunks. The initial chunk size
is 256 but may be decreased dynamically if timeouts occur while
waiting for the server's response.

This approach reduces the load on the server side and speeds up
resync on the agent side, since processing starts as soon as the
first chunk is received.

upstream review: https://review.openstack.org/234067

Closes-Bug: #1536954
Closes-Bug: #1516260
Change-Id: Id675910c2a0b862bfb9e6f4fdaf3cd9fe337e52f
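
A rough Python sketch of the chunking described in the commit message above; it is illustrative only, and names such as fetch_routers_chunked and the client.call() usage are assumptions rather than Neutron's actual internal API:

import oslo_messaging

INITIAL_CHUNK_SIZE = 256  # starting chunk size, as described in the patch
MIN_CHUNK_SIZE = 32       # illustrative lower bound, not taken from the patch

def fetch_routers_chunked(client, context, router_ids):
    """Fetch routers in chunks, shrinking the chunk size on RPC timeouts."""
    chunk_size = INITIAL_CHUNK_SIZE
    routers = []
    index = 0
    while index < len(router_ids):
        chunk = router_ids[index:index + chunk_size]
        try:
            # One bounded sync_routers call per chunk instead of a single huge call.
            routers.extend(
                client.call(context, 'sync_routers', router_ids=chunk))
            index += len(chunk)
        except oslo_messaging.MessagingTimeout:
            if chunk_size <= MIN_CHUNK_SIZE:
                raise  # even the smallest chunk times out; give up
            # Halve the chunk size and retry the same slice.
            chunk_size = max(chunk_size // 2, MIN_CHUNK_SIZE)
    return routers

Processing each chunk as it arrives is what lets the agent start configuring routers before the full list has been transferred.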

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Mikhail Chernik (mchernik) wrote :

Verified on MOS 8.0 RC2 (ISO 570), fixed

Changed in mos:
status: Fix Committed → Fix Released