L3 agent sync_routers timeouts may cause cluster to fall down

Bug #1516260 reported by Assaf Muller
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev
Milestone: (none)

Bug Description

The L3 agent's 'sync_routers' RPC call is sent when the agent starts or when an exception occurs. It uses a default timeout of 60 seconds (an oslo.messaging config option). At scale, the server can take longer than that to answer; the call times out and is sent again, causing a cascading failure from which the cluster does not recover on its own. The server-side sync_routers RPC response was optimized to mitigate this; it could also help to simply increase the timeout.
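
For reference, the 60-second default corresponds to oslo.messaging's rpc_response_timeout option. Below is a minimal sketch (not actual Neutron code; method and attribute names are illustrative) of how the failure mode arises on the agent side:

    # Illustrative only: a timed-out full sync just schedules another full sync,
    # so the same expensive request is repeated against an already loaded server.
    import oslo_messaging

    def periodic_sync_routers_task(agent, context):
        try:
            # The server collects every router bound to this agent; at scale
            # this can exceed the default 60 s rpc_response_timeout.
            routers = agent.plugin_rpc.get_routers(context)
        except oslo_messaging.MessagingTimeout:
            agent.fullsync = True  # triggers yet another full sync_routers request
            return
        agent.process_routers(routers)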

Tags: l3-ipam-dhcp
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

@obondarev, it could be; maybe DVR makes it more likely to happen?

I think we should add a back-off mechanism with increasing retry intervals to resync operations that fail with a timeout, or implement a circuit breaker pattern for agent-to-server operations [1].

[1] https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
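
As a rough illustration of the back-off idea (hypothetical code, not a proposed patch): instead of retrying immediately, the agent could double the wait after each failed resync attempt, up to some cap:

    # Hypothetical exponential back-off between failed full-sync attempts.
    import time

    import oslo_messaging

    def resync_with_backoff(sync_fn, max_delay=300):
        delay = 1
        while True:
            try:
                return sync_fn()
            except oslo_messaging.MessagingTimeout:
                time.sleep(delay)                   # wait before the next attempt
                delay = min(delay * 2, max_delay)   # double the wait, capped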

Revision history for this message
Oleg Bondarev (obondarev) wrote :

@Miguel, yes, DVR causes it to manifest earlier, but it's not a DVR-specific problem.

Changed in neutron:
status: New → In Progress
Revision history for this message
Oleg Bondarev (obondarev) wrote :

I removed the duplicate marking since bug 1505575 was about fatal memory consumption, which has been fixed. The timeouts and the endless resync loop are still a problem, however.

Changed in neutron:
assignee: Assaf Muller (amuller) → Oleg Bondarev (obondarev)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

I think this should be high priority.

Changed in neutron:
importance: Low → High
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

What is the patch for this? I noticed "OpenStack Infra" marked it "In Progress" but I don't see a link to the patch. It would be nice to have a link.

Revision history for this message
Assaf Muller (amuller) wrote :
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I don't understand the relationship between the two proposed fixes. Would each one independently fix this bug? That is what their commit messages lead me to believe.

Or, are they both necessary together?

In my opinion, if they both address the problem independently and we still want to merge them both then we need two bugs for this. Is that still the plan? Otherwise, we're going to be confused about the state of this bug when one/both merges. And then if we want to consider a backport of one or both, the confusion will continue.

Please clarify the relationship between the two proposed patches.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

I'll try to summarise:
the problem is that beyond a certain number of routers assigned to a particular agent, the neutron server fails to serve the get_routers request for that agent within the default RPC timeout.

Two solutions were proposed:
1) Just increase the timeout the agent waits for sync_routers to 300 seconds: https://review.openstack.org/#/c/245432/
 Pros:
 - very simple
 Cons:
 - not a complete fix; with even more routers, timeouts may appear again (could be changed to increase the timeout dynamically)
 - the agent just waits doing nothing, which also increases the time needed for the agent to sync state
 - the heavy load on the server remains
 - a giant RPC message (load on RabbitMQ)

2) Paginate the sync routers task with a dynamic chunk size: https://review.openstack.org/#/c/234067/
 Pros:
 - can be called a complete fix (if the server fails to serve get_routers for the minimum chunk size of 32, then I think it's a different problem)
 - the agent starts router processing earlier (after receiving the first chunk), which decreases sync state time
 - the load on the server is smoothed out
 - no giant RPC messages
 Cons:
 - a more complex fix
 - more RPC chattiness on resync (not something we should worry about IMO; it is a negligible increase compared to the average chattiness of a loaded cluster, and resync is not something we expect to happen often)

I don't like the idea of combining the two approaches: a bigger timeout would increase the time the agent needs to adjust the chunk size, which increases the overall resync time. Also, the max chunk size is set to 256, and if the server fails to serve 256 routers within 60 seconds we may have another problem that we would just be masking with a bigger timeout.
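
A rough sketch of option 2 under the above assumptions (helper names are illustrative, not the exact Neutron API): fetch the agent's routers in chunks and halve the chunk size whenever a request times out:

    # Illustrative pagination with a dynamically shrinking chunk size.
    import oslo_messaging

    def fetch_routers_in_chunks(plugin_rpc, context, router_ids,
                                initial_chunk=256, min_chunk=32):
        chunk, i = initial_chunk, 0
        while i < len(router_ids):
            batch = router_ids[i:i + chunk]
            try:
                yield plugin_rpc.get_routers(context, router_ids=batch)
                i += len(batch)
            except oslo_messaging.MessagingTimeout:
                if chunk <= min_chunk:
                    raise  # timing out even at the minimum size is a different problem
                chunk = max(chunk // 2, min_chunk)  # retry the same slice with a smaller chunk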

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/245432
Reason: Abandoned in favor of: https://review.openstack.org/234067/

Revision history for this message
Assaf Muller (amuller) wrote :

Oleg's patch should do the trick.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

There are a couple of related bugs; pagination has been proposed for at least one of them and was essentially rejected here: https://review.openstack.org/#/c/163594/

https://bugs.launchpad.net/neutron/+bug/1430999 (OVS agent)
https://bugs.launchpad.net/neutron/+bug/1525753 (DHCP agent)

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/234067
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Submitter: Jenkins
Branch: master

commit 0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Author: Oleg Bondarev <email address hidden>
Date: Tue Oct 13 12:45:59 2015 +0300

    L3 agent: paginate sync routers task

    In case there are thousands of routers attached to thousands of
    networks, sync_routers request might take a long time and lead to timeout
    on agent side, so agent initiate another resync. This may lead to an endless
    loop causing server overload and agent not being able to sync state.

    This patch makes l3 agent first check how many routers are assigned to
    it and then start to fetch routers by chunks.
    Initial chunk size is set to 256 but may be decreased dynamically in case
    timeouts happen while waiting response from server.

    This approach allows to reduce the load on server side and to speed up
    resync on agent side by starting processing right after receiving
    the first chunk.

    Closes-Bug: #1516260
    Change-Id: Id675910c2a0b862bfb9e6f4fdaf3cd9fe337e52f
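
In outline, the merged flow reads roughly like the sketch below (illustrative names; it assumes a lightweight call, here called get_router_ids, that returns only the IDs of the routers bound to the agent):

    # High-level sketch: fetch only the router IDs first, then pull full router
    # data in batches and process each batch as soon as it arrives.
    def full_sync(plugin_rpc, context, process_routers, chunk_size=256):
        router_ids = plugin_rpc.get_router_ids(context)  # cheap request
        for i in range(0, len(router_ids), chunk_size):
            batch = router_ids[i:i + chunk_size]
            # On a MessagingTimeout the merged patch retries with a smaller chunk.
            routers = plugin_rpc.get_routers(context, router_ids=batch)
            process_routers(routers)  # processing starts right after the first batch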

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

tags: added: liberty-backport-potential
tags: removed: liberty-backport-potential