L3 agent sync_routers timeouts may cause cluster to fall down

Bug #1516260 reported by Assaf Muller
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev
Milestone: (none)

Bug Description

The L3 agent's 'sync_routers' RPC call is sent when the agent starts or when an exception occurs. It uses a default timeout of 60 seconds (an oslo.messaging config option). At scale, the server can take longer than that to answer; the call times out and is sent again, causing a cascading failure from which the cluster does not recover on its own. The server-side sync_routers RPC response was optimized to mitigate this; it could also help to simply increase the timeout.
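
For reference, the 60-second default corresponds to oslo.messaging's rpc_response_timeout option. Below is a minimal sketch (not actual Neutron code; method and attribute names are illustrative) of how the failure mode arises on the agent side:

    # Illustrative only: a timed-out full sync just schedules another full sync,
    # so the same expensive request is repeated against an already loaded server.
    import oslo_messaging

    def periodic_sync_routers_task(agent, context):
        try:
            # The server collects every router bound to this agent; at scale
            # this can exceed the default 60 s rpc_response_timeout.
            routers = agent.plugin_rpc.get_routers(context)
        except oslo_messaging.MessagingTimeout:
            agent.fullsync = True  # triggers yet another full sync_routers request
            return
        agent.process_routers(routers)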

Tags: l3-ipam-dhcp
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

@obondarev, it could be; maybe DVR makes it more likely to happen?

I think we should add a back-off mechanism with increasing retry intervals to resync operations that fail with a timeout, or implement a circuit breaker pattern for agent-to-server operations [1].

[1] https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
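
As a rough illustration of the back-off idea (hypothetical code, not a proposed patch): instead of retrying immediately, the agent could double the wait after each failed resync attempt, up to some cap:

    # Hypothetical exponential back-off between failed full-sync attempts.
    import time

    import oslo_messaging

    def resync_with_backoff(sync_fn, max_delay=300):
        delay = 1
        while True:
            try:
                return sync_fn()
            except oslo_messaging.MessagingTimeout:
                time.sleep(delay)                   # wait before the next attempt
                delay = min(delay * 2, max_delay)   # double the wait, capped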

Revision history for this message
Oleg Bondarev (obondarev) wrote :

@Miguel, yes, DVR causes it to manifest earlier, but it's not a DVR-specific problem.

Changed in neutron:
status: New → In Progress
Revision history for this message
Oleg Bondarev (obondarev) wrote :

I removed the duplicate marking since bug 1505575 was about fatal memory consumption, which has been fixed. The timeouts and the endless resync loop are still a problem, however.

Changed in neutron:
assignee: Assaf Muller (amuller) → Oleg Bondarev (obondarev)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

I think this should be high priority.

Changed in neutron:
importance: Low → High
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

What is the patch for this? I noticed "OpenStack Infra" marked it "In Progress" but I don't see a link to the patch. It would be nice to have a link.

Revision history for this message
Assaf Muller (amuller) wrote :
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I don't understand the relationship between the two proposed fixes. Would each one independently fix this bug? That is what their commit messages lead me to believe.

Or, are they both necessary together?

In my opinion, if they both address the problem independently and we still want to merge them both then we need two bugs for this. Is that still the plan? Otherwise, we're going to be confused about the state of this bug when one/both merges. And then if we want to consider a backport of one or both, the confusion will continue.

Please clarify the relationship between the two proposed patches.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

I'll try to summarise:
the problem is that beyond a certain number of routers assigned to a particular agent, the neutron server fails to serve the get_routers request for that agent within the default RPC timeout.

Two solutions were proposed:
1) Just increase the timeout the agent waits for sync_routers to 300 seconds: https://review.openstack.org/#/c/245432/
 Pros:
 - very simple
 Cons:
 - not a complete fix; with even more routers, timeouts may appear again (could be changed to increase the timeout dynamically)
 - the agent just waits doing nothing, which also increases the time needed for the agent to sync state
 - the heavy load on the server remains
 - a giant RPC message (load on RabbitMQ)

2) Paginate the sync routers task with a dynamic chunk size: https://review.openstack.org/#/c/234067/
 Pros:
 - can be called a complete fix (if the server fails to serve get_routers for the minimum chunk size of 32, then I think it's a different problem)
 - the agent starts router processing earlier (after receiving the first chunk), which decreases sync state time
 - the load on the server is smoothed out
 - no giant RPC messages
 Cons:
 - a more complex fix
 - more RPC chattiness on resync (not something we should worry about IMO; it is a negligible increase compared to the average chattiness of a loaded cluster, and resync is not something we expect to happen often)

I don't like the idea of combining the two approaches: a bigger timeout would increase the time the agent needs to adjust the chunk size, which increases the overall resync time. Also, the max chunk size is set to 256, and if the server fails to serve 256 routers within 60 seconds we may have another problem that we would just be masking with a bigger timeout.
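
A rough sketch of option 2 under the above assumptions (helper names are illustrative, not the exact Neutron API): fetch the agent's routers in chunks and halve the chunk size whenever a request times out:

    # Illustrative pagination with a dynamically shrinking chunk size.
    import oslo_messaging

    def fetch_routers_in_chunks(plugin_rpc, context, router_ids,
                                initial_chunk=256, min_chunk=32):
        chunk, i = initial_chunk, 0
        while i < len(router_ids):
            batch = router_ids[i:i + chunk]
            try:
                yield plugin_rpc.get_routers(context, router_ids=batch)
                i += len(batch)
            except oslo_messaging.MessagingTimeout:
                if chunk <= min_chunk:
                    raise  # timing out even at the minimum size is a different problem
                chunk = max(chunk // 2, min_chunk)  # retry the same slice with a smaller chunk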

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/245432
Reason: Abandoned in favor of: https://review.openstack.org/234067/

Revision history for this message
Assaf Muller (amuller) wrote :

Oleg's patch should do the trick.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

There are a couple of related bugs; pagination has been proposed for at least one of them and was essentially rejected here: https://review.openstack.org/#/c/163594/

https://bugs.launchpad.net/neutron/+bug/1430999 (OVS agent)
https://bugs.launchpad.net/neutron/+bug/1525753 (DHCP agent)

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/234067
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Submitter: Jenkins
Branch: master

commit 0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Author: Oleg Bondarev <email address hidden>
Date: Tue Oct 13 12:45:59 2015 +0300

    L3 agent: paginate sync routers task

    In case there are thousands of routers attached to thousands of
    networks, sync_routers request might take a long time and lead to timeout
    on agent side, so agent initiate another resync. This may lead to an endless
    loop causing server overload and agent not being able to sync state.

    This patch makes l3 agent first check how many routers are assigned to
    it and then start to fetch routers by chunks.
    Initial chunk size is set to 256 but may be decreased dynamically in case
    timeouts happen while waiting response from server.

    This approach allows to reduce the load on server side and to speed up
    resync on agent side by starting processing right after receiving
    the first chunk.

    Closes-Bug: #1516260
    Change-Id: Id675910c2a0b862bfb9e6f4fdaf3cd9fe337e52f
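
In outline, the merged flow reads roughly like the sketch below (illustrative names; it assumes a lightweight call, here called get_router_ids, that returns only the IDs of the routers bound to the agent):

    # High-level sketch: fetch only the router IDs first, then pull full router
    # data in batches and process each batch as soon as it arrives.
    def full_sync(plugin_rpc, context, process_routers, chunk_size=256):
        router_ids = plugin_rpc.get_router_ids(context)  # cheap request
        for i in range(0, len(router_ids), chunk_size):
            batch = router_ids[i:i + chunk_size]
            # On a MessagingTimeout the merged patch retries with a smaller chunk.
            routers = plugin_rpc.get_routers(context, router_ids=batch)
            process_routers(routers)  # processing starts right after the first batch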

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

tags: added: liberty-backport-potential
tags: removed: liberty-backport-potential