l3 agent can be marked dead while it reschedules a lot of resources

Bug #1440761 reported by Ann Taraday on 2015-04-06
This bug affects 1 person

Bug Description

If an L3 agent gets killed while a lot of resources are assigned to it, the agent to which these resources are rescheduled can be marked as dead by neutron-server, because its state reports are not received within agent_down_time*2.

This was tested with agent_down_time=15 and 100 routers for rescheduling.

Changed in neutron:
assignee: nobody → Ann Kamyshnikova (akamyshnikova)
Eugene Nikanorov (enikanorov) wrote :

This is similar to an issue that was previously found for the DHCP agent.
Since it is less likely to appear in the L3 case (it needs many routers and a low agent_down_time setting), I am setting the importance to 'Low'.

We need to give an L3 agent that has just received a batch of routers a few additional seconds before considering it dead and moving routers away from it.
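For illustration only, the grace period idea could look roughly like the sketch below. It assumes the server considers an agent dead once no state report has arrived within agent_down_time*2 (30 seconds with the tested value of 15), as the bug description says. The function name, the last_bulk_assignment argument and the extra_seconds value are hypothetical and are not taken from neutron or from the proposed patch.

```python
def is_dead(now, last_heartbeat, agent_down_time,
            last_bulk_assignment=None, extra_seconds=5.0):
    """Hypothetical liveness check with a grace period.

    All times are plain numbers of seconds. The base window of
    agent_down_time * 2 follows the bug description; extra_seconds is an
    assumed tunable, not an existing neutron option.
    """
    limit = agent_down_time * 2
    # If routers were handed to the agent after its last state report,
    # allow a few additional seconds before declaring it dead.
    if last_bulk_assignment is not None and last_bulk_assignment >= last_heartbeat:
        limit += extra_seconds
    return now - last_heartbeat > limit
```

With agent_down_time=15, an agent that last reported at t=0 and received routers at t=1 would be declared dead after 35 seconds of silence instead of 30.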

Changed in neutron:
importance: Undecided → Low
tags: added: l3-ipam-dhcp
Changed in neutron:
status: New → Confirmed
Changed in neutron:
status: Confirmed → In Progress
Assaf Muller (amuller) wrote :

I think the bug description should have a link to the similar DHCP bug. Also, I saw the patch fixes the issue in the server, but isn't this an agent bug? Why does the new agent that receives the moved resources not send an update in time?

Eugene Nikanorov (enikanorov) wrote :

Assaf, the issue with DHCP agents was solved during the initial development of network rescheduling.

It was discovered that while processing networks the DHCP agent spawns lots of greenthreads, and the greenthread that sends state reports does not get control for quite a long time, which can be long enough for the agent to be considered dead.

With the L3 agent we see a similar issue. The solution would be to prioritize the greenthread responsible for sending agent heartbeats, but I'm not aware of any prioritization support in eventlet.
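A rough way to picture the starvation described here, using the standard library's asyncio as a stand-in for eventlet (both schedule cooperatively, without priorities). This is only a sketch of the hypothesis, not the agent's code: the heartbeat interval, the burst size and all function names are made up for the example, and the workers deliberately block without yielding.

```python
import asyncio
import time

HEARTBEAT_INTERVAL = 0.05  # seconds between state reports (illustrative value)


async def heartbeat(stamps, stop):
    # Record the moment each "state report" is actually sent.
    while not stop.is_set():
        stamps.append(time.monotonic())
        await asyncio.sleep(HEARTBEAT_INTERVAL)


async def process_resource(busy_seconds):
    # time.sleep() blocks without yielding to the event loop, standing in
    # for work that does not hand control back to the scheduler.
    time.sleep(busy_seconds)


async def _run(n_resources, busy_seconds):
    stamps, stop = [], asyncio.Event()
    hb = asyncio.create_task(heartbeat(stamps, stop))
    await asyncio.sleep(HEARTBEAT_INTERVAL * 3)  # a few on-time heartbeats
    # A burst of rescheduled resources arrives at once.
    await asyncio.gather(
        *(process_resource(busy_seconds) for _ in range(n_resources)))
    await asyncio.sleep(HEARTBEAT_INTERVAL * 3)
    stop.set()
    await hb
    # Largest gap observed between two consecutive heartbeats.
    return max(b - a for a, b in zip(stamps, stamps[1:]))


def max_heartbeat_gap(n_resources=20, busy_seconds=0.02):
    return asyncio.run(_run(n_resources, busy_seconds))
```

Calling max_heartbeat_gap(20, 0.02) should report a gap of at least n_resources * busy_seconds, far above the nominal interval, because the heartbeat only runs again after the whole burst has been processed.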

Carl Baldwin (carl-baldwin) wrote :

I don't want to dismiss @Assaf's question so quickly. I'm not sure that I'm quite satisfied that it has been resolved.

The number of worker threads should be very limited in the L3 agent. The _process_routers_loop method limits the pool to a size of 8 threads. Also, there is a lot of IO going on in each of these threads. My experience is that each thread yields regularly. How is the state reporting getting starved? I restart production L3 agents routinely and we don't see this issue of the agent being reported as dead by the neutron server. We have hundreds of routers per agent.

Eugene Nikanorov (enikanorov) wrote :

It's true that this problem is much less severe for L3 agents for various reasons, including those you mentioned.
However, with a lowered agent_down_time (the default of 75 seconds is too much) this issue can still show up from time to time.

However, I'm not insisting on this being fixed.

Carl Baldwin (carl-baldwin) wrote :

Just checking in. Thanks @Eugene for your comment. What do others think? Is this worth the complexity to fix? Or should we try to understand better what is going on? I won't insist either way without hearing other opinions.

The problem described here is a clear indication of the fact that we rely on a single-threshold-based fault detection algorithm, which does not discriminate a persistent failure from a transient one. Increasing the timeout blindly, or changing the rescheduling logic slightly, is still not going to get rid of the occasional failure to report back to the server in time due to yet another unforeseen issue that we haven't thought of or encountered.

Perhaps it would be useful to think about ways to discriminate between the various faults (permanent, intermittent, and transient). For that we'd need to revise the single-threshold-based fault detection algorithm so that the lack of a report does not automatically trigger a state change UP => DOWN, but an intermediate one, UP => DEGRADED; at this point we'd start watching the component more closely. If the component comes back within another window of time, then we go back from DEGRADED => UP; if it doesn't, we can safely say that the component is going from DEGRADED to DOWN and we can activate any other failover strategies.
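To make the idea concrete, here is a minimal sketch of such a two-window detector. It is only an illustration of the UP => DEGRADED => DOWN scheme described above: the class name, degraded_after and down_after_degraded are hypothetical and do not correspond to any existing neutron code or configuration option.

```python
from enum import Enum


class AgentState(Enum):
    UP = "UP"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"


class AgentMonitor:
    """Hypothetical two-window fault detector for a single agent."""

    def __init__(self, degraded_after, down_after_degraded):
        # Silence longer than degraded_after moves the agent to DEGRADED;
        # a further down_after_degraded seconds of silence moves it to DOWN.
        self.degraded_after = degraded_after
        self.down_after_degraded = down_after_degraded
        self.last_report = 0.0

    def report(self, now):
        # Any state report brings the agent back to UP.
        self.last_report = now

    def state(self, now):
        silence = now - self.last_report
        if silence <= self.degraded_after:
            return AgentState.UP
        if silence <= self.degraded_after + self.down_after_degraded:
            return AgentState.DEGRADED
        return AgentState.DOWN
```

In this sketch only a DOWN result would trigger rescheduling; DEGRADED merely marks the agent for closer observation, so a report that arrives late but within the second window avoids a needless failover.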

Some may argue that this effectively means increasing the timeout and causing further delay in fault recovery. Setting timeouts is inherently a trade-off: here the trade-off is made to avoid a potential false positive that may cause needless rescheduling havoc, and the overall time window can still be tuned to stay within sensible boundaries.

My 2c

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/171592
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

This bug is > 172 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee: Ann Kamyshnikova (akamyshnikova) → nobody
status: In Progress → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired