[library] l3-agent active-passive failover takes 30+ seconds when a controller fails

Bug #1328970 reported by Chris Clason
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Invalid | High | Sergey Vasilenko |

Bug Description

The l3-agent takes 30+ seconds to fail over to a standby controller when a controller node fails. Customers are asking for this time to be minimized as much as possible.

Andrew Woodward (xarses)
tags: added: customer-found
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
tags: added: ha
Changed in fuel:
milestone: none → 5.1
assignee: nobody → Fuel Library Team (fuel-library)
Andrew Woodward (xarses) wrote :

There are hold-down sleeps in the OCF script (+33 sec) https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/neutron/files/ocf/neutron-agent-l3#L411 and reconnect sleeps in https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/neutron/files/q-agent-cleanup.py#L420. Since we have cleaned up service recovery, we should be able to remove most of these and rely on fast reconnect attempts to cover any gaps that remain.
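
As a rough illustration of the fast re-connect idea (this is not the code in q-agent-cleanup.py; `retry_with_backoff`, `connect`, and the delay values are hypothetical names and numbers chosen for the sketch), a long fixed hold-down sleep could be replaced by a short, capped backoff loop that returns as soon as the service answers:

```python
import time


def retry_with_backoff(connect, attempts=5, first_delay=0.5, max_delay=4.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, sleeping briefly between failures.

    Returns (result, total_seconds_slept). Re-raises the last
    ConnectionError if every attempt fails.
    All names and defaults here are illustrative assumptions.
    """
    delay = first_delay
    slept = 0.0
    for attempt in range(1, attempts + 1):
        try:
            # Success on any attempt returns immediately, with no hold-down.
            return connect(), slept
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            slept += delay
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

With these assumed defaults the loop waits at most 0.5 + 1 + 2 + 4 = 7.5 seconds across five attempts, and far less when the service recovers quickly, compared with a fixed sleep that always costs its full duration.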

Changed in fuel:
status: Confirmed → Triaged
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Dmitry Ilyin (idv1985)
summary: - l3-agent active-passive failover takes 30+ seconds when a controller
- fails
+ [puppet] l3-agent active-passive failover takes 30+ seconds when a
+ controller fails
Dmitry Ilyin (idv1985)
summary: - [puppet] l3-agent active-passive failover takes 30+ seconds when a
+ [library] l3-agent active-passive failover takes 30+ seconds when a
controller fails
Changed in fuel:
milestone: 5.1 → 6.0
tags: added: release-notes
Matthew Mosesohn (raytrac3r) wrote :

Sergey, can you provide an update? Will this be fixed in 6.0 or moved to 6.1?

Sergey Vasilenko (xenolog) wrote :

This will be fixed automatically once the "multiple-l3-agents" feature is implemented.
But the DoS timeout also depends on RabbitMQ stability. If the failover triggers a RabbitMQ cluster rebuild, the router can't be rescheduled to another L3 agent until the RabbitMQ cluster has finished rebuilding.

Matthew Mosesohn (raytrac3r) wrote :

This bug may be fixed by the multiple-l3-agents feature, which is now merged.

Vladimir Kuklin (vkuklin) wrote :

+100500

Changed in fuel:
status: Triaged → Invalid
Vladimir Kuklin (vkuklin) wrote :

This bug was fixed by the multiple-l3-agents blueprint.
