Allow an admin to evacuate a L3 agent from HA routers

Bug #1365470 reported by Assaf Muller
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Sridhar Gaddam

Bug Description

If an admin wants to perform planned maintenance on a host with an L3 agent installed, shutting down the L3 agent does not fail over the HA routers on that node. Rebooting the machine or otherwise destroying connectivity would put all of its routers in a fault state, and router instances on other agents would become the new masters. However, we should have a way to let admins explicitly evacuate a node. I suggest that setting the L3 agent as down should trigger this:

neutron agent-update <agent_id> --admin_state_up=False
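
For scripted maintenance, the same request could be issued from Python. The following is only a sketch: the request body mirrors the `--admin_state_up=False` flag above, while the helper names are hypothetical and `client` is assumed to expose an `update_agent(agent_id, body)` method, as python-neutronclient's v2.0 client does.

```python
def build_admin_down_body():
    """Return the request body that marks an agent administratively down."""
    return {"agent": {"admin_state_up": False}}


def set_agent_admin_down(client, agent_id):
    """Ask Neutron to set admin_state_up=False on the given agent.

    `client` is an assumption: any object with an
    update_agent(agent_id, body) method will do.
    """
    return client.update_agent(agent_id, build_admin_down_body())
```

Whether this actually moves HA routers off the node depends on the server-side behaviour discussed in this bug.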

Tags: l3-ha
Changed in neutron:
assignee: nobody → tcs_openstack_group (tcs-openstack-group)
Assaf Muller (amuller) wrote :

Not 100% sure if we should lower the priority or shut down all of the interfaces of each HA router.

Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: tcs_openstack_group (tcs-openstack-group) → Mounika (mounika-pandhiri)
Mounika (mounika-pandhiri) wrote :

Unable to replicate the bug, as the HARouterScheduler class is not present in l3_agent_scheduler.py.
Code merges are still being done.

Assaf Muller (amuller) wrote :

All of the code for the feature has been merged :)

If you check out l3_agent_scheduler, you'll see that the random scheduler and least routers scheduler have a _choose_router_agents_for_ha method which bind_ha_router uses.

On a side note, why do we care about scheduling within the context of this bug?

Changed in neutron:
assignee: Mounika (mounika-pandhiri) → nobody
Rounak (rounak-pramanik) wrote :

"neutron router-create router_name --ha=True" is also not working (it throws a "bad request" error) after setting l3_ha=True in /etc/neutron/neutron.conf on the latest master.

Assaf Muller (amuller) wrote :

Please paste the error.

Rounak (rounak-pramanik) wrote :

"neutron router-create router_name --ha=True" is throwing an error:
 Bad Request (HTTP 400) (Request-ID: req-d86a7ac7-ad55-4d46-9577-670f679b3954)

Assaf Muller (amuller) wrote :

Can you paste the trace from the server log?

Changed in neutron:
assignee: nobody → Sridhar Gaddam (sridhargaddam)
Sridhar Gaddam (sridhargaddam) wrote :

@Assaf, I had a look at this, and these are my observations.

In the latest code, when we set the admin_state of an agent to False, the HA router on that agent gets deleted.
In that situation, one of the backup HA routers takes over the role of Master.
So I hope the requirements of this bug are already met and that it no longer applies to the present code.
If my understanding is wrong, please let me know.

OTOH, I wanted to see the behavior of keepalived (version v1.2.16) when we update the priority and SIGHUP the process.
I see that when we lower the priority of the master HA Router and SIGHUP, it continues to serve the role of Master* even with the lower priority.
This is a bit strange.

[*] with a small outage (a few seconds) while one of the backup routers takes over the role of Master and then goes back to Backup.

10:25:25.316856 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:25:27.317914 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:25:34.123811 IP 169.254.192.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:25:34.124913 IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:25:34.125432 IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
...
<SNIP> During this period HA router with IP 169.254.192.1 was acting as master.
...
10:26:06.943902 IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:26:08.944739 IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:26:08.946743 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:26:08.947452 IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20
10:26:10.949838 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 10, authtype none, intvl 2s, length 20
10:26:12.950664 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 10, authtype none, intvl 2s, length 20
10:26:14.952486 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 10, authtype none, intvl 2s, length 20

Note: When we lower the priority for a backup HA router and SIGHUP, as expected, it remains in the backup state.
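
To make the hand-over points in a capture like the one above easier to spot, the advertisement lines can be reduced to the moments where the sender or its priority changes. This is a minimal sketch, assuming tcpdump output in exactly the format shown above; the function names are hypothetical.

```python
import re

# Matches tcpdump lines such as:
# 10:25:25.316856 IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, ...
ADVERT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > 224\.0\.0\.18: "
    r"VRRPv2, Advertisement, vrid (?P<vrid>\d+), prio (?P<prio>\d+)"
)


def parse_adverts(lines):
    """Yield (time, source IP, priority) for each VRRP advertisement line."""
    for line in lines:
        match = ADVERT_RE.match(line.strip())
        if match:
            yield match["time"], match["src"], int(match["prio"])


def sender_changes(lines):
    """Return the advertisements where the sender or its priority changed."""
    changes = []
    previous = None
    for time, src, prio in parse_adverts(lines):
        if (src, prio) != previous:
            changes.append((time, src, prio))
            previous = (src, prio)
    return changes
```

Lines that are not advertisements (for example the `<SNIP>` marker) are skipped, so a capture can be fed in unchanged.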

Assaf Muller (amuller) wrote :

Changed to fix committed according to Sridhar's comment.

@Sridhar: We use no-preemption, so I don't think priority settings are being taken into account. If you'd like to fiddle around with it locally, you can enable preemption in the keepalived.conf template we use and then play around with the priorities.
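
The effect of no-preemption on priorities can be illustrated with a toy decision function. This is a deliberately simplified model of VRRP behaviour, not keepalived's implementation: it ignores timers and address tie-breaking, and the function name and parameters are hypothetical.

```python
def backup_takes_over(local_priority, master_priority, master_alive, nopreempt):
    """Decide whether a backup VRRP instance should become master.

    Simplified model: ignores timers, IP-address tie-breaking,
    and the special priorities 0 and 255.
    """
    if not master_alive:
        # No advertisements from the master: a backup takes over regardless.
        return True
    if nopreempt:
        # A live master is never displaced, whatever the priorities are.
        return False
    # With preemption, a strictly higher-priority backup displaces the master.
    return local_priority > master_priority
```

Under this model, lowering a live master's priority from 50 to 10 changes nothing while no-preemption is set, but lets a priority-50 backup take over once preemption is enabled.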

Changed in neutron:
status: New → Fix Committed
description: updated
Sridhar Gaddam (sridhargaddam) wrote :

Yes @Assaf, I agree with you. The no-preempt flag would have an effect on the priority settings.

Changed in neutron:
milestone: none → liberty-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-2 → 7.0.0