State of HA routers not exposed via API

Bug #1365453 reported by Assaf Muller on 2014-09-04
Affects: neutron | Importance: High | Assigned to: Assaf Muller

Bug Description

It's very difficult to know the state of an HA router on each L3 agent: which instance is the master, and where it is hosted. This is a maintenance nightmare. What if I have a split brain? What if I want to know where the master is so I can manually move it? Currently, the only way to find out is to SSH into an agent and run:

cat $state_path/ha_confs/router_id/state

But this method requires accessing each individual agent. A more user-friendly way would be to expose this via the API, so admins could query a router and get the list of agents and the state on each one:

l3-agent-list-hosting-router <router_id>
This currently shows all of the agents hosting the requested router. It would now also show the HA state on each agent (which agent is the master, and which are the standbys).
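An illustrative mock-up of what the extended output might look like (the ha_state column name and the active/standby values are assumptions for this example, not the final API):

```
$ neutron l3-agent-list-hosting-router <router_id>
+-----+--------+----------------+-------+----------+
| id  | host   | admin_state_up | alive | ha_state |
+-----+--------+----------------+-------+----------+
| ... | node-1 | True           | :-)   | active   |
| ... | node-2 | True           | :-)   | standby  |
+-----+--------+----------------+-------+----------+
```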

Implementation choices:
keepalived doesn't provide a way to query the current VRRP state. The only way to know it, then, is to use notifier scripts. These scripts are executed when a state transition occurs and receive the new state (master, backup, or fault).
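For reference, wiring a notifier into keepalived looks roughly like this (the script path and instance name below are hypothetical; keepalived invokes the notify script with the entity type, the instance name, and the new state as arguments):

```
vrrp_instance VR_1 {
    ...
    # Run on every state transition. keepalived passes three arguments:
    # "GROUP"|"INSTANCE", the instance name, and "MASTER"|"BACKUP"|"FAULT".
    notify /var/lib/neutron/ha_confs/<router_id>/notify_state.py
}
```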

Every time we reconfigure keepalived (when the router is created or updated), we tell it to execute a Python script (maintained as part of the repository). The script will:
1) Write the new state to a file in $state_path/ha_confs/router_id/state
2) Start the metadata proxy if the transition was to master, or shut it down if the transition was to backup or fault.
3) Notify the agent that a transition has occurred via a Unix domain socket. Steps 1 and 2 happen in the script, and not in the agent after it receives the notification, because we want them to execute even if the agent is down.
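A minimal sketch of what such a notifier script could do, under the assumptions that the function name and paths are invented here and that the agent listens on a Unix datagram socket (the metadata proxy handling in step 2 is elided):

```python
import os
import socket


def handle_state_change(state, conf_dir, agent_socket_path):
    """Persist the new VRRP state, then notify the L3 agent (best effort)."""
    # 1) Write the new state to disk so it survives even if the agent is down.
    with open(os.path.join(conf_dir, 'state'), 'w') as f:
        f.write(state)

    # 2) (elided) start the metadata proxy on transition to master,
    #    shut it down on transition to backup or fault.

    # 3) Notify the agent over a Unix domain socket; failure is tolerated
    #    because steps 1 and 2 already ran regardless of the agent's health.
    try:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        sock.sendto(state.encode(), agent_socket_path)
        sock.close()
    except OSError:
        pass  # the agent may be down; it will resync the state on startup
```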

The L3 agent will batch these state change notifications over a period of T seconds. Once T seconds have passed without new notifications arriving, it will send an RPC message to the server with a map of router ID to VRRP state on that specific agent. Every time the agent starts, it will perform a full sync with the controller (get all routers, configure them, clean up old namespaces), wait for state transitions to die down, read the current state of each router, and send an update to the controller. The RPC message will be retried indefinitely in case the management network is temporarily down or the agent is disconnected from it.
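The batching behavior described above can be sketched as a small debouncer; the class and method names are invented for illustration, and the real agent would pass an RPC call (with retries) as the send callback:

```python
import threading


class StateChangeBatcher:
    """Coalesce per-router VRRP state changes and flush them as one
    update once no new change has arrived for `period` seconds."""

    def __init__(self, period, send_update):
        self.period = period
        self.send_update = send_update  # e.g. an RPC call to the server
        self._states = {}               # router_id -> latest reported state
        self._timer = None
        self._lock = threading.Lock()

    def notify(self, router_id, state):
        with self._lock:
            self._states[router_id] = state
            if self._timer is not None:
                self._timer.cancel()    # a new change restarts the quiet period
            self._timer = threading.Timer(self.period, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            states, self._states = self._states, {}
        # In the real agent this send would be retried indefinitely.
        self.send_update(states)
```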

The server will then persist this information from the RPC message. The tables are already set up for this: each router has an entry in the HA bindings table per agent it is scheduled to, and that record contains the VRRP state on the specific agent. No DB migration will be necessary.
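The server-side handling could look roughly like the following sketch, where the binding class is a stand-in for the existing HA bindings row and the function name is hypothetical:

```python
class HARouterAgentBinding:
    """Stand-in for the existing HA bindings row: (router, agent, state)."""

    def __init__(self, router_id, agent_host, state='standby'):
        self.router_id = router_id
        self.agent_host = agent_host
        self.state = state


def update_router_states(bindings, reporting_host, reported_states):
    """Persist the VRRP states reported by one agent into its binding rows.

    `reported_states` is the map of router ID -> state sent over RPC.
    Only rows belonging to the reporting agent are touched.
    """
    for binding in bindings:
        if (binding.agent_host == reporting_host
                and binding.router_id in reported_states):
            binding.state = reported_states[binding.router_id]
```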

Why can't the l3-agent just read $state_path/ha_confs/router_id/state ?

Assaf Muller (amuller) wrote :

Of course it can, but when? When would the L3 agent update the controller if it isn't getting updates from the keepalived notify scripts?

Changed in neutron:
importance: Undecided → High
Assaf Muller (amuller) on 2014-09-16
Changed in neutron:
assignee: nobody → Assaf Muller (amuller)
status: New → In Progress
Assaf Muller (amuller) on 2014-09-23
description: updated
Assaf Muller (amuller) on 2014-09-24
description: updated

Related fix proposed to branch: master
Review: https://review.openstack.org/125339

Related fix proposed to branch: master
Review: https://review.openstack.org/125384

As you mentioned, the data model keeps track of whether each agent associated with the router is an active or a passive replica (see [1]). So if I understand this correctly, this bug should be about exposing this information to the admin and ensuring the data model is kept in sync with the backend implementation. For this reason, I found the bug report title and description a bit confusing.

Also, wouldn't this command:

neutron l3-agent-list-hosting-router

be more suitable?

The output today is the list of agents hosting the router, but we could add more attributes, like the state, when the router is HA.

[1] - https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L82

Assaf Muller (amuller) wrote :

I changed the title to be more straightforward. As for the CLI command, router-show is where I'd intuitively look as a user, only because l3-agent-list-hosting-router isn't as well known. I've been debating this internally, and the latter does seem better suited.

summary: - It's very difficult to know what is the state of a HA router on each L3
- agent
+ State of HA routers not exposed via API
description: updated

Related fix proposed to branch: master
Review: https://review.openstack.org/126188

Agree with you regarding router-show vs. l3-agent-list-hosting-router; the problem with the former is that its tabular form does not lend itself well to representing a list of agents associated with the router, so I suspect the rendering will be pretty ugly.

Regardless, the API is what counts.

Mark McClain (markmcclain) wrote :

This feels like a new feature and not bug fix. I'd rather see this go through the spec process because the changes to the notifications need to be coordinated with API refactoring.

Assaf Muller (amuller) wrote :

Blueprint:
https://blueprints.launchpad.net/neutron/+spec/report-ha-router-master

Spec review:
https://review.openstack.org/#/c/128613/

Mark, what do you mean by:
"the changes to the notifications need to be coordinated with API refactoring."?
Can you expand?

Eugene Nikanorov (enikanorov) wrote :

Should we move this effort under BP completely?

tags: added: api
Changed in neutron:
status: In Progress → Incomplete
Assaf Muller (amuller) wrote :

Yeah, now that there's a blueprint for this effort there's no need to have both a blueprint and a bug.

Changed in neutron:
status: Incomplete → Invalid

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/125973
Reason: Squashed into previous commit.

Changed in neutron:
status: Invalid → In Progress
Assaf Muller (amuller) wrote :

Removed Related and Closes bug tags from all patches in the series.

Changed in neutron:
status: In Progress → Invalid