Comment 0 for bug 1365453

Assaf Muller (amuller) wrote : It's very difficult to know what is the state of a HA router on each L3 agent

It's very difficult to know the state of an HA router on each L3 agent, which instance is the master, and where it is hosted. This is a maintenance nightmare. Currently, the only way to find out is to SSH into an agent and run:

cat $state_path/ha_confs/router_id/state

But this method requires accessing each individual agent. A more user-friendly way would be to expose this via the API, so admins could query a router and get a list of agents along with the router's state on each agent:

router-show <router_id>
Would show a list of agents the router is scheduled to, and the state of the router on each agent.

This is harder than it sounds and requires a few design decisions.

Keepalived doesn't provide a way to query the current VRRP state. The only way to learn it is through notifier scripts. These scripts are executed when a state transition occurs and receive the new state (master, backup, or fault). Every time we reconfigure keepalived (when the router is created or updated), we write an executable bash script with the router ID and VRRP state. This is the script that we configure keepalived to execute. The bash script then passes these two parameters to a Python executable, which forwards the information over a Unix domain socket to the L3 agent (I expect the Python script to grow).

The L3 agent will batch these state change notifications over a period of T seconds. Once T seconds have passed without a new notification arriving, it will send an RPC message to the server with a map of router ID to VRRP state on that specific agent. To cover the case where the agent crashes after a notification has been queued but before it has been sent, we'll also write each notification to disk; when the L3 agent starts, it will enqueue all of these notifications and remove them from the disk.
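To make the batching and crash-recovery idea concrete, here is a rough Python sketch. It is only an illustration under assumptions: the class name StateChangeBatcher, the spool directory layout, the JSON file format and the send callback are all hypothetical and not taken from the Neutron code. It also assumes that only the latest state per router matters within a batch.

```python
import json
import os
import time


class StateChangeBatcher:
    """Hypothetical sketch: batch (router_id, state) notifications and
    flush them once no new notification has arrived for quiet_period."""

    def __init__(self, quiet_period, spool_dir, send, clock=time.monotonic):
        self.quiet_period = quiet_period  # the "T seconds" from the description
        self.spool_dir = spool_dir        # assumed on-disk location for unsent items
        self.send = send                  # stand-in for the RPC call to the server
        self.clock = clock
        self.pending = {}                 # router_id -> latest VRRP state
        self.last_arrival = None
        os.makedirs(spool_dir, exist_ok=True)
        self._recover()

    def _spool_path(self, router_id):
        return os.path.join(self.spool_dir, router_id + ".json")

    def _recover(self):
        # On startup, enqueue whatever a previous run left on disk.
        for name in sorted(os.listdir(self.spool_dir)):
            path = os.path.join(self.spool_dir, name)
            with open(path) as f:
                record = json.load(f)
            os.remove(path)
            self.notify(record["router_id"], record["state"])

    def notify(self, router_id, state):
        # Write to disk first so a crash before the flush does not lose it.
        with open(self._spool_path(router_id), "w") as f:
            json.dump({"router_id": router_id, "state": state}, f)
        self.pending[router_id] = state
        self.last_arrival = self.clock()

    def poll(self):
        """Flush if the quiet period has elapsed; return True if a batch was sent."""
        if not self.pending:
            return False
        if self.clock() - self.last_arrival < self.quiet_period:
            return False
        batch = dict(self.pending)
        self.send(batch)
        # Only forget the notifications once the send has returned.
        for router_id in batch:
            os.remove(self._spool_path(router_id))
        self.pending.clear()
        return True
```

In this sketch the agent would call notify() for every message read from the Unix domain socket and call poll() periodically from its main loop. Every new notification restarts the quiet period, so a flapping router produces a single RPC message once it settles.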

The server will then persist this information when it receives the RPC message. The tables are already set up for this: each router has an entry in the HA bindings table per agent it is scheduled to, and the record contains the VRRP state on that specific agent. The API response for this router will now also contain a dict of {agent_id: VRRP state}.
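The server side could look roughly like the following Python sketch. It uses an in-memory dict as a stand-in for the HA bindings table. The class and method names (HaBindingStore, schedule, update_states, router_states) are made up for illustration and are not the actual Neutron DB API.

```python
class HaBindingStore:
    """Hypothetical in-memory stand-in for the HA bindings table."""

    def __init__(self):
        # (router_id, agent_id) -> VRRP state, None until the agent reports
        self._bindings = {}

    def schedule(self, router_id, agent_id):
        # One binding record per agent the router is scheduled to.
        self._bindings.setdefault((router_id, agent_id), None)

    def update_states(self, agent_id, states):
        """Handle the RPC payload: a map of router ID to VRRP state."""
        for router_id, state in states.items():
            key = (router_id, agent_id)
            # Ignore reports for routers not scheduled to this agent.
            if key in self._bindings:
                self._bindings[key] = state

    def router_states(self, router_id):
        """Build the {agent_id: VRRP state} dict for the API response."""
        return {
            agent_id: state
            for (rid, agent_id), state in self._bindings.items()
            if rid == router_id
        }
```

For example, after scheduling a router to two agents and receiving one report from each, router_states() returns a dict with one entry per agent, which is the shape proposed for the router-show output.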