I have some thoughts on this problem, as below:
1, First of all, we need to figure out why, in theory, multiple ACTIVE master HA nodes can appear.
Assume the master is dead (at this time, its status in the DB is still ACTIVE); a slave will then be elected as the new master. After the old master has recovered, this.enable_keepalived() (ha_router.py L444) [4] will be invoked to spawn a keepalived instance, so multiple ACTIVE master HA nodes occur. (Related patch - https://review.openstack.org/#/c/357458/)
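The race above can be sketched as a toy simulation (this is not Neutron code; the node names and the status store are hypothetical stand-ins for the ha_router port rows kept by neutron-server):

```python
# Toy model of the split-brain race: the dead master's DB row is never
# reset, so when it comes back and respawns keepalived, nothing forces it
# through a DOWN/STANDBY state first.

port_status = {"node-1": "ACTIVE", "node-2": "STANDBY"}

def master_dies(node):
    # The node crashes, but nothing resets its DB row: it stays ACTIVE.
    pass

def promote(node):
    # VRRP elects a new master, which reports ACTIVE to neutron-server.
    port_status[node] = "ACTIVE"

def old_master_recovers(node):
    # enable_keepalived() respawns keepalived; the stale ACTIVE row is
    # left untouched, so the old master still looks like a master.
    pass

master_dies("node-1")
promote("node-2")
old_master_recovers("node-1")

active = [n for n, s in port_status.items() if s == "ACTIVE"]
print(active)  # → ['node-1', 'node-2'] — two ACTIVE masters
```

This is why the fix has to reset the rows to DOWN somewhere: the recovering node must not be allowed to keep its stale ACTIVE status.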
So the key to solving this problem is to reset the status of all HA ports to DOWN at a certain code path; the patch https://review.openstack.org/#/c/470905/ addresses this point. But that patch sets status=DOWN in the code path 'fetch_and_sync_all_routers -> get_router_ids', which leads to a bigger problem when the load is high.
2, Why does setting status=DOWN in the code path 'fetch_and_sync_all_routers -> get_router_ids' lead to a bigger problem when the load is high?
If the l3-agent is judged not alive by the heartbeat check, it will be set to status=AGENT_REVIVED [1]; the l3-agent will then be triggered to do a full sync (self.fullsync=True) [2], so the code path 'periodic_sync_routers_task -> fetch_and_sync_all_routers' will be called again and again [3].
All these operations aggravate the load on the l3-agent, l2-agent, DB, MQ, etc. Conversely, a high load also aggravates the AGENT_REVIVED case.
So it is a vicious circle; the patch https://review.openstack.org/#/c/522792/ addresses this point. It uses the code path '__init__ -> get_service_plugin_list -> _update_ha_network_port_status' instead of the code path 'periodic_sync_routers_task -> fetch_and_sync_all_routers'.
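The vicious circle can be shown with a small toy model (again not Neutron code; all numbers are made up for illustration): a full sync adds load, load delays heartbeats, a late heartbeat marks the agent AGENT_REVIVED, and AGENT_REVIVED triggers yet another full sync.

```python
# Toy model of the AGENT_REVIVED feedback loop described in point 2.

AGENT_DOWN_TIME = 10  # seconds before a silent agent is considered down

def run(cycles, sync_cost):
    load, full_syncs = 0, 0
    fullsync = True  # the agent always starts with a full sync
    for _ in range(cycles):
        if fullsync:
            full_syncs += 1
            load += sync_cost          # periodic_sync_routers_task work
        heartbeat_latency = load       # load directly delays the heartbeat
        fullsync = heartbeat_latency > AGENT_DOWN_TIME  # AGENT_REVIVED path
        load = max(0, load - 5)        # some load drains each cycle
    return full_syncs

light = run(cycles=20, sync_cost=4)    # one sync, then the system settles
heavy = run(cycles=20, sync_cost=20)   # every cycle re-triggers a full sync
print(light, heavy)  # → 1 20
```

Under light load the initial sync is the only one; once the per-sync cost exceeds what the system can drain before the heartbeat deadline, every cycle re-enters the full-sync path, which is exactly why doing the status reset once in __init__ is safer than doing it inside the periodic sync.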
3, As we have seen, a small heartbeat value can cause AGENT_REVIVED and thus aggravate the load, and a high load can cause other problems, like the phenomena Xav mentioned before, which I paste below as well:
- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.
This is just one example; there may be other circumstances. All of these can mislead us into thinking the fix doesn't fix the problem.
A high load can also cause other similar problems, for example:
a, it can cause the neutron-keepalived-state-change process to exit due to a TERM signal [5] (https://paste.ubuntu.com/26450042/). neutron-keepalived-state-change is used to monitor VRRP VIP changes and then report the ha_router's status to neutron-server [6], so the l3-agent will no longer be able to update the status of the HA ports; thus we can see the multiple-ACTIVE case, the multiple-STANDBY case, or others.
b, it can cause the RPC messages sent from [6] to not be handled well.
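For context on point a, the core of what neutron-keepalived-state-change does can be sketched as follows (a hedged, simplified sketch, not the real implementation: the real daemon watches "ip -o monitor address" output for the VRRP VIP appearing or disappearing; the VIP value and line texts below are illustrative):

```python
# Minimal sketch of translating ip-monitor address events into the
# master/backup status that gets reported upstream to neutron-server.

VIP = "169.254.0.1/24"  # hypothetical VRRP VIP for one ha_router

def state_from_line(line, vip=VIP):
    """Return 'master', 'backup', or None for one ip-monitor event line."""
    if vip not in line:
        return None  # event for some other address; ignore it
    # iproute2 prefixes address-removal events with "Deleted".
    return "backup" if line.startswith("Deleted") else "master"

print(state_from_line("3: eth0    inet 169.254.0.1/24 scope global eth0"))          # → master
print(state_from_line("Deleted 3: eth0    inet 169.254.0.1/24 scope global eth0"))  # → backup
print(state_from_line("3: eth0    inet 10.0.0.5/24 scope global eth0"))             # → None
```

If this process is killed by a TERM signal, the VIP transitions are simply never observed, so the status reported to neutron-server goes stale, which matches the multiple-ACTIVE / multiple-STANDBY symptoms above.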
So for this problem, my concrete opinion is:
a, bump up heartbeat option (agent_down_time)
b, we need this patch: https://review.openstack.org/#/c/522641/
c, Ensure that other components (like MQ, DB etc) have no performance problems
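For point a, a sketch of the relevant settings (the values are illustrative, not recommendations; tune them to your deployment):

```ini
# neutron.conf on the server side: how long an agent may stay silent
# before it is considered dead (and AGENT_REVIVED when it reports again).
[DEFAULT]
agent_down_time = 150

# neutron.conf on the agent side: how often agents report state; keep this
# well under agent_down_time (the usual rule of thumb is at most half).
[agent]
report_interval = 30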
[1] https://github.com/openstack/neutron/blob/stable/ocata/neutron/db/agents_db.py#L354
[2] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/agent.py#L736
[3] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/agent.py#L583
[4] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/ha_router.py#L444
[5] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/keepalived_state_change.py#L134
[6] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/keepalived_state_change.py#L71