Comment 21 for bug 1749425

Edward Hope-Morley (hopem) wrote :

Hi folks, while trying to reproduce this behaviour myself I think I've stumbled upon something interesting. I set up a test as follows and checked for errors at specific points. I have a 4-node setup (24 core/64G RAM) with 3 gateways and 1 compute. Neutron is configured with l3_ha enabled and max_l3_agents_per_router set to 3:

 stage 1: created two projects with 1 router each (which gives two sets of keepalived each with the same VR_ID (1)) and checked keepalived logs - system load is minimal, no re-elections observed post-create.

 stage 2: scaled horizontally to 200 projects each with 1 router (giving 200 routers with VR_ID 1 each within their own network). system load is minimal, no re-elections observed post-create, observed that all master state routers are on the same host.

 stage 3: scaled one project vertically by creating 200 routers within the same project. As I got into the VR_70s I started to see some of the existing routers get re-elected, e.g. "VRRP_Instance(VR_76) Received higher prio advert". If I run tcpdump on one of my ha- interfaces inside a qrouter- namespace I see a flood of "VRRPv2, Advertisement" with each VR_ID being advertised every 2s by its current master (as expected, since that's the default interval in neutron). The consequence is that neutron frequently has to catch up with keepalived (by running neutron-keepalived-state-change), which causes more traffic, all without cause since there is no need for these failovers to be occurring.
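 For a rough sense of the advert volume in stage 3, here is a back-of-envelope sketch. It assumes all routers in the project share one HA network and that exactly one master per VR_ID is advertising at any time; the function name is just for illustration and this ignores any extra adverts sent during re-elections.

```python
def adverts_per_second(num_routers: int, advert_int_s: float) -> float:
    """Steady-state VRRP adverts per second seen on a shared HA network.

    Assumes one master per VR_ID, each sending one advert per interval.
    """
    return num_routers / advert_int_s


# 2s is the neutron default advert interval mentioned above.
for n in (2, 76, 200):
    print(f"{n} routers: {adverts_per_second(n, 2.0):.1f} adverts/s")
```

 So with 200 routers at the default interval, every ha- interface on that network would be seeing on the order of 100 adverts per second under these assumptions.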

 Since the advert interval is configurable in neutron [1], I am going to go ahead and try changing it to see if that stops these re-elections, but that seems a little hacky as a fix, so I'm wondering if there's another way to mitigate these effects. I need to double-check the VRRP spec, but IIRC since these advertisements are sent out by the master, increasing the interval would also increase how long it takes for a re-election to occur if the master dies (spec says "(3 * Advertisement_Interval) + Skew_time"), and during that time VMs would be unreachable, so maybe there's another way.
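 To put numbers on that trade-off, here is a small sketch of the VRRPv2 formula quoted above, with Skew_Time = (256 - Priority) / 256 seconds as per RFC 3768. The priority of 50 is an assumption on my part (I believe it is what neutron gives its keepalived instances, but I haven't verified it here), and keepalived's actual timers may differ slightly.

```python
def skew_time(priority: int) -> float:
    """VRRPv2 Skew_Time in seconds (RFC 3768)."""
    return (256 - priority) / 256


def master_down_interval(advert_int_s: float, priority: int) -> float:
    """Seconds a backup waits without adverts before declaring the master down."""
    return 3 * advert_int_s + skew_time(priority)


ASSUMED_PRIORITY = 50  # assumption, not taken from this deployment's config

for advert_int in (2, 5, 10):
    mdi = master_down_interval(advert_int, ASSUMED_PRIORITY)
    print(f"advert_int={advert_int}s -> master down after ~{mdi:.2f}s")
```

 With the default 2s interval that comes to roughly 6.8s of potential outage on master failure under these assumptions, and raising the interval scales it up about threefold for every second added.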

[1] https://github.com/openstack/neutron/blob/b90ec94dc3f83f63bdb505ace1e4c272435c494b/neutron/conf/agent/l3/ha.py#L35