HA router & L2 pop: link cannot recover after the tunnel NIC goes down and comes back up

Bug #1637452 reported by LIU Yulong
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
neutron   Invalid   Undecided    Unassigned

Bug Description

Description:
If the VXLAN NIC used by the HA network of an HA router goes down and comes back up, the tunnels may end up in an inconsistent state,
and there is no quick way to recover.

ENV:
stable/mitaka (8.1.2 with backported L3 patches)
VXLAN

Some related settings:

l2_population = true
arp_responder = true
tunnel_types = vxlan
prevent_arp_spoofing = true

l3_ha = True
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2

How to reproduce:
Assume the deployment consists of exactly:
two network nodes: N1 and N2
two compute nodes: C1 and C2
two HA routers: R1 and R2
two networks/subnets for VM: NET1 and NET2

Steps:
1. `ifdown` the VXLAN NICs of N1 and N2 that are used for the HA network.
    phenomenon:
      Every HA router may become active on both network nodes.
      In other words, R1 and R2 are both in the active state on N1 and on N2.

2. Create VM1 and VM2.
   (You can use `nova boot --availability-zone <zone>:<host>`
    to create a VM on the compute node you need.)
   Then the topology will be:
   R1 -- NET1 -- VM1 (reside in C1)
   R2 -- NET2 -- VM2 (reside in C2)

   phenomenon:
     C1 and C2 each have only one tunnel, to a single network node:
     C1 has a VXLAN tunnel to N1, and C2 has a VXLAN tunnel to N2.

   (This is because the qr- device port of the HA router did not change its binding host.
   Although the HA router is active on multiple nodes, the `network:router_interface` port
   can have only one ml2_port_bindings host.)

Make sure that no DHCP port is involved in this step.

3. `ifup` the VXLAN NICs of N1 and N2 that are used for the HA network.
   phenomenon: Each HA router may return to one active and one standby instance.
   Assume: R1's master is on N2.
           R2's master is on N1.
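
The tunnel layout seen in step 2 can be modelled with a minimal sketch. This is not Neutron code: the function `tunnels_from_bindings` and the data layout are hypothetical, and the model only assumes that l2_population connects hosts that have a port bound on the same network. The binding hosts of the router interface ports (N1 for R1, N2 for R2) are inferred from the tunnels observed above.

```python
from collections import defaultdict


def tunnels_from_bindings(port_bindings):
    """Return {host: set of peer hosts} for a list of (network, host) bindings.

    Simplified model: two hosts get a tunnel to each other only if each
    of them has at least one port bound on the same network.
    """
    hosts_by_network = defaultdict(set)
    for network, host in port_bindings:
        hosts_by_network[network].add(host)

    tunnels = defaultdict(set)
    for hosts in hosts_by_network.values():
        for host in hosts:
            tunnels[host] |= hosts - {host}
    return dict(tunnels)


bindings = [
    ("NET1", "N1"),  # R1 router interface port: single binding host (inferred)
    ("NET1", "C1"),  # VM1
    ("NET2", "N2"),  # R2 router interface port: single binding host (inferred)
    ("NET2", "C2"),  # VM2
]
tunnels = tunnels_from_bindings(bindings)
print(sorted(tunnels["C1"]), sorted(tunnels["C2"]))
```

Because each router interface port contributes only one (network, host) pair, each compute node ends up with a single network node peer, even though both routers are active on both network nodes at that moment.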

Then the problem appears.

Recall the topology:
    R1 -- NET1 -- VM1 (reside in C1)
    R2 -- NET2 -- VM2 (reside in C2)
and the tunnels:
    C1 has only one tunnel, to N1.
    C2 has only one tunnel, to N2.

In other words:
C1 has no tunnel to N2, where R1's master resides.
C2 has no tunnel to N1, where R2's master resides.

As a result, the VMs CANNOT REACH their routers.
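
The mismatch can be checked mechanically. The following is only an illustrative sketch with hypothetical names (`unreachable_vms` and the dictionaries are not part of Neutron); it compares the tunnels each compute node has with the node on which each router's master runs.

```python
def unreachable_vms(vms, master_host, tunnels):
    """Return the VMs whose compute node has no tunnel to its router's master node."""
    broken = []
    for vm, (compute, router) in sorted(vms.items()):
        if master_host[router] not in tunnels.get(compute, set()):
            broken.append(vm)
    return broken


# State taken from the report: VM placement, existing tunnels, masters after ifup.
vms = {"VM1": ("C1", "R1"), "VM2": ("C2", "R2")}
tunnels = {"C1": {"N1"}, "C2": {"N2"}}
masters_after_ifup = {"R1": "N2", "R2": "N1"}

print(unreachable_vms(vms, masters_after_ifup, tunnels))
```

If the masters had stayed on the nodes the tunnels point to (R1 on N1, R2 on N2), the same check would return an empty list.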

Restarting the OVS agent on the compute nodes does not solve the issue:
l2_population is enabled
and no new port is plugged, which means:
1. the tunnel from C1 to N2 is not created automatically;
2. the tunnel from C2 to N1 is not created automatically.

One way to recover is to restart the neutron services on N1 and then on N2.
This may leave the masters of all HA routers on a single network node.

Production environment problem:
We saw some unexplained link-down messages, which we simulate here with ifdown/ifup:
[ +4.999840] ixgbe 0000:03:00.0 eth2: NIC Link is Down
[ +1.926155] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
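
Link flaps like the one above can be picked out of a kernel log with a small script. This is a sketch under assumptions: the bracketed values are taken to be deltas from the previous message, the message format is the ixgbe one shown here, and `find_flaps` is a hypothetical helper.

```python
import re

# "[ +4.999840] ..." : delta in seconds since the previous message (assumed).
DELTA_RE = re.compile(r"^\[\s*\+(\d+\.\d+)\]")
# "... eth2: NIC Link is Down" / "... eth2: NIC Link is Up 10 Gbps, ..."
LINK_RE = re.compile(r"(?P<iface>\S+): NIC Link is (?P<state>Up|Down)")


def find_flaps(lines):
    """Return (iface, seconds_down) for every Down followed by an Up on that iface."""
    clock = 0.0       # running time built from the per-line deltas
    down_since = {}   # iface -> clock value when the link went down
    flaps = []
    for line in lines:
        delta = DELTA_RE.match(line)
        if delta:
            clock += float(delta.group(1))
        link = LINK_RE.search(line)
        if not link:
            continue
        iface = link.group("iface")
        if link.group("state") == "Down":
            down_since[iface] = clock
        elif iface in down_since:
            flaps.append((iface, clock - down_since.pop(iface)))
    return flaps


sample = [
    "[ +4.999840] ixgbe 0000:03:00.0 eth2: NIC Link is Down",
    "[ +1.926155] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX",
]
flaps = find_flaps(sample)
print(flaps)
```

For the two lines in this report the script reports one flap on eth2 lasting roughly 1.9 seconds.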

LIU Yulong (dragon889) wrote :

One potential solution:
When a new port is plugged on a compute host, let l2_pop create tunnels to every network node on which the related HA router resides.
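
A minimal sketch of the difference between the current and the proposed behaviour. The function `network_node_peers`, its `all_ha_hosts` switch, and the dictionaries are hypothetical and do not reflect the actual l2_pop driver code.

```python
def network_node_peers(network, qr_binding_host, ha_router_hosts, all_ha_hosts=False):
    """Network nodes a compute host should tunnel to when a port on `network` is plugged.

    all_ha_hosts=False: only the single binding host of the router interface port.
    all_ha_hosts=True:  every network node hosting an instance of the HA router.
    """
    if all_ha_hosts:
        return set(ha_router_hosts[network])
    return {qr_binding_host[network]}


# Values from the reproduction scenario; the binding hosts are inferred from the tunnels seen.
qr_binding_host = {"NET1": "N1", "NET2": "N2"}
ha_router_hosts = {"NET1": ["N1", "N2"], "NET2": ["N1", "N2"]}

current = network_node_peers("NET1", qr_binding_host, ha_router_hosts)
proposed = network_node_peers("NET1", qr_binding_host, ha_router_hosts, all_ha_hosts=True)
print(sorted(current), sorted(proposed))
```

With the proposed behaviour C1 would hold tunnels to both N1 and N2, so a master failover from one network node to the other would not leave VM1 without a path to R1.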

description: updated
LIU Yulong (dragon889) wrote :

Duplicate mark removed.

LIU Yulong (dragon889) wrote :
Changed in neutron:
status: New → Incomplete
status: Incomplete → Invalid