Activity log for bug #1637452

Date Who What changed Old value New value Message
2016-10-28 09:38:55 LIU Yulong bug added bug
2016-10-28 09:50:10 LIU Yulong description

Old value:

Description:
If the VXLAN NIC used for the HA network of an HA router goes down and comes back up, the tunnels may end up in an inconsistent state, and there is no quick way to recover.

ENV:
stable/mitaka (8.1.2 with backported L3 patches)
VXLAN

Some related settings:
l2_population = true
arp_responder = true
tunnel_types = vxlan
prevent_arp_spoofing = true
l3_ha = True
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2

How to reproduce:
Assume we have exactly:
two network nodes: N1 and N2
two compute nodes: C1 and C2
two HA routers: R1 and R2
two networks/subnets for VMs: NET1 and NET2

Steps:
1. `ifdown` the VXLAN NICs of N1 and N2 that are used for the HA network.
   Phenomenon:
     Every HA router may become active on both nodes,
     i.e. R1 and R2 are both in active state on both N1 and N2.
2. Create VM1 and VM2.
   (You can use `nova boot --availability-zone <zone>:<host>`
    to place each VM on the compute node you need.)
   The topology is then:
   R1 -- NET1 -- VM1 (resides on C1)
   R2 -- NET2 -- VM2 (resides on C2)
   Phenomenon:
     C1 and C2 each have only one tunnel, to a single network node:
     C1 has a VXLAN tunnel to N1, C2 has a VXLAN tunnel to N2.
   (This is because the HA router's qr- device port did not change its binding host.
   Even though the HA router is active on multiple nodes, a `network:router_interface` port
   can have only one ml2_port_bindings host.)
3. `ifup` the VXLAN NICs of N1 and N2 that are used for the HA network.
   Phenomenon:
     Each HA router may return to one active and one standby instance.
   Assume: R1's master is on N2.
           R2's master is on N1.

Then the problem appears. Recall the topology:
    R1 -- NET1 -- VM1 (resides on C1)
    R2 -- NET2 -- VM2 (resides on C2)
and the tunnels:
    C1 has only one tunnel, to N1.
    C2 has only one tunnel, to N2.
In other words:
C1 has no tunnel to N2, where R1's master is.
C2 has no tunnel to N1, where R2's master is.
As a result, the VMs' traffic cannot reach their routers.

Restarting the OVS agent on the compute nodes does not solve the issue: l2_population is enabled and no new port is plugged, which means:
1. the tunnel from C1 to N2 is not created automatically;
2. the tunnel from C2 to N1 is not created automatically.

One way to recover is to restart the neutron services on N1 and then on N2.
This may leave the masters of all HA routers on a single network node.

Production environment problem:
In production we saw unexplained link-down messages, which we simulate here with ifdown/ifup:
[ +4.999840] ixgbe 0000:03:00.0 eth2: NIC Link is Down
[ +1.926155] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX

New value:

Description:
If the VXLAN NIC used for the HA network of an HA router goes down and comes back up, the tunnels may end up in an inconsistent state, and there is no quick way to recover.

ENV:
stable/mitaka (8.1.2 with backported L3 patches)
VXLAN

Some related settings:
l2_population = true
arp_responder = true
tunnel_types = vxlan
prevent_arp_spoofing = true
l3_ha = True
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2

How to reproduce:
Assume we have exactly:
two network nodes: N1 and N2
two compute nodes: C1 and C2
two HA routers: R1 and R2
two networks/subnets for VMs: NET1 and NET2

Steps:
1. `ifdown` the VXLAN NICs of N1 and N2 that are used for the HA network.
   Phenomenon:
     Every HA router may become active on both nodes,
     i.e. R1 and R2 are both in active state on both N1 and N2.
2. Create VM1 and VM2.
   (You can use `nova boot --availability-zone <zone>:<host>`
    to place each VM on the compute node you need.)
   The topology is then:
   R1 -- NET1 -- VM1 (resides on C1)
   R2 -- NET2 -- VM2 (resides on C2)
   Phenomenon:
     C1 and C2 each have only one tunnel, to a single network node:
     C1 has a VXLAN tunnel to N1, C2 has a VXLAN tunnel to N2.
   (This is because the HA router's qr- device port did not change its binding host.
   Even though the HA router is active on multiple nodes, a `network:router_interface` port
   can have only one ml2_port_bindings host.)
   Make sure that no DHCP port is involved in this step.
3. `ifup` the VXLAN NICs of N1 and N2 that are used for the HA network.
   Phenomenon:
     Each HA router may return to one active and one standby instance.
   Assume: R1's master is on N2.
           R2's master is on N1.

Then the problem appears. Recall the topology:
    R1 -- NET1 -- VM1 (resides on C1)
    R2 -- NET2 -- VM2 (resides on C2)
and the tunnels:
    C1 has only one tunnel, to N1.
    C2 has only one tunnel, to N2.
In other words:
C1 has no tunnel to N2, where R1's master is.
C2 has no tunnel to N1, where R2's master is.
As a result, the VMs' traffic cannot reach their routers.

Restarting the OVS agent on the compute nodes does not solve the issue: l2_population is enabled and no new port is plugged, which means:
1. the tunnel from C1 to N2 is not created automatically;
2. the tunnel from C2 to N1 is not created automatically.

One way to recover is to restart the neutron services on N1 and then on N2.
This may leave the masters of all HA routers on a single network node.

Production environment problem:
In production we saw unexplained link-down messages, which we simulate here with ifdown/ifup:
[ +4.999840] ixgbe 0000:03:00.0 eth2: NIC Link is Down
[ +1.926155] ixgbe 0000:03:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
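
The reasoning in the description can be checked with a small toy model. This is only a sketch, not Neutron code: the helper names `build_tunnels` and `unreachable` are hypothetical, and the model assumes that with l2_population a compute node gets a tunnel only toward the network node that holds the port binding of its router's interface port (R1 bound to N1, R2 bound to N2, as implied by the tunnels listed in the report).

```python
from collections import defaultdict

# Toy model of the scenario in the bug description; NOT Neutron code.
# Assumption: tunnels follow the router_interface port binding host only.


def build_tunnels(vms, binding_host):
    """Return {compute node: set of network nodes it has a tunnel to}.

    vms: list of (vm, compute_node, router) tuples
    binding_host: router -> network node holding the port binding
    """
    tunnels = defaultdict(set)
    for _vm, compute, router in vms:
        tunnels[compute].add(binding_host[router])
    return dict(tunnels)


def unreachable(vms, tunnels, master_host):
    """List VMs whose compute node has no tunnel to the router's master node."""
    return [
        vm
        for vm, compute, router in vms
        if master_host[router] not in tunnels.get(compute, set())
    ]


# Topology from the report: R1 -- NET1 -- VM1 (C1), R2 -- NET2 -- VM2 (C2).
vms = [("VM1", "C1", "R1"), ("VM2", "C2", "R2")]
# Bindings did not change while both routers were active on both nodes.
binding_host = {"R1": "N1", "R2": "N2"}
# After `ifup`, the masters settle on the opposite nodes.
master_after_ifup = {"R1": "N2", "R2": "N1"}

tunnels = build_tunnels(vms, binding_host)
print(tunnels)
print(unreachable(vms, tunnels, master_after_ifup))
```

Under these assumptions both VMs are reported as unreachable. If the masters had settled on the nodes that hold the bindings, the list would be empty.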
2016-10-28 11:40:26 LIU Yulong marked as duplicate 1636466
2016-10-31 03:06:57 LIU Yulong removed duplicate marker 1636466
2016-11-09 07:14:55 LIU Yulong neutron: status New Incomplete
2016-11-09 07:15:02 LIU Yulong neutron: status Incomplete Invalid