Floating IP is not reachable when the data NIC is brought back up (while testing the L3 HA VRRP feature)
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
neutron | In Progress | Medium | Sonu | |
Bug Description
Steps to reproduce:
1) create a router
2) set router gateway with the public network
3) router-
4) boot a vm
5) create a floating IP
6) associate the floating IP with the port ID of the booted VM
7) continuously ping the floating IP
8) bring the data NIC down on the active controller
9) bring the data NIC back up on the previously active controller
The floating IP is not reachable after the above steps.
Before starting the test scenario:
stack@padawan-
+------
| id | host | admin_state_up | alive | ha_state |
+------
| da302b3f-
| fc1cd20b-
+------
After bringing the data NIC down on the active controller:
stack@padawan-
+------
| id | host | admin_state_up | alive | ha_state |
+------
| da302b3f-
| fc1cd20b-
+------
After bringing the data NIC back up on the previously active controller:
stack@padawan-
+------
| id | host | admin_state_up | alive | ha_state |
+------
| da302b3f-
| fc1cd20b-
+------
Below is the ping output for the floating IP:
64 bytes from 172.21.11.11: icmp_seq=1 ttl=62 time=1.79 ms
64 bytes from 172.21.11.11: icmp_seq=2 ttl=62 time=1.52 ms
64 bytes from 172.21.11.11: icmp_seq=3 ttl=62 time=0.859 ms
64 bytes from 172.21.11.11: icmp_seq=4 ttl=62 time=0.937 ms
64 bytes from 172.21.11.11: icmp_seq=5 ttl=62 time=0.645 ms
64 bytes from 172.21.11.11: icmp_seq=6 ttl=62 time=1.45 ms
64 bytes from 172.21.11.11: icmp_seq=7 ttl=62 time=0.804 ms
64 bytes from 172.21.11.11: icmp_seq=8 ttl=62 time=0.872 ms
64 bytes from 172.21.11.11: icmp_seq=9 ttl=62 time=0.780 ms
64 bytes from 172.21.11.11: icmp_seq=10 ttl=62 time=0.719 ms
64 bytes from 172.21.11.11: icmp_seq=11 ttl=62 time=1.19 ms
64 bytes from 172.21.11.11: icmp_seq=12 ttl=62 time=1.40 ms
64 bytes from 172.21.11.11: icmp_seq=13 ttl=62 time=0.633 ms
64 bytes from 172.21.11.11: icmp_seq=14 ttl=62 time=0.888 ms
64 bytes from 172.21.11.11: icmp_seq=15 ttl=62 time=1.76 ms
64 bytes from 172.21.11.11: icmp_seq=16 ttl=62 time=0.906 ms
64 bytes from 172.21.11.11: icmp_seq=17 ttl=62 time=0.727 ms
64 bytes from 172.21.11.11: icmp_seq=18 ttl=62 time=1.38 ms
64 bytes from 172.21.11.11: icmp_seq=19 ttl=62 time=0.719 ms
64 bytes from 172.21.11.11: icmp_seq=20 ttl=62 time=0.778 ms
64 bytes from 172.21.11.11: icmp_seq=21 ttl=62 time=1.58 ms
64 bytes from 172.21.11.11: icmp_seq=22 ttl=62 time=0.717 ms
64 bytes from 172.21.11.11: icmp_seq=23 ttl=62 time=0.782 ms
64 bytes from 172.21.11.11: icmp_seq=24 ttl=62 time=1.01 ms
64 bytes from 172.21.11.11: icmp_seq=25 ttl=62 time=0.709 ms
64 bytes from 172.21.11.11: icmp_seq=26 ttl=62 time=1.03 ms
64 bytes from 172.21.11.11: icmp_seq=27 ttl=62 time=0.826 ms
64 bytes from 172.21.11.11: icmp_seq=28 ttl=62 time=0.715 ms
64 bytes from 172.21.11.11: icmp_seq=29 ttl=62 time=0.974 ms
64 bytes from 172.21.11.11: icmp_seq=30 ttl=62 time=0.826 ms
64 bytes from 172.21.11.11: icmp_seq=39 ttl=62 time=3.61 ms
64 bytes from 172.21.11.11: icmp_seq=40 ttl=62 time=0.841 ms
64 bytes from 172.21.11.11: icmp_seq=41 ttl=62 time=0.824 ms
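The jump in `icmp_seq` in the output above (30 to 39) is where replies were lost. As a minimal sketch, assuming the ping output has been captured as text, a small Python helper can report such gaps; the function name `find_seq_gaps` is made up for this illustration.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_seq_gaps(ping_output):
    """Return (last_seen, next_seen, lost_count) for every gap in icmp_seq."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur, cur - prev - 1))
    return gaps


# Four lines copied from the ping output in this report.
sample = """\
64 bytes from 172.21.11.11: icmp_seq=29 ttl=62 time=0.974 ms
64 bytes from 172.21.11.11: icmp_seq=30 ttl=62 time=0.826 ms
64 bytes from 172.21.11.11: icmp_seq=39 ttl=62 time=3.61 ms
64 bytes from 172.21.11.11: icmp_seq=40 ttl=62 time=0.841 ms
"""
print(find_seq_gaps(sample))  # [(30, 39, 8)]
```

The helper does not handle sequence-number wraparound, which is enough for a short capture like this one.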
The gateway does not respond to ping either.
The floating IP responds from inside the router namespace but not from outside:
stack@padawan-
root@padawan-
PING 172.21.11.11 (172.21.11.11) 56(84) bytes of data.
64 bytes from 172.21.11.11: icmp_seq=1 ttl=64 time=2.01 ms
64 bytes from 172.21.11.11: icmp_seq=2 ttl=64 time=1.17 ms
Changed in neutron:
importance: Undecided → Medium
tags: added: l3-ha
I have noticed this behavior as well.
Ping stops when you bring the NIC back up on the original master.
When the NIC goes down on the current master, the standby controller takes over ownership of the router on its own, which is expected.
However, the original master does not realize that its data network is down and continues to hold mastership of the routers.
This is a split-brain situation: the original master still thinks it is the master of the router, since it is oblivious to the data-path connectivity problem, whereas per the VRRP protocol the standby controller has gained mastership and is serving the ping requests.
When the NIC on the original master comes back up, the router on the original master does not go through a state transition (as far as it knows, it was the master all along). Since there is no state transition, gratuitous ARPs are not sent, and as a result the FIP stops responding to ping.
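The sequence described above can be sketched as a toy model. This is not keepalived or neutron code: the class `ToyRouter`, the helper `is_split_brain`, and the controller names are hypothetical, and the model assumes (as the comment above does) that a GARP is emitted only on an actual transition into the master state.

```python
class ToyRouter:
    """Tiny stand-in for one controller's view of an HA router."""

    def __init__(self, name, state):
        self.name = name
        self.state = state      # "master" or "backup"
        self.garps_sent = 0

    def set_state(self, new_state):
        # Assumption: a GARP is sent only when the state really changes to master.
        if new_state == "master" and self.state != "master":
            self.garps_sent += 1
        self.state = new_state


def is_split_brain(routers):
    """True when more than one instance believes it is master."""
    return sum(r.state == "master" for r in routers) > 1


a = ToyRouter("controller-a", "master")   # original master (hypothetical name)
b = ToyRouter("controller-b", "backup")   # standby (hypothetical name)

# Data NIC on A goes down: B stops hearing advertisements and takes over.
# A cannot hear B either, so A's own view stays "master".
b.set_state("master")
split = is_split_brain([a, b])

# NIC on A comes back up: from A's point of view nothing changed.
a.set_state("master")

print(split, a.garps_sent, b.garps_sent)  # True 0 1
```

In this model the original master never announces itself again after the NIC comes back, which matches the missing gratuitous ARPs described above.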
The split brain can be prevented by health-checking the master. I think the patch https://review.openstack.org/#/c/273546/ tries to solve this.
A workaround is to restart the L3 agent on the original master after bringing the NIC up. Ping resumes because a router failover happens.
Another solution is to have keepalived on the master repeat the GARPs at regular intervals.
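To illustrate what repeating GARPs would put on the wire, here is a minimal sketch that builds a gratuitous ARP frame by hand and can resend it at a fixed interval. It is not how keepalived does it. The MAC address, the function names, and the interface argument are assumptions for the example, and sending requires root and a Linux `AF_PACKET` socket.

```python
import socket
import struct
import time


def build_garp(mac, ip):
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    src_ip = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, 6-byte MAC, 4-byte IP, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + src_ip          # sender hardware and protocol address
    arp += b"\x00" * 6 + src_ip      # target IP equals sender IP: gratuitous
    return eth + arp


def send_periodically(ifname, frame, interval, count):
    """Send the frame `count` times, `interval` seconds apart (Linux, root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        for _ in range(count):
            sock.send(frame)
            time.sleep(interval)


# Hypothetical MAC; the IP is the floating IP from this report.
frame = build_garp("fa:16:3e:00:00:01", "172.21.11.11")
print(len(frame))  # 42
```

`send_periodically` is defined but not called here. In a real deployment it would have to run inside the router namespace on the interface that holds the floating IP.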