neutron

Floating IP is not reachable when the data NIC is brought up again (while testing the L3 HA VRRP feature)

Bug #1554942 reported by Abhishek G M on 2016-03-09

This bug report is a duplicate of: Bug #1375625: Problem in l3-agent tenant-network interface would cause split-brain in HA router.

This bug affects 1 person

Affects   Status        Importance   Assigned to   Milestone
neutron   In Progress   Medium       Sonu

Bug Description

Steps to reproduce:
1) Create a router.
2) Set the router gateway to the public network.
3) Attach the subnet with router-interface-add.
4) Boot a VM.
5) Create a floating IP.
6) Associate the floating IP with the port ID of the booted VM.
7) Ping the floating IP continuously.
8) Bring down the data NIC on the active controller.
9) Bring the data NIC on the previously active controller back up.
After these steps the floating IP is no longer reachable.
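
For anyone scripting the reproduction, the steps above could be written down roughly as follows. This is only a sketch: the helper name, every resource name, and the data NIC name are hypothetical placeholders, and it assumes the legacy neutron and nova command-line clients. Whether the router is created as an HA router depends on the deployment (for example `l3_ha = True` in neutron.conf, or an admin passing `--ha True`), which this report does not state.

```python
def reproduction_commands(router, ext_net, subnet, vm, image, flavor, net_id, data_nic):
    """Return the CLI commands for steps 1-9 as argv lists (hypothetical helper).

    Nothing is executed here. Values in angle brackets are only known at run
    time and must be filled in by hand.
    """
    return [
        ["neutron", "router-create", router],                    # step 1
        ["neutron", "router-gateway-set", router, ext_net],      # step 2
        ["neutron", "router-interface-add", router, subnet],     # step 3
        ["nova", "boot", "--image", image, "--flavor", flavor,
         "--nic", "net-id=" + net_id, vm],                       # step 4
        ["neutron", "floatingip-create", ext_net],               # step 5
        ["neutron", "floatingip-associate",
         "<FLOATINGIP_ID>", "<VM_PORT_ID>"],                     # step 6
        ["ping", "<FLOATING_IP>"],                               # step 7
        # Steps 8 and 9 run on the controller hosting the active router.
        ["sudo", "ip", "link", "set", "dev", data_nic, "down"],  # step 8
        ["sudo", "ip", "link", "set", "dev", data_nic, "up"],    # step 9
    ]


if __name__ == "__main__":
    for argv in reproduction_commands("r1", "ext-net", "sub1", "vm1",
                                      "cirros", "m1.tiny", "<NET_ID>", "eth1"):
        print(" ".join(argv))
```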

Before starting the test scenario:
stack@padawan-ccp-c1-m1-mgmt:~$ neutron l3-agent-list-hosting-router b4fd1fd4-1c21-4be3-8672-e085c3e195f9
+--------------------------------------+------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| da302b3f-b8c6-4068-9abd-6486cd6d0654 | padawan-ccp-c1-m3-mgmt | True | :-) | standby |
| fc1cd20b-81dd-4bbc-bfb8-5652c08dc0c3 | padawan-ccp-c1-m1-mgmt | True | :-) | active |
+--------------------------------------+------------------------+----------------+-------+----------+
After bringing down the data NIC of the active controller:
stack@padawan-ccp-c1-m1-mgmt:~$ neutron l3-agent-list-hosting-router b4fd1fd4-1c21-4be3-8672-e085c3e195f9
+--------------------------------------+------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| da302b3f-b8c6-4068-9abd-6486cd6d0654 | padawan-ccp-c1-m3-mgmt | True | :-) | active |
| fc1cd20b-81dd-4bbc-bfb8-5652c08dc0c3 | padawan-ccp-c1-m1-mgmt | True | :-) | active |
+--------------------------------------+------------------------+----------------+-------+----------+
After bringing the data NIC of the previously active controller back up:
stack@padawan-ccp-c1-m1-mgmt:~$ neutron l3-agent-list-hosting-router b4fd1fd4-1c21-4be3-8672-e085c3e195f9
+--------------------------------------+------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| da302b3f-b8c6-4068-9abd-6486cd6d0654 | padawan-ccp-c1-m3-mgmt | True | :-) | standby |
| fc1cd20b-81dd-4bbc-bfb8-5652c08dc0c3 | padawan-ccp-c1-m1-mgmt | True | :-) | active |
+--------------------------------------+------------------------+----------------+-------+----------+
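
The second listing above shows two agents reporting `active` at the same time. A small check for that condition could look like the sketch below. It assumes the table layout shown in this report; the function names are made up for illustration and are not part of neutron.

```python
SAMPLE = """\
+--------------------------------------+------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------+----------------+-------+----------+
| da302b3f-b8c6-4068-9abd-6486cd6d0654 | padawan-ccp-c1-m3-mgmt | True | :-) | active |
| fc1cd20b-81dd-4bbc-bfb8-5652c08dc0c3 | padawan-ccp-c1-m1-mgmt | True | :-) | active |
+--------------------------------------+------------------------+----------------+-------+----------+
"""


def parse_ha_states(table_text):
    """Map host -> ha_state from `neutron l3-agent-list-hosting-router` output."""
    header = None
    states = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first "|" row holds the column names
            continue
        row = dict(zip(header, cells))
        states[row["host"]] = row["ha_state"]
    return states


def active_hosts(states):
    """Return the sorted list of hosts whose ha_state is 'active'."""
    return sorted(host for host, state in states.items() if state == "active")


def is_split_brain(states):
    """More than one active instance of the same router means split brain."""
    return len(active_hosts(states)) > 1


if __name__ == "__main__":
    states = parse_ha_states(SAMPLE)
    print(active_hosts(states), is_split_brain(states))
```
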
Ping output for the floating IP:
64 bytes from 172.21.11.11: icmp_seq=1 ttl=62 time=1.79 ms
64 bytes from 172.21.11.11: icmp_seq=2 ttl=62 time=1.52 ms
64 bytes from 172.21.11.11: icmp_seq=3 ttl=62 time=0.859 ms
64 bytes from 172.21.11.11: icmp_seq=4 ttl=62 time=0.937 ms
64 bytes from 172.21.11.11: icmp_seq=5 ttl=62 time=0.645 ms
64 bytes from 172.21.11.11: icmp_seq=6 ttl=62 time=1.45 ms
64 bytes from 172.21.11.11: icmp_seq=7 ttl=62 time=0.804 ms
64 bytes from 172.21.11.11: icmp_seq=8 ttl=62 time=0.872 ms
64 bytes from 172.21.11.11: icmp_seq=9 ttl=62 time=0.780 ms
64 bytes from 172.21.11.11: icmp_seq=10 ttl=62 time=0.719 ms
64 bytes from 172.21.11.11: icmp_seq=11 ttl=62 time=1.19 ms
64 bytes from 172.21.11.11: icmp_seq=12 ttl=62 time=1.40 ms
64 bytes from 172.21.11.11: icmp_seq=13 ttl=62 time=0.633 ms
64 bytes from 172.21.11.11: icmp_seq=14 ttl=62 time=0.888 ms
64 bytes from 172.21.11.11: icmp_seq=15 ttl=62 time=1.76 ms
64 bytes from 172.21.11.11: icmp_seq=16 ttl=62 time=0.906 ms
64 bytes from 172.21.11.11: icmp_seq=17 ttl=62 time=0.727 ms
64 bytes from 172.21.11.11: icmp_seq=18 ttl=62 time=1.38 ms
64 bytes from 172.21.11.11: icmp_seq=19 ttl=62 time=0.719 ms
64 bytes from 172.21.11.11: icmp_seq=20 ttl=62 time=0.778 ms
64 bytes from 172.21.11.11: icmp_seq=21 ttl=62 time=1.58 ms
64 bytes from 172.21.11.11: icmp_seq=22 ttl=62 time=0.717 ms
64 bytes from 172.21.11.11: icmp_seq=23 ttl=62 time=0.782 ms
64 bytes from 172.21.11.11: icmp_seq=24 ttl=62 time=1.01 ms
64 bytes from 172.21.11.11: icmp_seq=25 ttl=62 time=0.709 ms
64 bytes from 172.21.11.11: icmp_seq=26 ttl=62 time=1.03 ms
64 bytes from 172.21.11.11: icmp_seq=27 ttl=62 time=0.826 ms
64 bytes from 172.21.11.11: icmp_seq=28 ttl=62 time=0.715 ms
64 bytes from 172.21.11.11: icmp_seq=29 ttl=62 time=0.974 ms
64 bytes from 172.21.11.11: icmp_seq=30 ttl=62 time=0.826 ms
64 bytes from 172.21.11.11: icmp_seq=39 ttl=62 time=3.61 ms
64 bytes from 172.21.11.11: icmp_seq=40 ttl=62 time=0.841 ms
64 bytes from 172.21.11.11: icmp_seq=41 ttl=62 time=0.824 ms
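
The output above jumps from icmp_seq=30 to icmp_seq=39, which presumably corresponds to the failover window after the NIC was brought down. A rough way to pull such gaps out of a ping log is sketched below; the function name is illustrative and it assumes the Linux ping output format shown here.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")

SAMPLE_PING = """\
64 bytes from 172.21.11.11: icmp_seq=29 ttl=62 time=0.974 ms
64 bytes from 172.21.11.11: icmp_seq=30 ttl=62 time=0.826 ms
64 bytes from 172.21.11.11: icmp_seq=39 ttl=62 time=3.61 ms
64 bytes from 172.21.11.11: icmp_seq=40 ttl=62 time=0.841 ms
"""


def missing_sequences(ping_output):
    """Return the icmp_seq numbers skipped between consecutive replies."""
    seqs = [int(match.group(1)) for match in SEQ_RE.finditer(ping_output)]
    missing = []
    for previous, current in zip(seqs, seqs[1:]):
        missing.extend(range(previous + 1, current))
    return missing


if __name__ == "__main__":
    print(missing_sequences(SAMPLE_PING))  # [31, 32, 33, 34, 35, 36, 37, 38]
```

Note that this only finds gaps between replies. Replies that never resume, as after the last line of the log in this report, have to be read from the ping summary or a timeout instead.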

The gateway does not respond to ping either.
The floating IP works from inside the router namespace but not from outside:
stack@padawan-ccp-c1-m1-mgmt:~$ sudo ip netns exec qrouter-b4fd1fd4-1c21-4be3-8672-e085c3e195f9 bash
root@padawan-ccp-c1-m1-mgmt:/home/stack# ping 172.21.11.11
PING 172.21.11.11 (172.21.11.11) 56(84) bytes of data.
64 bytes from 172.21.11.11: icmp_seq=1 ttl=64 time=2.01 ms
64 bytes from 172.21.11.11: icmp_seq=2 ttl=64 time=1.17 ms

Tags:

Sonu (sonu-sudhakaran) wrote on 2016-03-09:

I have noticed this behavior.

Ping stops when you bring the NIC back up on the original master.
When the NIC goes down on the current master, the standby controller takes over ownership of the router, which is expected.
However, the original master does not realize that its data network is down and keeps its mastership of the routers.
This is a split-brain situation: the original master still thinks it is the master of the router, since it is oblivious to the data path connectivity problem, whereas under the VRRP protocol the standby controller has gained mastership and is serving the ping requests.

When the NIC on the original master comes back up, the router on the original master does not go through a state transition (according to the router state known to it, it was always the master). Since there is no state transition, gratuitous ARPs are not sent, and as a result the FIP stops responding to ping.

The split brain can be prevented by health-checking the master. I think the patch https://review.openstack.org/#/c/273546/ tries to solve this.

As a workaround, restart the L3 agent on the master after bringing the NIC back up. Ping resumes because a router failover happens.

Another solution is to have keepalived on the master repeat the GARPs at regular intervals.
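
The sequence described in this comment can be illustrated with a toy model. This is not keepalived's or neutron's actual logic: the classes and names below are invented for illustration, the hosts are shortened to "m1" and "m3", and it assumes that a GARP is sent only on a transition to master and that the original master wins once both nodes can see each other again (which matches the last listing in the bug description).

```python
class Network:
    """Remembers which host the physical network forwards router traffic to."""

    def __init__(self, learned_host):
        self.learned_host = learned_host

    def garp(self, host):
        # A gratuitous ARP makes the network relearn where the router lives.
        self.learned_host = host


class Router:
    """One HA router instance; a toy stand-in, not real VRRP."""

    def __init__(self, host, state):
        self.host = host
        self.state = state  # "master" or "backup"

    def transition(self, new_state, network):
        if new_state == self.state:
            return  # no state change, so nothing is announced
        self.state = new_state
        if new_state == "master":
            network.garp(self.host)


def simulate(periodic_garp=False):
    """Return (host that is master at the end, host the network forwards to)."""
    m1 = Router("m1", "master")
    m3 = Router("m3", "backup")
    net = Network(learned_host="m1")

    # Data NIC on m1 goes down: m3 stops hearing advertisements and takes over.
    m3.transition("master", net)
    # m1 cannot tell its data path is gone and stays master (split brain).
    m1.transition("master", net)

    # NIC comes back: assumed outcome is that m3 yields and m1 stays master.
    m3.transition("backup", net)
    m1.transition("master", net)  # no transition, hence no GARP

    if periodic_garp:
        # Proposed mitigation: the master re-announces itself periodically.
        for router in (m1, m3):
            if router.state == "master":
                net.garp(router.host)

    master = m1.host if m1.state == "master" else m3.host
    return master, net.learned_host


if __name__ == "__main__":
    print(simulate())                    # ('m1', 'm3'): traffic goes to the backup
    print(simulate(periodic_garp=True))  # ('m1', 'm1'): traffic reaches the master
```

In the first run the network still forwards to m3, which is now standby, so the floating IP is unreachable from outside. keepalived has a `garp_master_refresh` setting intended for periodic re-announcement while master; whether and how neutron should set it is not established in this report.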

OpenStack Infra (hudson-openstack) wrote on 2016-03-09: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/290328

Changed in neutron:
assignee:	nobody → Sonu (sonu-sudhakaran)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-03-09: Change abandoned on neutron (master)

Change abandoned by Sonu (<email address hidden>) on branch: master
Review: https://review.openstack.org/290328
Reason: re-introduces the IPv6 related networking bug - https://bugs.launchpad.net/neutron/+bug/1520517

Ihar Hrachyshka (ihar-hrachyshka) on 2016-03-09

Changed in neutron:
importance:	Undecided → Medium
tags:	added: l3-ha
