L3 agent fails on FIP when DVR and HA both enabled in router

Bug #1614337 reported by Jonathan Mills
This bug affects 1 person
Affects: neutron
Status: Invalid
Importance: High
Assigned to: Swaminathan Vasudevan
Milestone: (none)

Bug Description

I have a vlan-based Neutron configuration. My tenant networks are vlans, and my shared external network (br-ex) is a flat network. Neutron is configured for DVR+SNAT mode. In testing floating IPs, I've run into issues with my neutron router, and I've traced it back to a single scenario: when the router is both distributed AND ha. To be clear, I've tested all four possibilities:

"--distributed False --ha False" == works
"--distributed True --ha False" == works
"--distributed False --ha True" == works
"--distributed True --ha True" == fails

* I can reproduce this again and again, just by deleting the router I have (which implies first clearing its gateway, and removing any associated ports), then re-creating the router in any of the four configurations above. Then I boot some VMs, associate a FIP to any one of them, and attempt to reach the FIP. Results are the same whether I create the router on the CLI or from within Horizon.
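
For concreteness, the cycle looks roughly like this (the names are placeholders I've substituted in, and only the failing DVR+HA variant is shown):

# tear down the existing router
neutron router-gateway-clear demo-router
neutron router-interface-delete demo-router demo-subnet
neutron router-delete demo-router

# re-create it in the configuration under test
neutron router-create --distributed True --ha True demo-router
neutron router-interface-add demo-router demo-subnet
neutron router-gateway-set demo-router external-net

# boot a VM, attach a floating IP to its port, and try to reach it
nova boot --flavor m1.small --image centos7 --nic net-id=<demo-net-id> demo-vm
neutron floatingip-create external-net
neutron floatingip-associate <floatingip-id> <vm-port-id>
ping <floating-ip>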

* Expected result is that I should be able to associate a floating IP to a running VM and then ping that floating IP (and ultimately other kinds of activity, such as SSH access to the VM).

* Actual result is that the floating IP is completely unreachable from other valid IPs within the same L2 space. Additionally, in /var/log/neutron/l3-agent.log on the compute node hosting the VM whose associated FIP I can't reach, I find this:

2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '13356ddb-8e36-4f54-b8b2-6a62a5aecf86'
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 440, in _process_router_if_compatible
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self._process_updated_router(router)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 454, in _process_updated_router
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent ri.process(self)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_edge_ha_router.py", line 92, in process
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent super(DvrEdgeHaRouter, self).process(agent)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 488, in process
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent super(DvrLocalRouter, self).process(agent)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_router_base.py", line 30, in process
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent super(DvrRouterBase, self).process(agent)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 386, in process
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent super(HaRouter, self).process(agent)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 385, in call
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self.logger(e)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self.force_reraise()
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent six.reraise(self.type_, self.value, self.tb)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 382, in call
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent return func(*args, **kwargs)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 961, in process
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self._process_internal_ports(agent.pd)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 478, in _process_internal_ports
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self.internal_network_added(p)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_edge_ha_router.py", line 58, in internal_network_added
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent dvr_snat_ns.SNAT_INT_DEV_PREFIX)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 280, in _plug_ha_router_port
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent self._disable_ipv6_addressing_on_interface(interface_name)
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 239, in _disable_ipv6_addressing_on_interface
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 221, in _should_delete_ipv6_lladdr
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent if manager.get_process().active:
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'get_process'
2016-08-17 22:33:25.512 11369 ERROR neutron.agent.l3.agent

* Version
** CentOS 7.2
** Kernel 3.10.0-327.18.2.el7.x86_64
** Mitaka from RDO RPMs, Puppet-managed
** Neutron RPMs:

openstack-neutron-8.1.2-1.el7.noarch
openstack-neutron-common-8.1.2-1.el7.noarch
openstack-neutron-fwaas-8.0.0-3.el7.noarch
openstack-neutron-ml2-8.1.2-1.el7.noarch
openstack-neutron-openvswitch-8.1.2-1.el7.noarch
python-neutron-8.1.2-1.el7.noarch
python-neutronclient-4.1.1-2.el7.noarch
python-neutron-fwaas-8.0.0-3.el7.noarch
python-neutron-lib-0.0.2-1.el7.noarch

* Environment
** 1 controller (running neutron-server, but no other neutron components)
** 2 dedicated network nodes for neutron agents
** N compute nodes running the neutron l3-agent, because of dvr_snat mode

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Swami: any chance you can triage this?

tags: added: l3-dvr-backlog l3-ha
Changed in neutron:
importance: Undecided → High
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Jonathan: any chance you can provide server and l3 agent logs?

tags: added: mitaka-backport-potential
Changed in neutron:
status: New → Confirmed
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

We mostly exercise this path using VXLAN. I am not sure why the VLAN case would be different, but I can imagine that things do not behave exactly as one would expect. If you could try this out on VXLAN to rule out any other potential configuration error, that would be great.

Revision history for this message
Dongcan Ye (hellochosen) wrote :

@Jonathan: Do you mean deleting the router and then creating another one? Or could you describe how to reproduce this in detail and provide more logs?
My environment is also VLAN mode with an HA + DVR router, but I can't reproduce this; FIPs work fine.

Revision history for this message
Jonathan Mills (jbm212) wrote :

@Dongcan: Sorry if I was less than clear. I mean that I'm testing one router configuration at a time. So if I'm going to test with DVR and HA both off, I remove any existing routers, create one for that configuration, and then test. If I then want to test DVR on and HA off, I again remove any existing routers, create one with DVR on and HA off, and test. Wash, rinse, repeat. Is that clearer?

I absolutely believe that HA + DVR is working for you. I'm wondering if my issue is specific to CentOS 7, to the RDO packages, or to my (mis-)configuration. I will try to follow up with some examples, but I may not try to reproduce every possible scenario, as that will take a while.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

@armax: Yes, I will try to look into it and triage it.

Revision history for this message
Jonathan Mills (jbm212) wrote :

Okay, this is an example of how the failure manifests itself when the router has both DVR and HA enabled:

http://paste.openstack.org/show/560821/

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

It seems that the keepalived manager is None, and that's the reason it is failing at https://github.com/openstack/neutron/blob/0f3008dd3251a2b6436ebb7af240fe4c37b0cc42/neutron/agent/l3/ha_router.py#L219
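
A rough paraphrase of the failing check (reconstructed from the traceback above, not copied verbatim from ha_router.py; the exact attribute holding the manager is an assumption):

# neutron/agent/l3/ha_router.py, around line 219-221 (paraphrased)
def _should_delete_ipv6_lladdr(self, ipv6_lladdr):
    manager = self.keepalived_manager  # assumption: only set once the agent has an HA port for this router
    # If this L3 agent never received an HA port for the router, the manager
    # is presumably still None, so the next call raises:
    #   AttributeError: 'NoneType' object has no attribute 'get_process'
    if manager.get_process().active:
        ...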

Revision history for this message
Jonathan Mills (jbm212) wrote :

@Swaminathan: Thanks for the feedback. What can I do about it? I can't help but notice that the node where the FIP succeeds (dscn1066) is one of the nodes participating in the keepalived HA:

# neutron l3-agent-list-hosting-router atadm-router5
+--------------------------------------+----------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+----------------------------------------+----------------+-------+----------+
| 1e4b71a9-bf96-4a3b-9f54-5198d2bbebc4 | dscn1066.openstack.adapt.nccs.nasa.gov | True | :-) | active |
| b83fec5a-2d88-47b8-8f77-12ac88405c3e | dscn1021.openstack.adapt.nccs.nasa.gov | True | :-) | standby |
| 98c6882c-3063-4d16-87cc-c6236fb71a98 | dscn1065 | True | :-) | active |
+--------------------------------------+----------------------------------------+----------------+-------+----------+

All the other nodes, where FIPs are failing, are not participating in this HA arrangement. So this raises the question: is it a communication failure? How do these nodes talk to the keepalived manager? And is it possible iptables is blocking something?

Revision history for this message
Assaf Muller (amuller) wrote :

There must be a difference in how the code initializes the keepalived manager for a regular HA router versus an HA+DVR router. I'm surprised this wasn't caught by functional testing, which suggests the reproduction steps may not be trivial.

@Jonathan, can I ask you to find the simplest / minimal way to reproduce this?

Revision history for this message
Jonathan Mills (jbm212) wrote :

@Assaf: "simplest / minimal way to reproduce this" -- hard for me to know where to begin. I don't run DevStack, I've only tested this on bare metal, across many nodes -- that's my environment. I start by provisioning bare metal with CentOS 7, and installing Mitaka from RDO packages using Packstack. Let's say 1 controller, 1 network node, and 1 more compute node than the number of nodes in the HA/keepalived setup (so, let's say four). You then must modify config files for DVR+SNAT (as packstack won't do that for you). Then I create my internal and external networks and subnets, and create a router with DVR and HA enabled. I boot some number of VMs and assign FIPs, and then try to reach the VMs via FIP. What I seem to be seeing is that it fails most of the time, but not always.

Revision history for this message
Jonathan Mills (jbm212) wrote :

I've deleted the last DVR+HA router I was testing with and created a new one, because I wanted the keepalived instances to land on different nodes and then see whether my observation holds. My theory is that it works fine if the VM lands on a compute node where keepalived is running, and that it fails massively otherwise (and it does). Here are the results:

http://paste.openstack.org/show/560853/

Revision history for this message
Jonathan Mills (jbm212) wrote :

Alright, I've had some time to dig deeper into this, and I think I may understand the situation. Someone please correct me if I'm wrong, but it sure seems like when Neutron is in DVR + HA L3 + SNAT mode, every single nova-compute node needs to have an HA port assignment. If that isn't the case, then you end up with this error in l3-agent.log:

2016-08-18 16:49:26.418 15621 ERROR neutron.agent.l3.ha_router [-] Unable to process HA router 1952e774-ca97-4cdf-a8e9-3dc2c9d3d6c0 without HA port

Does that sound right to everyone? (And I do apologize for my ignorance if this should have been obvious.)

So, here's what's tricky, and where I got tripped up. Maybe it's the default setting, or maybe it's PackStack's default setting, but in /etc/neutron/neutron.conf (on the system running neutron-server) you find this:

# Maximum number of L3 agents which a HA router will be scheduled on. If it is
# set to 0 then the router will be scheduled on every agent. (integer value)
#max_l3_agents_per_router = 3

and even though it's commented out, it must reflect the default, because max_l3_agents_per_router was in fact 3. You can see that clearly in some of my CLI output in earlier posts.

Nevertheless, nova scheduler is happy to place VMs on nova-compute nodes that don't have an HA port, which is what results in the "Unable to process HA router 1952e774-ca97-4cdf-a8e9-3dc2c9d3d6c0 without HA port" error.

I have tested setting max_l3_agents_per_router = 0 in neutron.conf on my neutron-server box, and rebuilding my router. It does in fact result in an HA port assigned to every hypervisor. This results in working FIPs for any number of VMs I've been able to create.
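
For reference, the only change I made was roughly this, followed by restarting neutron-server and re-creating the router:

/etc/neutron/neutron.conf on the neutron-server host:
[DEFAULT]
# 0 = schedule the HA router on every L3 agent
max_l3_agents_per_router = 0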

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Thanks for your input. I have asked Adolfo to provide some input on the impact of 'max_l3_agents_per_router' on HA routers.

Revision history for this message
Assaf Muller (amuller) wrote :

The expected behavior is that this should work perfectly fine with max_l3_agents_per_router set to 2 or 3, and that you don't need an HA port on every node, only on nodes where the L3 agent mode is set to 'dvr_snat', commonly referred to as network nodes. These are the nodes your SNAT traffic (north/south traffic for VMs without a floating IP) goes through, and they are the ones you provide HA for when you create a DVR router with ha=True.

You should be able to set max_l3_agents_per_router back to 3 and check for errors on the nodes where the DVR+HA router was not scheduled.

Revision history for this message
Jonathan Mills (jbm212) wrote :

Assaf:

Thanks for this follow-up. Indeed, my mistake is clear now. On every node running an l3-agent, I had the agent mode set to dvr_snat. I now have dvr_snat set only on the l3 agents on the network nodes; on the compute nodes, my l3 agent mode is simply 'dvr'. I now see the expected behavior with 'max_l3_agents_per_router' set to 3.
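
For anyone else hitting this, the corrected split is roughly the following (paths as used by the RDO packages; double-check against your own layout):

/etc/neutron/l3_agent.ini on the network nodes:
[DEFAULT]
agent_mode = dvr_snat

/etc/neutron/l3_agent.ini on the compute nodes:
[DEFAULT]
agent_mode = dvr

followed by restarting neutron-l3-agent on each node and re-creating the router.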

Thank you all for your assistance with this!

Jonathan

Changed in neutron:
status: Confirmed → Invalid