Subnet without default gateway breaks DHCP agent

Bug #1541490 reported by Logan V
This bug affects 3 people
Affects: networking-calico
Status: In Progress
Importance: Undecided
Assigned to: Andriy Popovych
Milestone: (none)

Bug Description

I am seeing this on Kilo, but based on the conversation on Slack I expect to see the breakage post-Kilo as well.

We have a situation where we have some instances that have 2 NICs. One is a public internet connection where we serve a default route via DHCP.

The second NIC is on a private subnet in 10.0.0.0/8 where we cannot serve a default route, because doing so would break public internet connectivity inside the instance. We use an "extra route" in Neutron to serve the routes we want the instances to have for the private subnets, i.e. destination 10.0.0.0/8, next hop 10.13.64.1.
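For illustration, and assuming the extra route reaches the instance as a DHCP classless static route (option 121, RFC 3442), the route above would be encoded on the wire roughly as follows. The helper name `encode_classless_route` is hypothetical; this is a sketch of the RFC 3442 encoding, not code taken from Neutron or Calico.

```python
import ipaddress


def encode_classless_route(destination, next_hop):
    """Encode one route in the DHCP option 121 (RFC 3442) wire format."""
    network = ipaddress.ip_network(destination)
    prefix_len = network.prefixlen
    # Only the significant octets of the destination are transmitted.
    significant_octets = (prefix_len + 7) // 8
    dest_bytes = network.network_address.packed[:significant_octets]
    hop_bytes = ipaddress.ip_address(next_hop).packed
    return bytes([prefix_len]) + dest_bytes + hop_bytes


# The private route from this report: 10.0.0.0/8 via 10.13.64.1.
encoded = encode_classless_route("10.0.0.0/8", "10.13.64.1")
print(encoded.hex())  # 080a0a0d4001
```

The first byte is the prefix length, followed by one significant destination octet and the four octets of the next hop.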

Neutron supports removing the default route from a subnet; however, this breaks Calico, because Calico expects to bind the gateway address to the ns-* interface for the subnet so that dnsmasq can receive DHCP requests for the subnet. When you do not specify a gateway on a Neutron subnet, neutron-dhcp-agent fails to enable DHCP for the network:

2016-02-03 09:48:14.656 4137648 INFO neutron.agent.dhcp.agent [-] Synchronizing state
2016-02-03 09:48:14.785 4137648 ERROR neutron.agent.dhcp.agent [-] Unable to enable dhcp for 2caf66f9-1625-49f2-b1f9-105845c76fac.
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent Traceback (most recent call last):
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 131, in call_driver
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 208, in enable
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent self.spawn_process()
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 432, in spawn_process
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=False)
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 441, in _spawn_or_reload_process
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent self._output_config_files()
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 461, in _output_config_files
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent self._output_opts_file()
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 705, in _output_opts_file
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent options, subnet_index_map = self._generate_opts_per_subnet()
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent File "/usr/local/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 754, in _generate_opts_per_subnet
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent subnet_dhcp_ip = subnet_to_interface_ip[subnet.id]
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent KeyError: u'5152ed85-042e-4374-9577-c1b061ec5e23'
2016-02-03 09:48:14.785 4137648 TRACE neutron.agent.dhcp.agent
2016-02-03 09:48:14.786 4137648 INFO neutron.agent.dhcp.agent [-] Synchronizing state complete
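
The KeyError at the bottom of the trace is consistent with a lookup table that only has entries for subnets whose address ended up on the DHCP interface. A minimal sketch of that failure mode follows. `Subnet`, `build_subnet_to_interface_ip`, `interface_ips_from_gateways` and the sample data are hypothetical, simplified stand-ins, not the actual Neutron or Calico code.

```python
from collections import namedtuple
import ipaddress

# Hypothetical, simplified stand-in for a Neutron subnet.
Subnet = namedtuple("Subnet", ["id", "cidr", "gateway_ip"])


def build_subnet_to_interface_ip(subnets, interface_ips):
    """Map subnet id -> the interface address that falls inside that subnet."""
    mapping = {}
    for subnet in subnets:
        network = ipaddress.ip_network(subnet.cidr)
        for ip in interface_ips:
            if ipaddress.ip_address(ip) in network:
                mapping[subnet.id] = ip
                break
    return mapping


def interface_ips_from_gateways(subnets):
    """Assumed behaviour: only gateway IPs are bound to the DHCP interface."""
    return [s.gateway_ip for s in subnets if s.gateway_ip is not None]


public = Subnet("public-subnet", "192.0.2.0/24", "192.0.2.1")
private = Subnet("private-subnet", "10.0.0.0/8", None)  # gateway cleared

subnets = [public, private]
mapping = build_subnet_to_interface_ip(
    subnets, interface_ips_from_gateways(subnets))

for subnet in subnets:
    try:
        print(subnet.id, "->", mapping[subnet.id])
    except KeyError as exc:
        # Same shape of failure as in the trace: no interface IP for the subnet.
        print("KeyError for", exc)
```

In this sketch the subnet with the cleared gateway never gets an entry in the mapping, so the per-subnet lookup fails in the same way as `subnet_to_interface_ip[subnet.id]` does in the trace.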

Changed in networking-calico:
assignee: nobody → Aleksey Kasatkin (alekseyk-ru)
Changed in networking-calico:
status: New → Confirmed
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

I'm looking into this on Newton and I see a similar issue. I don't see such traces, but the problem remains the same: a subnet without a gateway cannot be used with DHCP.

Looks like it is by design:

def use_gateway_ips(self):
    ...
    return True

But you can use a host route on the subnet if it suits your needs.

I suppose support for subnets without a gateway can be implemented, but it's not just a matter of changing use_gateway_ips to return False.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-calico (master)

Fix proposed to branch: master
Review: https://review.openstack.org/369310

Changed in networking-calico:
status: Confirmed → In Progress
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

It seems that host routes do not work with Calico at the moment. But it was found that assigning an IP from such a subnet to ns-dhcp is enough to make DHCP operational. The remaining question is how to assign an IP to the DHCP port from the agent.

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

Cannot set fixed_ips for DHCP port in Neutron: https://bugs.launchpad.net/neutron/+bug/1625209
That bug should be fixed first before proceeding with the current issue, or the Calico DHCP agent should be redesigned so that it does not always use gateway IPs (use_gateway_ips == False).

Revision history for this message
Neil Jerram (neil-jerram) wrote :

Hi Aleksey - Thanks for looking at this problem; here are some thoughts from me that I hope will clarify the history.

Yes, it is by design that Calico operation tells the DHCP agent to use the subnet gateway IP for the DHCP port, instead of allocating a new unique IP from Neutron. The historical reasons for this are:

1. We do not want to allocate a unique IP for each (Neutron subnet, DHCP agent) pair, because with Calico there is a DHCP agent on every host. (Whereas with bridged+tunneled networking there are only 1 or 2 DHCP agents per Neutron network.) Allocating a unique IP for each DHCP agent would use up too many IPs when using Calico.

2. We did not realise that there was a use case for having a subnet without a gateway IP.

3. Apart from breaking that use case, in all other ways it works very well to use the gateway IP for the DHCP port.

However, as is now clear from this bug and its duplicate, the use case without a gateway IP is important, and I'd like Calico to support it. Broadly speaking, I'd like to address it by allocating one unique IP per subnet, from Neutron, and using that IP for that subnet on the DHCP port on all compute hosts. But I haven't yet looked at all at the detail of achieving that.
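
To put rough numbers on the three approaches mentioned above (gateway IP reuse, one IP per agent, one IP per subnet), here is a small arithmetic sketch. The function name and the deployment size of 200 compute hosts and 10 subnets are made up for illustration only.

```python
def dhcp_ips_needed(num_subnets, num_hosts, strategy):
    """Count the extra Neutron IPs consumed for DHCP under each strategy."""
    if strategy == "gateway":
        # Current Calico behaviour: reuse the subnet gateway IP everywhere.
        return 0
    if strategy == "per_agent":
        # One unique IP per (subnet, DHCP agent), with an agent on every host.
        return num_subnets * num_hosts
    if strategy == "per_subnet":
        # One unique IP per subnet, shared by the DHCP port on all hosts.
        return num_subnets
    raise ValueError("unknown strategy: %s" % strategy)


# Hypothetical deployment: 200 compute hosts, 10 subnets.
for strategy in ("gateway", "per_agent", "per_subnet"):
    print(strategy, dhcp_ips_needed(10, 200, strategy))
```

With these assumed numbers, per-agent allocation consumes 2000 addresses, while per-subnet allocation consumes 10.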

You also say that host routes are not supported with Calico, but I believe that is now untrue - please see http://git.openstack.org/cgit/openstack/networking-calico/commit/?id=b0f751b86b9eba7ea3deee51fc0d81e0152a9c17.

I hope that's useful; please do let me know your thoughts on all this.

Changed in networking-calico:
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on networking-calico (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/369310
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in networking-calico:
assignee: Aleksey Kasatkin (alekseyk-ru) → Alexander Saprykin (cutwater)
Changed in networking-calico:
status: Confirmed → In Progress
Revision history for this message
Alexander Saprykin (cutwater) wrote :

Hi Neil,

I'm currently working on this bug and have several ideas for how it can be addressed. As you pointed out, creating a DHCP port per agent will waste IP addresses. That can be resolved by creating a single DHCP port with a fixed IP per subnet.
To that end, we can create a single DHCP port per network and have all DHCP agents reuse it.

The main problem here is synchronisation between DHCP agents, to avoid possible race conditions on the port creation operation. There is an option to create a reserved DHCP port in the Neutron agent (with device_owner = "reserved_dhcp_port"), so that it is created by the Neutron server itself and already exists by the time a DHCP agent requires it. But I'm not sure that this is the right way to solve it.
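
As a rough illustration of why having a single party create the port avoids the race, here is a toy sketch. `FakePortStore` is a hypothetical in-memory stand-in for the server side, and the threads stand in for DHCP agents on different hosts; nothing here reflects how Neutron actually implements port creation.

```python
import threading


class FakePortStore:
    """Hypothetical in-memory stand-in for the server's port table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ports = {}
        self.create_count = 0

    def get_or_create_dhcp_port(self, network_id):
        # The check and the insert happen under one lock, so concurrent
        # callers cannot both decide that the port is missing.
        with self._lock:
            port = self._ports.get(network_id)
            if port is None:
                self.create_count += 1
                port = {"network_id": network_id,
                        "device_owner": "reserved_dhcp_port"}
                self._ports[network_id] = port
            return port


store = FakePortStore()
results = []


def agent(host):
    # Each simulated agent asks for the shared port of the same network.
    results.append((host, store.get_or_create_dhcp_port("net-1")))


threads = [threading.Thread(target=agent, args=("host-%d" % i,))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("ports created:", store.create_count)
print("agents served:", len(results))
```

However many agents ask at once, only one port is created and every agent receives the same one. If each agent did its own check-then-create without shared serialization, two agents could both see "no port" and both create one.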

Revision history for this message
Neil Jerram (neil-jerram) wrote :

Hi Alex (and also Logan, since you originally raised this bug)... I'd like to make a couple of points, about the overall use case, and about how we might solve it.

First, regarding the use case - I'd like us to be really clear about what the need is for multiple NICs into the same instance, with Calico networking.

I recently worked on this myself, for a customer, but (a) I think that customer might now be reconsidering whether they really need it, and (b) when I referred to this work on the OpenStack ML [1], I got a surprisingly strong response questioning why it was needed [2].

[1] http://lists.openstack.org/pipermail/openstack/2016-August/017296.html
[2] http://lists.openstack.org/pipermail/openstack/2016-August/017312.html

So, I think we should be careful about doing this work - especially if it is complex - unless we are really sure that it is needed.

Secondly, regarding the possible solution. Your mention of synchronization does indeed sound like the key problem that we have to address, and it may be that using the 'reserved' port could help there. The only other idea in my mind is that we don't necessarily need to create a port that Neutron knows about; it should be sufficient just to reserve an IP address from Neutron, and the port that we use could continue to be as currently faked by the FakePlugin class.

Revision history for this message
Logan V (loganv) wrote :

This is a little snowflakey but here is how I have solved this for my deployment: http://cdn.pasteraw.com/ji09n1p7pdpbfxwcout8j8o0n5vpvve

i.e. I don't push a default gateway over DHCP unless a host route toward 0.0.0.0/0 exists in the host routes for that subnet.
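
The linked paste is not reproduced here, so the following is only a guess at the shape of that rule, written against a plain dict that mimics a Neutron subnet (`gateway_ip`, plus `host_routes` entries with `destination` and `nexthop` keys). The function name and sample subnets are hypothetical.

```python
DEFAULT_DESTINATION = "0.0.0.0/0"


def should_advertise_default_gateway(subnet):
    """Advertise a default gateway only if a 0.0.0.0/0 host route exists."""
    return any(route["destination"] == DEFAULT_DESTINATION
               for route in subnet.get("host_routes", []))


# Public subnet: explicit default host route, so the gateway is advertised.
public_subnet = {
    "gateway_ip": "192.0.2.1",
    "host_routes": [{"destination": "0.0.0.0/0", "nexthop": "192.0.2.1"}],
}

# Private subnet: only the 10.0.0.0/8 route, so no default gateway is pushed.
private_subnet = {
    "gateway_ip": "10.13.64.1",
    "host_routes": [{"destination": "10.0.0.0/8", "nexthop": "10.13.64.1"}],
}

# Plain subnet with a gateway but no host routes at all.
plain_subnet = {"gateway_ip": "198.51.100.1", "host_routes": []}

for name, subnet in [("public", public_subnet),
                     ("private", private_subnet),
                     ("plain", plain_subnet)]:
    print(name, should_advertise_default_gateway(subnet))
```

Note that under this rule a subnet with a gateway but no host routes gets no default gateway either, so it would need an explicit 0.0.0.0/0 host route added.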

Revision history for this message
Neil Jerram (neil-jerram) wrote :

@Logan - I very much like the simplicity of that code change. But doesn't it break the mainline single NIC case? In other words, if the mainline case is for an instance to attach to a single network with a subnet with gateway IP and DHCP enabled, I think your change means that people using that case would also need to add a host route (0.0.0.0/0, <gateway IP>). (And that without that addition, an instance would not be able to route out of its NIC to the Internet.)

Revision history for this message
Logan V (loganv) wrote :

Yes Neil, that is correct. It works great for me, but unfortunately it is not appropriate for general use.

However I do wonder if you even need to use an on-link DHCP IP. Couldn't you just add some RFC1918 loopback to the ns-dhcp interface and receive the broadcasts in dnsmasq? Then you could support the normal "clear gateway" use case in Neutron that is already supported.

Revision history for this message
Alexander Saprykin (cutwater) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/#/c/392645/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-calico (master)

Reviewed: https://review.openstack.org/396071
Committed: https://git.openstack.org/cgit/openstack/networking-calico/commit/?id=f0456addc021ea96b97a29569a0f4bfe626c943d
Submitter: Jenkins
Branch: master

commit f0456addc021ea96b97a29569a0f4bfe626c943d
Author: Alexander Saprykin <email address hidden>
Date: Thu Nov 10 11:13:41 2016 +0100

    Refactoring: Organize imports

    * Imports incompatible between OpenStack versions moved to
      networking_calico.compat module
    * Replace object imports with module imports

    Related-Bug: #1541490
    Change-Id: I8afcc6ddf84fb0705811ad346a46c06d3120eb2e

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/396090
Committed: https://git.openstack.org/cgit/openstack/networking-calico/commit/?id=98deab5d05c4896157bf8efac64bdb6626cd0497
Submitter: Jenkins
Branch: master

commit 98deab5d05c4896157bf8efac64bdb6626cd0497
Author: Alexander Saprykin <email address hidden>
Date: Thu Nov 10 11:38:32 2016 +0100

    Refactoring: Make code decomposition

    * Move code responsible for endpoint creation and update
      to reusable functions `_create_endpoint`, `_update_endpoint`

    Related-Bug: #1541490
    Change-Id: I44d7d271db9e2b48f3a4fa08572efc6712144492

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on networking-calico (master)

Change abandoned by Alexander Saprykin (<email address hidden>) on branch: master
Review: https://review.openstack.org/391239
Reason: Squashed with https://review.openstack.org/#/c/392645/

Changed in networking-calico:
assignee: Alexander Saprykin (cutwater) → Andriy Popovych (popovych-andrey)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Neil Jerram (<email address hidden>) on branch: master
Review: https://review.opendev.org/392645
Reason: Looks like this has been abandoned by the author. Please re-open in case that's wrong.
