Routed provider network - DHCP agent failure

Bug #1782026 reported by Vlad Sorokin
This bug affects 1 person
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Kailun Qin

Bug Description

The DHCP agent on a network segment of a routed provider network reports an error and does not start the dnsmasq process.
======

I have a routed network based on Mellanox VMS L3 (https://community.mellanox.com/docs/DOC-1432) with compute nodes on two segments (actually two racks with two different subnets). I was following the guide on implementing routed provider networks: https://docs.openstack.org/neutron/queens/admin/config-routed-networks.html
======

Rack 1 openvswitch_agent.ini:
[ovs]
bridge_mappings = 40Grack1:br-vlan

Rack 2 openvswitch_agent.ini:
[ovs]
bridge_mappings = 40Grack2:br-vlan
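
For reference, a bridge_mappings value is a comma-separated list of physical_network:bridge pairs. The sketch below is a standalone illustration, not neutron's own parser; the helper name parse_bridge_mappings is made up for this note. It only shows that the two racks expose different physical network labels on the same bridge name, which is what ties each segment to one rack.

```python
def parse_bridge_mappings(value):
    """Split 'physnet:bridge[,physnet:bridge...]' into a dict (illustrative helper)."""
    mappings = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        physnet, sep, bridge = entry.partition(":")
        if not sep or not physnet.strip() or not bridge.strip():
            raise ValueError("malformed mapping: %r" % entry)
        mappings[physnet.strip()] = bridge.strip()
    return mappings


# Values copied from the two openvswitch_agent.ini snippets above.
rack1 = parse_bridge_mappings("40Grack1:br-vlan")
rack2 = parse_bridge_mappings("40Grack2:br-vlan")

# The physical network labels differ per rack even though the bridge name is shared.
labels_are_distinct = set(rack1).isdisjoint(rack2)
```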
======

Reproduction steps:

openstack network create --project proj --provider-physical-network 40Grack1 --provider-network-type vlan --provider-segment 403 vsorokin-VLAN403-net

openstack network segment set --name vsorokin-VLAN403-net-rack1 17cf03cb-0165-46c4-9586-598ca2239c75

openstack subnet create --network vsorokin-VLAN403-net --network-segment vsorokin-VLAN403-net-rack1 --ip-version 4 --subnet-range 10.243.64.0/22 --gateway none --allocation-pool start=10.243.64.2,end=10.243.67.253 --host-route destination=10.243.64.0/18,gateway=10.243.67.254 vsorokin-VLAN403-subnet-rack1

At this point I can see qdhcp-* netns created and dnsmasq process running on Rack 1 node.

openstack network segment create --physical-network 40Grack2 --network-type vlan --segment 403 --network vsorokin-VLAN403-net vsorokin-VLAN403-net-rack2

openstack subnet create --network vsorokin-VLAN403-net --network-segment vsorokin-VLAN403-net-rack2 --ip-version 4 --subnet-range 10.243.68.0/22 --gateway none --allocation-pool start=10.243.68.2,end=10.243.71.253 --host-route destination=10.243.64.0/18,gateway=10.243.71.254 vsorokin-VLAN403-subnet-rack2

That command causes the following error in neutron-dhcp-agent.log on the node in Rack 2 (repeating every 30 seconds):
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent [req-279d513d-652e-46dc-94ab-890d90a13235 - - - - -] Unable to enable dhcp for 99cfc13a-adec-4dc0-baeb-864437829b3d.: KeyError: u'287d9d56-1c0f-4d4b-a5cc-4718efc80436'
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 144, in call_driver
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 219, in enable
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent self.spawn_process()
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 446, in spawn_process
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=False)
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 455, in _spawn_or_reload_process
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent self._output_config_files()
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 499, in _output_config_files
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent self._output_opts_file()
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 872, in _output_opts_file
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent options, subnet_index_map = self._generate_opts_per_subnet()
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 933, in _generate_opts_per_subnet
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent subnet_dhcp_ip = subnet_to_interface_ip[subnet.id]
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent KeyError: u'287d9d56-1c0f-4d4b-a5cc-4718efc80436'
2018-07-16 15:48:03.350 3713 ERROR neutron.agent.dhcp.agent

where 287d9d56-1c0f-4d4b-a5cc-4718efc80436 is the UUID of the subnet in rack 1.

Then, if I restart the DHCP agent in rack 1, I get the same error, this time referring to the UUID of the subnet in rack 2.
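
As a side check, the address plan used in the reproduction steps can be validated with Python's standard ipaddress module. This is only a sanity-check sketch; the dictionary layout and the helper name check_plan are made up here, while the addresses are copied from the commands above. It confirms that the two /22 subnets do not overlap, that both fall inside the 10.243.64.0/18 host-route destination, and that each next hop sits on its own subnet but outside the allocation pool. In other words, the addressing itself does not look like the cause of the error.

```python
import ipaddress

# Destination of the --host-route option used for both subnets.
HOST_ROUTE_DST = ipaddress.ip_network("10.243.64.0/18")

# Values copied from the two "openstack subnet create" commands.
PLANS = {
    "rack1": {"cidr": "10.243.64.0/22",
              "pool": ("10.243.64.2", "10.243.67.253"),
              "next_hop": "10.243.67.254"},
    "rack2": {"cidr": "10.243.68.0/22",
              "pool": ("10.243.68.2", "10.243.71.253"),
              "next_hop": "10.243.71.254"},
}


def check_plan(plan):
    """Return basic consistency facts for one subnet's address plan."""
    cidr = ipaddress.ip_network(plan["cidr"])
    start, end = (ipaddress.ip_address(a) for a in plan["pool"])
    hop = ipaddress.ip_address(plan["next_hop"])
    return {
        "inside_host_route": cidr.subnet_of(HOST_ROUTE_DST),
        "next_hop_on_subnet": hop in cidr,
        "next_hop_outside_pool": not (start <= hop <= end),
    }


results = {name: check_plan(plan) for name, plan in PLANS.items()}
subnets_overlap = ipaddress.ip_network(PLANS["rack1"]["cidr"]).overlaps(
    ipaddress.ip_network(PLANS["rack2"]["cidr"]))
```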

[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron subnet-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+-------------------------------+----------------------------------+-----------------+----------------------------------------------------+
| id | name | tenant_id | cidr | allocation_pools |
+--------------------------------------+-------------------------------+----------------------------------+-----------------+----------------------------------------------------+
| 287d9d56-1c0f-4d4b-a5cc-4718efc80436 | vsorokin-VLAN403-subnet-rack1 | 86d919dd7c984631aefd9dddb828a5bc | 10.243.64.0/22 | {"start": "10.243.64.2", "end": "10.243.67.253"} |
| 657be337-c316-442a-9742-082773714655 | vsorokin-priv-subnet | 86d919dd7c984631aefd9dddb828a5bc | 10.1.1.0/24 | {"start": "10.1.1.2", "end": "10.1.1.254"} |
| 7ca43b64-0be1-4e75-abda-7c9a4f7aa4c2 | vsorokin-VLAN403-subnet-rack2 | 86d919dd7c984631aefd9dddb828a5bc | 10.243.68.0/22 | {"start": "10.243.68.2", "end": "10.243.71.253"} |
| f5eeabde-8ab1-49a6-845b-2df4f860fec1 | public_subnet1 | 862f6b357fb2496ba1350628a8b08657 | 172.31.192.0/18 | {"start": "172.31.240.1", "end": "172.31.240.254"} |
+--------------------------------------+-------------------------------+----------------------------------+-----------------+----------------------------------------------------+
[vsorokin@xnode12-15 ~(keystone_admin)]$

[vsorokin@xnode12-15 ~(keystone_admin)]$ openstack network segment list --network vsorokin-VLAN403-net
+--------------------------------------+----------------------------+--------------------------------------+--------------+---------+
| ID | Name | Network | Network Type | Segment |
+--------------------------------------+----------------------------+--------------------------------------+--------------+---------+
| 17cf03cb-0165-46c4-9586-598ca2239c75 | vsorokin-VLAN403-net-rack1 | 99cfc13a-adec-4dc0-baeb-864437829b3d | vlan | 403 |
| 705ef7dd-7210-46c1-a8ca-da8e02d32d82 | vsorokin-VLAN403-net-rack2 | 99cfc13a-adec-4dc0-baeb-864437829b3d | vlan | 403 |
+--------------------------------------+----------------------------+--------------------------------------+--------------+---------+
[vsorokin@xnode12-15 ~(keystone_admin)]$

[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron agent-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------+-----------------------------+-------------------+-------+----------------+---------------------------+
| id | agent_type | host | availability_zone | alive | admin_state_up | binary |
+--------------------------------------+--------------------+-----------------------------+-------------------+-------+----------------+---------------------------+
| 22536bc5-0876-4c22-8637-66d269514eb1 | DHCP agent | xnode12-16.pub.pic2.ibm.com | nova | :-) | True | neutron-dhcp-agent |
| 5d537319-4d51-43dd-a227-d61913e37c5b | Open vSwitch agent | xnode12-16.pub.pic2.ibm.com | | :-) | True | neutron-openvswitch-agent |
| 8dee041d-87f9-42e9-9866-21a5b7353041 | Metadata agent | tnode2-15 | | :-) | True | neutron-metadata-agent |
| 8ef4aa73-4090-4e3c-a7b5-fa0e6ce26c1e | Metadata agent | hnode1-5 | | :-) | True | neutron-metadata-agent |
| 98cf9262-5271-4a1c-b0a1-f60b9f049c88 | DHCP agent | tnode2-15 | nova | :-) | True | neutron-dhcp-agent |
| b4ea7c48-87f8-40cd-a543-2143d3b7354a | DHCP agent | hnode1-5 | nova | :-) | True | neutron-dhcp-agent |
| c7d263c7-f128-4618-944b-73debea1e670 | Metering agent | xnode12-16.pub.pic2.ibm.com | | :-) | True | neutron-metering-agent |
| ce076f13-0a99-4a40-baa2-7c2f1afb5cff | Open vSwitch agent | hnode1-5 | | :-) | True | neutron-openvswitch-agent |
| e90db66d-9d94-437f-8d13-4d7e63ee04a9 | L3 agent | xnode12-16.pub.pic2.ibm.com | nova | :-) | True | neutron-l3-agent |
| f11864a0-f423-4933-b6e9-74654896b80b | Metadata agent | xnode12-16.pub.pic2.ibm.com | | :-) | True | neutron-metadata-agent |
| f2db4ea0-af33-4499-b388-e9589bd4fe12 | Open vSwitch agent | tnode2-15 | | :-) | True | neutron-openvswitch-agent |
+--------------------------------------+--------------------+-----------------------------+-------------------+-------+----------------+---------------------------+

[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron net-list-on-dhcp-agent b4ea7c48-87f8-40cd-a543-2143d3b7354a
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------------+----------------------------------+-----------------------------------------------------+
| id | name | tenant_id | subnets |
+--------------------------------------+----------------------+----------------------------------+-----------------------------------------------------+
| 99cfc13a-adec-4dc0-baeb-864437829b3d | vsorokin-VLAN403-net | 86d919dd7c984631aefd9dddb828a5bc | 287d9d56-1c0f-4d4b-a5cc-4718efc80436 10.243.64.0/22 |
| | | | 7ca43b64-0be1-4e75-abda-7c9a4f7aa4c2 10.243.68.0/22 |
+--------------------------------------+----------------------+----------------------------------+-----------------------------------------------------+
[vsorokin@xnode12-15 ~(keystone_admin)]$
[vsorokin@xnode12-15 ~(keystone_admin)]$
[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron net-list-on-dhcp-agent 98cf9262-5271-4a1c-b0a1-f60b9f049c88
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------------+----------------------------------+-----------------------------------------------------+
| id | name | tenant_id | subnets |
+--------------------------------------+----------------------+----------------------------------+-----------------------------------------------------+
| 99cfc13a-adec-4dc0-baeb-864437829b3d | vsorokin-VLAN403-net | 86d919dd7c984631aefd9dddb828a5bc | 287d9d56-1c0f-4d4b-a5cc-4718efc80436 10.243.64.0/22 |
| | | | 7ca43b64-0be1-4e75-abda-7c9a4f7aa4c2 10.243.68.0/22 |
+--------------------------------------+----------------------+----------------------------------+-----------------------------------------------------+
[vsorokin@xnode12-15 ~(keystone_admin)]$

As you can see, neutron is trying to make each DHCP agent serve both subnets, which I believe is wrong.

Versions:
OpenStack controller backplane: Queens RDO/CentOS 7.4 x86-64
Nodes hosting openvswitch-agent and dhcp-agent: Queens/Ubuntu 16.04.4 4.13.0-45-generic ppc64le

Kailun Qin (kailun.qin) wrote :

I suppose neutron "net-list-on-dhcp-agent" simply shows the network dict built from the network ID, so the subnets listed with it above do not tell us much in this routed network scenario. We can say that both agents serve the routed network vsorokin-VLAN403-net, which is expected, but from the CLI output alone we cannot tell whether neutron is trying to make each DHCP agent serve both subnets.

Indeed, per the routed network config guide [1], "Unlike conventional provider networks, a DHCP agent cannot support more than one segment within a network". The current neutron implementation should have checked for and prevented a given agent from connecting to multiple segments. When scheduling, it should also have checked that a DHCP agent is scheduled for each segment that has a DHCP-enabled subnet. So the cause of the issue may lie somewhere less obvious.
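
To make that expectation concrete, here is a small sketch of the rule "one segment per DHCP agent, and every segment with a subnet has an agent". It is not neutron scheduler code. The segment and subnet IDs are taken from the CLI output in this report, but the host names and the host-to-segment mapping are hypothetical, since the report does not say which host sits in which rack.

```python
# Segment and subnet IDs are copied from the CLI output in this report.
SEGMENT_RACK1 = "17cf03cb-0165-46c4-9586-598ca2239c75"
SEGMENT_RACK2 = "705ef7dd-7210-46c1-a8ca-da8e02d32d82"

SUBNETS = [
    {"id": "287d9d56-1c0f-4d4b-a5cc-4718efc80436",
     "cidr": "10.243.64.0/22", "segment_id": SEGMENT_RACK1},
    {"id": "7ca43b64-0be1-4e75-abda-7c9a4f7aa4c2",
     "cidr": "10.243.68.0/22", "segment_id": SEGMENT_RACK2},
]

# Hypothetical host names and mapping: which segment each DHCP host can reach.
HOST_SEGMENTS = {
    "rack1-dhcp-host": {SEGMENT_RACK1},
    "rack2-dhcp-host": {SEGMENT_RACK2},
}


def subnets_for_host(host):
    """Return the CIDRs a DHCP agent on this host would be expected to serve."""
    segments = HOST_SEGMENTS.get(host, set())
    if len(segments) > 1:
        # Mirrors the guide's statement: one segment per agent within a network.
        raise ValueError("%s is attached to more than one segment" % host)
    return [s["cidr"] for s in SUBNETS if s["segment_id"] in segments]


def uncovered_segments():
    """Return segments that have a subnet but no DHCP host attached."""
    covered = set().union(*HOST_SEGMENTS.values())
    return {s["segment_id"] for s in SUBNETS} - covered
```

Under these assumptions each host is expected to serve exactly one of the two /22 subnets, and no segment is left without an agent.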

Would you please verify that each IPv4 subnet is associated with at least one DHCP agent? A neutron-server log would also be much appreciated. :)

[1] https://docs.openstack.org/neutron/queens/admin/config-routed-networks.html

Vlad Sorokin (vvsorokin) wrote :

Thank you for the prompt response :)

It's showing "subnets": 0 for both.

[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron dhcp-agent-list-hosting-net vsorokin-VLAN403-net
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+-----------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+-----------+----------------+-------+
| 98cf9262-5271-4a1c-b0a1-f60b9f049c88 | tnode2-15 | True | :-) |
| b4ea7c48-87f8-40cd-a543-2143d3b7354a | hnode1-5 | True | :-) |
+--------------------------------------+-----------+----------------+-------+
[vsorokin@xnode12-15 ~(keystone_admin)]$
[vsorokin@xnode12-15 ~(keystone_admin)]$
[vsorokin@xnode12-15 ~(keystone_admin)]$
[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron agent-show 98cf9262-5271-4a1c-b0a1-f60b9f049c88
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+---------------------+----------------------------------------------------------+
| Field | Value |
+---------------------+----------------------------------------------------------+
| admin_state_up | True |
| agent_type | DHCP agent |
| alive | True |
| availability_zone | nova |
| binary | neutron-dhcp-agent |
| configurations | { |
| | "subnets": 0, |
| | "dhcp_lease_duration": 86400, |
| | "dhcp_driver": "neutron.agent.linux.dhcp.Dnsmasq", |
| | "ports": 0, |
| | "log_agent_heartbeats": false, |
| | "networks": 1 |
| | } |
| created_at | 2018-07-13 18:10:11 |
| description | |
| heartbeat_timestamp | 2018-07-17 19:40:42 |
| host | tnode2-15 |
| id | 98cf9262-5271-4a1c-b0a1-f60b9f049c88 |
| started_at | 2018-07-17 18:41:12 |
| topic | dhcp_agent |
+---------------------+----------------------------------------------------------+
[vsorokin@xnode12-15 ~(keystone_admin)]$ neutron agent-show b4ea7c48-87f8-40cd-a543-2143d3b7354a
neutron CLI is deprecated and will be removed in the future. Use openstack...


Vlad Sorokin (vvsorokin) wrote :

Server log

Kailun Qin (kailun.qin) wrote :

Thanks Vlad for the information.
It looks like a DHCP scheduler related issue to me. I'll look deeper and try to reproduce it in a devstack environment if needed.

Kailun Qin (kailun.qin)
Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)

Kailun Qin (kailun.qin) wrote :

@Vlad
After debugging, I found the root cause: the non-local subnet introduced by the routed network scenario has no interface in the local DHCP agent's namespace. A KeyError is therefore raised when looking up subnet_dhcp_ip, and the host routes fail to be added. This happens when force_metadata or enable_isolated_metadata is set in your neutron DHCP config.
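
The failure mode can be reproduced in isolation with a few lines of Python. This is a simplified model, not the actual code in neutron/agent/linux/dhcp.py and not the merged patch: the function names, the route string format, and the interface address 10.243.68.2 are all made up for illustration. Only the subnet IDs and the name subnet_to_interface_ip come from this report.

```python
from collections import namedtuple

Subnet = namedtuple("Subnet", "id cidr")

# Subnet IDs and CIDRs as listed in this report.
RACK1 = Subnet("287d9d56-1c0f-4d4b-a5cc-4718efc80436", "10.243.64.0/22")
RACK2 = Subnet("7ca43b64-0be1-4e75-abda-7c9a4f7aa4c2", "10.243.68.0/22")

# What an agent in rack 2 might hold: an interface IP for its local subnet only.
# The address is a hypothetical value, not taken from the report.
subnet_to_interface_ip = {RACK2.id: "10.243.68.2"}


def metadata_routes_strict(subnets, interface_ips):
    """Index the mapping directly; fails on a subnet with no local interface."""
    routes = {}
    for subnet in subnets:
        dhcp_ip = interface_ips[subnet.id]  # KeyError for the non-local subnet
        routes[subnet.id] = "169.254.169.254/32,%s" % dhcp_ip
    return routes


def metadata_routes_tolerant(subnets, interface_ips):
    """Skip subnets that have no interface in the local namespace."""
    routes = {}
    for subnet in subnets:
        dhcp_ip = interface_ips.get(subnet.id)
        if dhcp_ip is None:
            continue  # non-local subnet on another segment
        routes[subnet.id] = "169.254.169.254/32,%s" % dhcp_ip
    return routes
```

Calling metadata_routes_strict([RACK1, RACK2], subnet_to_interface_ip) raises a KeyError carrying the rack 1 subnet ID, matching the traceback above, while the tolerant variant returns a route for the rack 2 subnet only.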

The same issue was reported in [1], and a fix has been proposed and released in Rocky [2].

Please kindly have a try. :)

[1] https://bugs.launchpad.net/neutron/+bug/1758952
[2] https://review.openstack.org/#/c/468744

Kailun Qin (kailun.qin) wrote :

Sorry for my copy-paste mistake above. The correct fix should be https://review.openstack.org/#/c/556584/.

Vlad Sorokin (vvsorokin) wrote :

Hi Kailun,
I merged https://review.openstack.org/#/c/556584/ into Neutron stable/queens and gave it a try. I can confirm it fixed my problem.
Sorry for filing a duplicate bug report; Google did not return https://bugs.launchpad.net/neutron/+bug/1758952 :)
Thanks!

Lujin Luo (lujin) wrote :

@Kailun, since you assigned the bug to yourself and have already found a fix for it, are you working on backporting the fix to stable/queens?

Kailun Qin (kailun.qin) wrote :

@Lujin
I am also thinking about backporting the fix to the stable branch, but I saw that the previous bug was classified as medium impact, which does not seem to meet the proactive backport policy.
Would you please let me know whether there is any restriction on this? If not, I'll certainly work on the backport. Many thanks!

Slawek Kaplonski (slaweq) wrote :

@Kailun: You can easily backport fixes for "medium" bugs to stable branches.

Kailun Qin (kailun.qin) wrote :

Thanks Slawek.
Backport patch proposed to stable/queens: https://review.openstack.org/#/c/584264/.
