Neutron DHCP agent sets up wrong ports after the failover

Bug #1499914 reported by Andrey Grebennikov
This bug affects 4 people
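+--------------------+--------------+------------+------------------+-----------+
| Affects            | Status       | Importance | Assigned to      | Milestone |
+--------------------+--------------+------------+------------------+-----------+
| Mirantis OpenStack | Fix Released | High       | Eugene Nikanorov |           |
| 6.1.x              | Fix Released | High       | Eugene Nikanorov |           |
| 7.0.x              | Fix Released | High       | Eugene Nikanorov |           |
| 8.0.x              | Fix Released | High       | Eugene Nikanorov |           |
+--------------------+--------------+------------+------------------+-----------+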

Bug Description

Fuel 6.1 Ubuntu HA.

The tenant network is created and assigned to be handled by 2 controllers.
After one of the controllers is rebooted, the DHCP role is assigned to a new controller, but the DHCP port stays the same. When the rebooted controller comes back, it may again have the namespace configured and may provide DHCP service.

The whole picture may look like this:

root@GGUTTPLDI001:~# neutron dhcp-agent-list-hosting-net d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e
+--------------------------------------+-------------------------------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+-------------------------------+----------------+-------+
| 3ff6ab0c-27c3-4f9d-8a91-a6b83a025a1f | GGUTTPLDI002.ebiz.verizon.com | True | :-) |
| c66bedc8-2575-4ce8-96f0-41a8d9f471b6 | GGUTTPLDI003.ebiz.verizon.com | True | :-) |
+--------------------------------------+-------------------------------+----------------+-------+

DHCP ports:

root@GGUTTPLDI001:~# neutron port-list --network_id=d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e --device_owner=network:dhcp
+--------------------------------------+------+-------------------+--------------------------------------------------------------------------------------+
| id | name | mac_address | fixed_ips |
+--------------------------------------+------+-------------------+--------------------------------------------------------------------------------------+
| 3365c886-803b-4bf3-a117-3dbcdca7ac21 | | fa:16:3e:8f:24:56 | {"subnet_id": "975b250c-941d-4f50-853a-c9bd9ca01a7a", "ip_address": "10.73.205.235"} |
| | | | {"subnet_id": "2012df4e-f059-4dab-be52-30098f63bcd2", "ip_address": "10.73.205.249"} |
| | | | {"subnet_id": "03c9c3e7-7b12-4574-9863-4dba0e32e322", "ip_address": "10.73.205.51"} |
| 4c34478e-84dd-4d41-b6d6-a3e476042e07 | | fa:16:3e:1b:56:94 | {"subnet_id": "975b250c-941d-4f50-853a-c9bd9ca01a7a", "ip_address": "10.73.205.236"} |
| | | | {"subnet_id": "2012df4e-f059-4dab-be52-30098f63bcd2", "ip_address": "10.73.205.250"} |
| | | | {"subnet_id": "03c9c3e7-7b12-4574-9863-4dba0e32e322", "ip_address": "10.73.205.53"} |
+--------------------------------------+------+-------------------+--------------------------------------------------------------------------------------+

Ports description:
root@GGUTTPLDI001:~# neutron port-show 3365c886-803b-4bf3-a117-3dbcdca7ac21|egrep 'host|device_id'
| binding:host_id | GGUTTPLDI002.ebiz.verizon.com |
| device_id | dhcpeb8043a0-2994-5a65-bb20-d8d9abb64aa0-d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e |
root@GGUTTPLDI001:~# neutron port-show 4c34478e-84dd-4d41-b6d6-a3e476042e07|egrep 'host|device_id'
| binding:host_id | GGUTTPLDI001.ebiz.verizon.com |
| device_id | dhcp78bc93dd-f7f4-57c2-9afb-4853793f53f2-d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e |

Both ports have a binding:host_id that does not match the host encoded in the device_id:

>>> import uuid
>>> local_hostname = 'GGUTTPLDI002'
>>> host_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, str(local_hostname))
>>> print(host_uuid)
78bc93dd-f7f4-57c2-9afb-4853793f53f2
>>> local_hostname = 'GGUTTPLDI001'
>>> print(uuid.uuid5(uuid.NAMESPACE_DNS, str(local_hostname)))
065a7b7e-ed52-5b0f-8c10-d22107b4c5a0
>>> local_hostname = 'GGUTTPLDI003'
>>> print(uuid.uuid5(uuid.NAMESPACE_DNS, str(local_hostname)))
eb8043a0-2994-5a65-bb20-d8d9abb64aa0
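
The same comparison can be scripted. This is a minimal sketch, assuming the device_id layout visible in the port-show output above ("dhcp", then the uuid5 of the hostname, then "-" and the network id) and the short hostname as the uuid5 input, as in the session above. The helper names dhcp_device_id, find_device_owner and binding_matches_device are hypothetical and are not part of Neutron.

```python
import uuid


def dhcp_device_id(hostname, network_id):
    """Build a device_id in the layout seen in the port-show output."""
    host_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, str(hostname))
    return "dhcp%s-%s" % (host_uuid, network_id)


def find_device_owner(device_id, network_id, hostnames):
    """Return the hostname whose derived device_id matches, or None."""
    for hostname in hostnames:
        if dhcp_device_id(hostname, network_id) == device_id:
            return hostname
    return None


def binding_matches_device(binding_host, device_id, network_id, hostnames):
    """True if the host in binding:host_id is the one encoded in device_id."""
    owner = find_device_owner(device_id, network_id, hostnames)
    # binding:host_id is an FQDN in the output above, so compare short names.
    return owner is not None and binding_host.split(".")[0] == owner
```

Calling binding_matches_device with a port's binding:host_id, its device_id, the network id and the list of controller hostnames returns False for a port in the state described in this report.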

At the same time the namespaces are inconsistent as well: the same tap interface is present on more than one host.
root@GGUTTPLDI001:~# ip netns exec qdhcp-d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e ip a|grep tap
519: tap4c34478e-84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default
    inet 10.73.205.236/28 brd 10.73.205.239 scope global tap4c34478e-84
    inet 10.73.205.250/28 brd 10.73.205.255 scope global tap4c34478e-84
    inet 10.73.205.53/28 brd 10.73.205.63 scope global tap4c34478e-84

root@GGUTTPLDI002:~# ip netns exec qdhcp-d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e ip a|grep tap
128: tap3365c886-80: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default
    inet 10.73.205.235/28 brd 10.73.205.239 scope global tap3365c886-80
    inet 10.73.205.249/28 brd 10.73.205.255 scope global tap3365c886-80
    inet 10.73.205.51/28 brd 10.73.205.63 scope global tap3365c886-80
148: tap4c34478e-84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default
    inet 10.73.205.236/28 brd 10.73.205.239 scope global tap4c34478e-84
    inet 10.73.205.250/28 brd 10.73.205.255 scope global tap4c34478e-84
    inet 10.73.205.53/28 brd 10.73.205.63 scope global tap4c34478e-84

root@GGUTTPLDI003:~# ip netns exec qdhcp-d3b3e77e-8dfd-4a48-aa1f-a2340cf2ef5e ip a|grep tap
137: tap3365c886-80: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default
    inet 10.73.205.235/28 brd 10.73.205.239 scope global tap3365c886-80
    inet 10.73.205.249/28 brd 10.73.205.255 scope global tap3365c886-80
    inet 10.73.205.51/28 brd 10.73.205.63 scope global tap3365c886-80
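
Spotting the duplicates by eye gets hard with many networks. The following is a minimal sketch that takes the text printed by the `ip a|grep tap` commands above, keyed by host, and reports tap devices seen on more than one host. The function names are hypothetical, and collecting the output from each controller is left out.

```python
import re
from collections import defaultdict

# Matches interface header lines such as "519: tap4c34478e-84: <BROADCAST,...>"
TAP_RE = re.compile(r"^\d+:\s+(tap[0-9a-f-]+):", re.MULTILINE)


def taps_by_host(outputs):
    """Map each tap device name to the sorted list of hosts that report it.

    `outputs` maps a hostname to the text printed on that host.
    """
    hosts_per_tap = defaultdict(set)
    for host, text in outputs.items():
        for tap in TAP_RE.findall(text):
            hosts_per_tap[tap].add(host)
    return {tap: sorted(hosts) for tap, hosts in hosts_per_tap.items()}


def duplicated_taps(outputs):
    """Return only the tap devices that appear on more than one host."""
    return {tap: hosts for tap, hosts in taps_by_host(outputs).items()
            if len(hosts) > 1}
```

For the three listings above, duplicated_taps reports tap4c34478e-84 on GGUTTPLDI001 and GGUTTPLDI002, and tap3365c886-80 on GGUTTPLDI002 and GGUTTPLDI003.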

Andrey Grebennikov (agrebennikov) wrote :
tags: added: ceilometer
Changed in mos:
assignee: nobody → MOS Neutron (mos-neutron)
importance: Undecided → High
status: New → Confirmed
milestone: none → 7.0-updates
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/12480

Eugene Nikanorov (enikanorov) wrote :
tags: added: neutron
removed: ceilometer
tags: added: customer-found
Eugene Nikanorov (enikanorov) wrote :

For 8.0 this bug is most probably not valid. The fix was committed during Liberty, so 8.0 should get it automatically.

Alexander Ignatov (aignatov) wrote :

Moved just to 7.0-updates, not to mu-1, since we are not able to reproduce this issue on the 7.0 code base.

Eugene Nikanorov (enikanorov) wrote :

In fact, the problem exists for 7.0 and 8.0 too.

tags: added: 70mu1-confirmed
Vitaly Sedelnik (vsedelnik) wrote :

Moving to 6.1-mu-4 as the merge window for mu-3 is now closed.

Eugene Nikanorov (enikanorov) wrote :

Further analysis shows that the issue happens when several conditions are met:
1) there is a reserved DHCP port (i.e. a port whose device_id is 'reserved_dhcp_port')
2) multiple DHCP agents are configured to host one network
3) DHCP agents are started simultaneously

Under these conditions DHCP agents first fetch all networks that are scheduled to them and then process them, acquiring their DHCP ports. But they do not check whether a reserved DHCP port has already been acquired and its device_id changed, because the DHCP port is cached during the first fetch.

We need to ask if DHCP port is ok to acquire.
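
The sequence described above can be reproduced with a toy model. This is only an illustrative sketch: FakeServer and Agent are hypothetical stand-ins, not Neutron classes, and the method names merely echo the RPC calls involved. It shows two agents that both decide from their cached copy that the port is still reserved.

```python
RESERVED = "reserved_dhcp_port"


class FakeServer(object):
    """Toy stand-in for the Neutron server holding one reserved port."""

    def __init__(self):
        self.ports = {"port-1": {"device_id": RESERVED}}

    def get_active_networks(self):
        # Agents receive a copy, which they then treat as a cache.
        return {pid: dict(port) for pid, port in self.ports.items()}

    def update_port(self, port_id, device_id):
        # No check of the current owner: the last writer wins.
        self.ports[port_id]["device_id"] = device_id


class Agent(object):
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.cache = {}
        self.owned = []

    def fetch(self):
        self.cache = self.server.get_active_networks()

    def acquire_reserved_ports(self):
        for port_id, port in self.cache.items():
            if port["device_id"] == RESERVED:  # decided from the stale cache
                self.server.update_port(port_id, "dhcp-" + self.name)
                self.owned.append(port_id)


def simulate():
    server = FakeServer()
    a = Agent("agent-a", server)
    b = Agent("agent-b", server)
    a.fetch()
    b.fetch()                    # both agents start at about the same time
    a.acquire_reserved_ports()
    b.acquire_reserved_ports()   # b still sees the port as reserved
    return a.owned, b.owned, server.ports["port-1"]["device_id"]
```

In this model both agents end up believing they own port-1, while the server records only the last writer. That matches the symptom in the report, where one tap interface is configured on two hosts.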

Eugene Nikanorov (enikanorov) wrote :

> We need to ask if DHCP port is ok to acquire.
Should be 'We need to check if DHCP port is ok to update'

Eugene Nikanorov (enikanorov) wrote :
tags: added: support
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13158

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13159

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/13158
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: e3eb7cc5bbebd9a682a6ffb8bc830781d1353491
Author: Eugene Nikanorov <email address hidden>
Date: Thu Oct 29 09:24:48 2015

Cleanup dhcp namespace upon dhcp setup

In some cases when more than 1 DHCP agents were assigned
to a network and then they became dead, their DHCP ports
become reserved. Later, when those agents revive or start
again, they acquire reserved ports, but it's not guaranteed
that they get exactly same ports. In such case DHCP agent
may create interface in the namespaces despite that another
interface already exist. In such case there will be two
hosts with dhcp namespaces each containing duplicate ports,
e.g. one port will be present on two hosts. This breaks
DHCP.

Change-Id: I9daa5585193d2244cf4bea9470a25de3263f4c6b
Closes-Bug: #1499914

tags: removed: 70mu1-confirmed
tags: added: on-verification
Kristina Berezovskaia (kkuznetsova) wrote :

verify on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"
with updates
vlan + neutron, 3 controller nodes and 2 compute nodes

Reproduced on release 7.0 without updates

Steps:
1) Create 50 networks and subnets, boot and delete a VM
2) Disable all agents
3) Mark all DHCP ports as reserved_dhcp_port
mysql
use neutron;
delete from networkdhcpagentbindings;
update ports set device_id='reserved_dhcp_port' where device_owner='network:dhcp';
4) Enable dhcp agents
5) Run ip a in the DHCP namespaces and find the tap interfaces
Expected result: all tap interface IDs are unique
Result on 7.0: some tap interface IDs are duplicated

tags: removed: on-verification
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13928

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/13928
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 26b384075ba8d312b369cd3cc466e4920b121afc
Author: Eugene Nikanorov <email address hidden>
Date: Thu Nov 19 12:28:24 2015

Avoid race condition for reserved DHCP ports

This patch introduces mechanism similar to compare-and-swap
for updating reserved DHCP port.

This addresses a case when two DHCP agents that start nearly at
the same time are assigned to one network and there is a reserved
DHCP port in the network. Then each of agents will try to use it
because agents don't check if reserved port is still available.
Reserved DHCP port can be acquired by different agent between calls to
get_active_networks and update_port, so this patch adds a check for
this case.

Cherry-picked from commit f76ef76f2516dad794818ce56fb15d16437f7314
Change-Id: I0277ab537ff9d3a664c03ea291b9ec2b0e784dbb
Closes-Bug: #1425402
Closes-Bug: #1499914
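
The compare-and-swap idea named in the commit message can be illustrated outside Neutron. This is a minimal sketch of the technique only, not the code of the patch: PortStore, acquire_reserved_port and race are hypothetical names, and an in-memory dict guarded by a lock stands in for the server-side port table.

```python
import threading

RESERVED = "reserved_dhcp_port"


class PortStore(object):
    """Toy in-memory port table with a compare-and-swap style update."""

    def __init__(self, port_ids):
        self._lock = threading.Lock()
        self._device_ids = {pid: RESERVED for pid in port_ids}

    def compare_and_swap(self, port_id, expected, new):
        """Set device_id to `new` only if it still equals `expected`."""
        with self._lock:
            if self._device_ids[port_id] != expected:
                return False
            self._device_ids[port_id] = new
            return True

    def device_id(self, port_id):
        with self._lock:
            return self._device_ids[port_id]


def acquire_reserved_port(store, port_id, agent_device_id):
    """Try to claim a reserved port; False means another agent got it first."""
    return store.compare_and_swap(port_id, RESERVED, agent_device_id)


def race(agent_ids, port_id="port-1"):
    """Let several agents race for one reserved port; return the winners."""
    store = PortStore([port_id])
    winners = []

    def worker(agent_id):
        if acquire_reserved_port(store, port_id, agent_id):
            winners.append(agent_id)

    threads = [threading.Thread(target=worker, args=(a,)) for a in agent_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners, store.device_id(port_id)
```

However the threads interleave, exactly one agent wins and the stored device_id is that agent's. An agent that loses would have to pick another port or create a new one instead of reusing the reserved one.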

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-6.1/2014.2)

Reviewed: https://review.fuel-infra.org/13159
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 7248cd1c66072749b3663f779b2b9cb7eeddf8fe
Author: Eugene Nikanorov <email address hidden>
Date: Thu Nov 26 13:47:57 2015

Cleanup dhcp namespace upon dhcp setup

In some cases when more than 1 DHCP agents were assigned
to a network and then they became dead, their DHCP ports
become reserved. Later, when those agents revive or start
again, they acquire reserved ports, but it's not guaranteed
that they get exactly same ports. In such case DHCP agent
may create interface in the namespaces despite that another
interface already exist. In such case there will be two
hosts with dhcp namespaces each containing duplicate ports,
e.g. one port will be present on two hosts. This breaks
DHCP.

Change-Id: I9daa5585193d2244cf4bea9470a25de3263f4c6b
Closes-Bug: #1499914

Kristina Berezovskaia (kkuznetsova) wrote :

Verify on
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "232"
  build_id: "232"
  fuel-nailgun_sha: "0a2f99530cf7246acf85db643032a0550168aac5"
  python-fuelclient_sha: "b1ffe1cae9ce7b612d3f746c8e2e2fde6f732748"
  fuel-agent_sha: "bd67efbadabfd8242c979c50b7d61a251621621a"
  fuel-nailgun-agent_sha: "a33a58d378c117c0f509b0e7badc6f0910364154"
  astute_sha: "b60624ee2c5f1d6d805619b6c27965a973508da1"
  fuel-library_sha: "8b22d9db4d490cd9beb9261c15e0571bd3b3e7d6"
  fuel-ostf_sha: "a98973482f839554d90cc1c071d625a01e018cfe"
  fuel-createmirror_sha: "3cb98030d4a12992ea1cda1f464f035980569d2f"
  fuelmenu_sha: "fcb15df4fd1a790b17dd78cf675c11c279040941"
  shotgun_sha: "25a0cc461a9fa4f7684f04cef0ff4ad9aa99a64d"
  network-checker_sha: "0b1b94a9685c6471d6911dff7ecac10b7bd2625f"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "2eca6adc33f02e02cd812e1d4be7c70e05fd07db"
(neutron+vxlan, 3 controllers, 2 compute)
Steps from previous verification on updates

tags: added: on-verification
Vadim Rovachev (vrovachev) wrote :

Bug reproduced after applying fix on Ubuntu 6.1.
Steps to reproduce in comment https://bugs.launchpad.net/mos/+bug/1499914/comments/14

tags: removed: on-verification
Vadim Rovachev (vrovachev) wrote :

Bug reproduced after the fix because of bug https://bugs.launchpad.net/mos/+bug/1444978 .
Workaround:
1. On each controller, change the parameter dhcp_delete_namespaces from False to True in the file /etc/neutron/dhcp_agent.ini
2. On any controller, run the command: pcs resource disable clone_p_neutron-dhcp-agent --wait; pcs resource enable clone_p_neutron-dhcp-agent --wait
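
Step 1 of the workaround can be scripted when there are many controllers. The following is a minimal sketch that edits the file contents as text, assuming the option lives in the [DEFAULT] section of dhcp_agent.ini. The function name enable_namespace_cleanup is hypothetical; reading and writing the file and restarting the agents (step 2) are left to the operator.

```python
import re

OPTION = "dhcp_delete_namespaces"


def enable_namespace_cleanup(ini_text):
    """Return ini_text with dhcp_delete_namespaces set to True.

    Only an uncommented assignment is rewritten. If none exists, the option
    is added right after the [DEFAULT] header, or in a new [DEFAULT] section
    at the top if the header is missing (assumed location of the option).
    """
    assignment = re.compile(r"^%s\s*=.*$" % OPTION, re.MULTILINE)
    new_line = "%s = True" % OPTION
    if assignment.search(ini_text):
        return assignment.sub(new_line, ini_text)
    header = re.compile(r"^\[DEFAULT\][ \t]*$", re.MULTILINE)
    match = header.search(ini_text)
    if match:
        end = match.end()
        return ini_text[:end] + "\n" + new_line + ini_text[end:]
    return "[DEFAULT]\n" + new_line + "\n" + ini_text
```

A text-based edit is used here instead of configparser so that comments and the ordering of the existing file are preserved.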

tags: added: on-automation
TatyanaGladysheva (tgladysheva) wrote :
tags: removed: on-automation
tags: added: covered-automated-test
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/12480
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: de4838845ca2de277770ea6c3f8a1c423b7e20cd
Author: Eugene Nikanorov <email address hidden>
Date: Mon May 16 08:21:41 2016

Perform full sync of DHCP agent after its revival

That might be important during rabbitmq failover when networks
could be scheduled/unscheduled from agents several times.

Change-Id: I0c373103e1abacf639f283b3eda3c6ecd6b284ce
Related-Bug: #1499914
Closes-Bug: #1493785

Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-6.1/2014.2)

Change abandoned by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-6.1/2014.2
Review: https://review.fuel-infra.org/12997
