No fdb entries added when failover dhcp and l3 agent together

Bug #1411163 reported by Xiang Hui on 2015-01-15
This bug affects 5 people
Affects             Importance   Assigned to
neutron             Medium       venkata anil
  Kilo              Undecided    Unassigned
neutron (Ubuntu)    Medium       Unassigned
  Trusty            Medium       Xiang Hui

Bug Description

[Env]

OpenStack: icehouse
OS: ubuntu
enable l2 population
enable gre tunnel

[Description]
If the dhcp agent and the l3 agent run on the same host and that host goes down, there is a chance that both are rescheduled together onto the same other host. When that happens, the ovs tunnel is sometimes not created on the newly scheduled host.

[Root cause]
While debugging, we found the following log entry:
2015-01-14 13:44:18.284 9815 INFO neutron.plugins.ml2.drivers.l2pop.db [req-e36fe1fe-a08c-43c9-9d9c-75fe714d6f91 None] query:[<neutron.db.models_v2.Port[object at 7f8d706a3650] {tenant_id=u'ae27091dccf148249349d6396e10f230', id=u'2061f5e4-c4a0-42ae-b611-4fe6c2c5cfbd', name=u'', network_id=u'12ee7040-119a-47bf-a968-67509ebb8eda', mac_address=u'fa:16:3e:b6:20:8e', admin_state_up=True, status=u'ACTIVE', device_id=u'dhcp28f6fc30-af6e-5f44-ae85-dcc1cc074ee5-12ee7040-119a-47bf-a968-67509ebb8eda', device_owner=u'network:dhcp'}>, <neutron.db.models_v2.Port[object at 7f8d706a37d0] {tenant_id=u'ae27091dccf148249349d6396e10f230', id=u'6e99eae8-5c6a-4b8e-b9e1-dbd8d133dfa1', name=u'', network_id=u'12ee7040-119a-47bf-a968-67509ebb8eda', mac_address=u'fa:16:3e:22:56:ba', admin_state_up=True, status=u'ACTIVE', device_id=u'e63a0802-d86d-4a30-95fa-0005a6aef6fb', device_owner=u'network:router_interface'}>]

The log shows that two ACTIVE ports (the dhcp port and the router interface port) can appear in the db together. The l2 pop mech_driver, however, contains this check:
"
if agent_active_ports == 1 or (
                self.get_agent_uptime(agent) < cfg.CONF.l2pop.agent_boot_time):
"
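The fdb entries are added and sent to the agent only when this condition holds, so when two ports are already ACTIVE and the agent has been up longer than agent_boot_time, nothing is sent and the failure occurs.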
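
The effect of that check can be illustrated with a minimal sketch. This is not the driver's code: `should_send_fdb` is a hypothetical stand-in for the quoted condition, and the 180-second value for `agent_boot_time` is an assumption used only for illustration.

```python
# Hypothetical stand-in for the quoted l2pop condition; not neutron code.
AGENT_BOOT_TIME = 180  # seconds; assumed value of cfg.CONF.l2pop.agent_boot_time


def should_send_fdb(agent_active_ports, agent_uptime,
                    agent_boot_time=AGENT_BOOT_TIME):
    """Return True when the quoted condition would let fdb entries be sent."""
    return agent_active_ports == 1 or agent_uptime < agent_boot_time


# First ACTIVE port on a long-running agent: entries are sent.
first_port = should_send_fdb(agent_active_ports=1, agent_uptime=3600)

# dhcp and router ports counted as ACTIVE together on a long-running agent:
# neither branch of the condition holds, so nothing is sent.
two_ports = should_send_fdb(agent_active_ports=2, agent_uptime=3600)

# A freshly started agent gets the entries regardless of the port count.
fresh_agent = should_send_fdb(agent_active_ports=2, agent_uptime=10)
```

In this model only the second case is skipped, which matches the situation in the log above where both ports show up as ACTIVE in the same query.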

---------------------------

[Impact]

This patch addresses an issue in neutron (1:2014.1.5-0ubuntu6) where no tunnel is created between the network node and the compute node during failover testing, which leaves vms unreachable.

[Test Case]
Deploy an OpenStack cloud with trusty-icehouse and neutron HA, then run failover tests on the network nodes. The issue occurs at a high rate.

[Regression Potential]

None.

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
tags: added: l2-pop
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Changed in neutron:
status: Confirmed → In Progress
venkata anil (anil-venkata) wrote :

I see the issue when the dhcp port is created first and the router port is then created on the new host.

In this case, creating the dhcp port should also create the tunnel port, but that does not happen:

When a dhcp server is moved to a new host,
1) network-hostagent binding is updated (properly, i.e. with the new host).
2) dhcp port-hostagent binding is not updated (the dhcp port is still bound to the old host).
If the agent the dhcp port is bound to differs from the new dhcp agent (which now takes care of this dhcp port), the neutron plugin won't notify l2pop, and hence the tunnel is not created.

def update_device_up(self, rpc_context, **kwargs):
        if (host and not plugin.port_bound_to_host(rpc_context,
                                                   port_id, host)):
            LOG.debug("Device %(device)s not bound to the"
                      " agent host %(host)s",
                      {'device': device, 'host': host})
            return

So the root cause is an issue with dhcp, not with l2pop (the l2pop conditions are fine).
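
The early return in the excerpt above can be modelled in a few lines. This is a simplified sketch, not the plugin's code: the function names are borrowed from the excerpt, but their bodies, the `port_bindings` dict, and the port and host names are hypothetical stand-ins.

```python
# Illustrative model only: a dict stands in for the port binding table.
port_bindings = {"dhcp-port": "node3"}  # port id -> bound host (hypothetical)
l2pop_notifications = []                # ports for which l2pop was notified


def port_bound_to_host(port_id, host):
    """Stand-in for plugin.port_bound_to_host()."""
    return port_bindings.get(port_id) == host


def update_device_up(port_id, host):
    """Return True if the port was processed, False if the guard skipped it."""
    if host and not port_bound_to_host(port_id, host):
        # Same early return as in the excerpt: l2pop is never reached.
        return False
    l2pop_notifications.append((port_id, host))
    return True


# After failover the agent on node2 reports the port up,
# but the binding still points at node3.
processed = update_device_up("dhcp-port", "node2")
```

With the stale binding the call returns before any notification is recorded, which is the behaviour described above.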

venkata anil (anil-venkata) wrote :

This issue is easy to reproduce with the following scenario:

1) I have a multi-node devstack setup with one controller (node1) and two network nodes (node2 and node3), each network node running the q-dhcp agent.
2) When I created a network, the dhcp server was spawned on node3, so the dhcp port is bound to node3 (in port show, binding:host_id is node3).
3) I booted a vm on the same network on node1 (the controller node).
4) Then I shut down node3; the dhcp port is created on node2, but no tunnel is created.

So the tunnel is not created when a dhcp port is created on a host other than the one it is bound to.

venkata anil (anil-venkata) wrote :

Change submitted https://review.openstack.org/#/c/197937/ to fix this bug.

Reviewed: https://review.openstack.org/197937
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=571453c614577ffd47e23de9a35863d2f2119583
Submitter: Jenkins
Branch: master

commit 571453c614577ffd47e23de9a35863d2f2119583
Author: venkata anil <email address hidden>
Date: Thu Jul 2 11:22:39 2015 +0000

    Update dhcp host portbinding on failover

    When a dhcp server is moved to new host on failover,
    tunnel is not created.

    When a dhcp server is moved to new host,
    1) network-hostagent binding is updated(properly i.e with new host).
    2) dhcp port-hostagent binding is not updated
       ( dhcp port is still bound to old host)
    If dhcp port-bound agent is different from the new dhcp agent
    (which is now taking care of this dhcp port), neutron plugin won't
     notify the l2pop, and hence tunnel is not created.

    As after failover, the new agent is taking care of this dhcp port,
    update portbinding with the new host. This will allow neutron plugin
    to notify l2pop(which will create tunnel).

    Change-Id: Ib7d7dcddee005395af116ccd31a43853332ae317
    Closes-bug: #1411163
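
The idea in the commit message can be sketched as follows. This is a simplified model under stated assumptions, not the committed change: `rebind_dhcp_port`, the `port_bindings` dict, and the port and host names are hypothetical, and `update_device_up` is reduced to the host check alone.

```python
# Illustrative model only; ids and host names are hypothetical.
port_bindings = {"dhcp-port": "node3"}  # port id -> binding:host_id


def rebind_dhcp_port(port_id, host):
    """Model of the fix: the agent now serving the port becomes its bound host."""
    port_bindings[port_id] = host


def update_device_up(port_id, host):
    """Return True when the plugin would go on to notify l2pop."""
    return port_bindings.get(port_id) == host


before = update_device_up("dhcp-port", "node2")  # stale binding: rejected
rebind_dhcp_port("dhcp-port", "node2")           # failover: bind to the new host
after = update_device_up("dhcp-port", "node2")   # binding matches: l2pop notified
```

Once the binding follows the agent, the host check passes and the plugin can notify l2pop, which then creates the tunnel.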

Changed in neutron:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/211492
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a7b91632fc65ab9d2687298c68b1d715866d0356
Submitter: Jenkins
Branch: feature/pecan

commit 966203f89dee8fe61fb2dce654e36e510e80380f
Author: Sukhdev Kapur <email address hidden>
Date: Wed Jul 1 16:30:44 2015 -0700

    Neutron-Ironic integration patch

    This patch is in preparation for the integration
    of Ironic and Neutron. A new vnic_type is being
    added so that ML2 drivers can filter for all
    Ironic ports based upon match for 'baremetal'.
    Nova/Ironic will set this vnic_type when issuing
    port-create request to neutron.
    (e.g. binding:vnic_type = 'baremetal' )

    Change-Id: I25dc9472b31db052719db503a10c1fb1a55572ef
    Partial-Implements: blueprint neutron-ironic-integration

commit 236e408272bcb9b8e957524864e571b5afdc4623
Author: Oleg Bondarev <email address hidden>
Date: Tue Jul 7 12:02:58 2015 +0300

    DVR: fix router scheduling

    Fix scheduling of DVR routers to not stop scheduling once
    csnat portion was scheduled. See bug report for failing
    scenario.

    This partially reverts
    commit 3794b4a83e68041e24b715135f0ccf09a5631178
    and fixes bug 1374473 by moving csnat scheduling
    after general dvr router scheduling, so double binding does
    not happen.

    Closes-Bug: #1472163
    Related-Bug: #1374473
    Change-Id: I57c06e2be732e47b6cce7c724f6b255ea2d8fa32

commit e152f93878b9bb6af7cfedc9e045892fcf7d0615
Author: Assaf Muller <email address hidden>
Date: Sat Aug 8 21:15:03 2015 +0300

    TESTING.rst love

    Change-Id: I64b569048f8f87ea2fe63d861302b4020d36493d

commit 633c52cca1b383af2c900e1663c8682114acd177
Author: sridhargaddam <email address hidden>
Date: Wed Aug 5 10:49:33 2015 +0000

    Avoid dhcp_release for ipv6 addresses

    dhcp_release is only supported for IPv4 addresses [1] and not for
    IPv6 addresses [2]. There will be no effect when it is called with
    IPv6 address. This patch adds a corresponding note and avoids calling
    dhcp_release for IPv6 addresses.

    [1] http://manpages.ubuntu.com/manpages/trusty/man1/dhcp_release.1.html
    [2] http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2013q2/007084.html

    Change-Id: I8b8316c9d3d011c2a687a3a1e2a4da5cf1b5d604

commit 2de8fad17402f38bbc30204ee2f4f99cf21cb69d
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Aug 10 06:11:06 2015 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I2b423e83a7d0ac8b23239f81fe33dd8382c6fff6

commit fef79dc7b9162e03c8891645494c115b52d4d014
Author: Henry Gessau <email address hidden>
Date: Mon Aug 3 23:30:34 2015 -0400

    Consistent layout and headings for devref

    The lack of convention for heading levels among the independently
    written devref documents was starting to make the Table of Contents
    look rather messy when rendered in HTML.

    This patch does not cover the "Neutron Internals" section since its
    layo...

tags: added: in-feature-pecan
Thierry Carrez (ttx) on 2015-09-03
Changed in neutron:
milestone: none → liberty-3
status: Fix Committed → Fix Released
tags: added: juno-backport-potential kilo-backport-potential

Reviewed: https://review.openstack.org/232911
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9950ce2bbc5a554c211935a6652c304937d39195
Submitter: Jenkins
Branch: stable/kilo

commit 9950ce2bbc5a554c211935a6652c304937d39195
Author: venkata anil <email address hidden>
Date: Thu Jul 2 11:22:39 2015 +0000

    Update dhcp host portbinding on failover

    When a dhcp server is moved to new host on failover,
    tunnel is not created.

    When a dhcp server is moved to new host,
    1) network-hostagent binding is updated(properly i.e with new host).
    2) dhcp port-hostagent binding is not updated
       ( dhcp port is still bound to old host)
    If dhcp port-bound agent is different from the new dhcp agent
    (which is now taking care of this dhcp port), neutron plugin won't
     notify the l2pop, and hence tunnel is not created.

    As after failover, the new agent is taking care of this dhcp port,
    update portbinding with the new host. This will allow neutron plugin
    to notify l2pop(which will create tunnel).

    Change-Id: Ib7d7dcddee005395af116ccd31a43853332ae317
    Closes-bug: #1411163
    (cherry picked from commit 571453c614577ffd47e23de9a35863d2f2119583)

tags: added: in-stable-kilo
Thierry Carrez (ttx) on 2015-10-15
Changed in neutron:
milestone: liberty-3 → 7.0.0
Xiang Hui (xianghui) wrote :
description: updated
Changed in neutron (Ubuntu):
status: New → Invalid
Corey Bryant (corey.bryant) wrote :

Thanks for the patch Hui. I've uploaded this to the trusty review queue and it is awaiting SRU Team review.

Brian Murray (brian-murray) wrote :

Reviewing this SRU, it's not clear why the devel task for neutron is Invalid. Has this been fixed in Zesty? If so, in what release did the fix land? Additionally, a "Regression Potential" of None is frowned upon. Is there really no regression potential? If so, please explain how this is possible.

Changed in neutron (Ubuntu Trusty):
status: New → Incomplete
Xiang Hui (xianghui) wrote :

Reviewing this SRU, it's not clear why the devel task for neutron is Invalid.
  The devel release already contains the fix: it landed in neutron 7.0.0 (Liberty), and later releases such as Mitaka and Newton all include it in the upstream neutron code.

Has this been fixed in Zesty? If so, in what release did the fix land?
  Yes, same as above: Newton in Zesty already includes it in the neutron code.

Additionally, a "Regression Potential" of None is frowned upon. Is there really no regression potential? If so, please explain how this is possible.
  I am just backporting the fix from Liberty to Icehouse, and every newer release already has it. With such a simple fix, I really can't think of any regression potential.

Changed in neutron (Ubuntu Trusty):
status: Incomplete → Fix Committed
assignee: nobody → Xiang Hui (xianghui)

Hello Xiang, or anyone else affected,

Accepted neutron into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/1:2014.1.5-0ubuntu8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Corey Bryant (corey.bryant) wrote :

Regression tested successfully on trusty-proposed.

Changed in neutron (Ubuntu):
status: Invalid → Fix Released
Changed in neutron (Ubuntu Trusty):
importance: Undecided → Medium
Changed in neutron (Ubuntu):
importance: Undecided → Medium
tags: added: verification-done
removed: verification-needed
Brian Murray (brian-murray) wrote :

There was an autopkgtest failure with the version of the package in this SRU.

http://autopkgtest.ubuntu.com/packages/n/neutron/trusty/i386

Please investigate the issue and rerun the test if you think the error is a temporary failure.

Xiang Hui (xianghui) wrote :

For whoever reruns the failed autopkgtest, the steps are below:
1. sudo apt-get install autopkgtest qemu-system qemu-utils genisoimage
2. sudo adt-buildvm-ubuntu-cloud -v -r trusty
3. mkdir /tmp/neutron
4. sudo adt-run neutron -U --apt-pocket=proposed --- qemu adt-trusty-amd64-cloud.img -d -o /tmp/neutron/

Ref: http://packaging.ubuntu.com/html/auto-pkg-test.html#executing-the-test

Corey Bryant (corey.bryant) wrote :

I ran the autopkgtests successfully locally, so I'm rerunning the failing ones via update-excuses.

Corey Bryant (corey.bryant) wrote :

autopkgtests are successful now.

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 1:2014.1.5-0ubuntu8

---------------
neutron (1:2014.1.5-0ubuntu8) trusty; urgency=medium

    * d/p/update_dhcp_host_portbinding_on_failover.patch: Update dhcp
      host portbinding on failover.(LP: #1411163)

 -- Hui Xiang <email address hidden> Fri, 25 Nov 2016 15:59:33 +0800

Changed in neutron (Ubuntu Trusty):
status: Fix Committed → Fix Released