dhcp_agents_per_network > 1 cause conflicts (NACKs) from dnsmasqs (break networks)

Bug #1457900 reported by George Shuklin
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Kevin Benton
Icehouse
Fix Released
Undecided
Unassigned
Juno
Fix Released
Undecided
Unassigned
Kilo
New
Undecided
Unassigned

Bug Description

If neutron was configured to have more than one DHCP agent per network (option dhcp_agents_per_network=2), it causes dnsmasq to reject leases of others dnsmasqs, creating mess and stopping instances to boot normally.

Symptoms:

Cirros (at the log):
Sending discover...
Sending select for 188.42.216.146...
Received DHCP NAK
Usage: /sbin/cirros-dhcpc <up|down>
Sending discover...
Sending select for 188.42.216.146...
Received DHCP NAK
Usage: /sbin/cirros-dhcpc <up|down>
Sending discover...
Sending select for 188.42.216.146...
Received DHCP NAK

Steps to reproduce:
1. Set up neutron with VLANs and dhcp_agents_per_network=2 option in neutron.conf
2. Set up two or more different nodes with enabled neutron-dhcp-agent
3. Create VLAN neutron network with --enable-dhcp option
4. Create instance with that network

Expected behaviour:

Instance recieve IP address via DHCP without problems or delays.

Actual behaviour:

Instance stuck in the network boot for long time.
There are complains about NACKs in the logs of dhcp client.
There are multiple NACKs on tcpdump on interfaces

Additional analysis: It is very complex, so I attach example of two parallel tcpdumps from two dhcp namespaces in HTML format.

Version: 2014.2.3

Revision history for this message
George Shuklin (george-shuklin) wrote :
Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Hi George,

I'm suspecting this change may have caused the issue. https://review.openstack.org/#/c/153182/

Is it possible for you to confirm that this is the issue by removing the line that the change added (--dhcp-authoritative)?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/185332

Changed in neutron:
assignee: venkata anil (anil-venkata) → Kevin Benton (kevinbenton)
status: New → In Progress
Revision history for this message
George Shuklin (george-shuklin) wrote :

I've tested it with removed --dhcp-authoritative from options. It seems to be working (no more NACK's).

Revision history for this message
George Shuklin (george-shuklin) wrote :

I've dug into dnsmasq manual, and I believe thant '--dhcp-authoritative' should not be used in the neutron dhcp agent.

  -K, --dhcp-authoritative
              Should be set when dnsmasq is definitely the only DHCP server on
              a network. For DHCPv4, it changes the behaviour from strict RFC
              compliance so that DHCP requests on unknown leases from unknown
              hosts are not ignored. This allows new hosts to get a lease
              without a tedious timeout under all circumstances. It also
              allows dnsmasq to rebuild its lease database without each client
              needing to reacquire a lease, if the database is lost. For
              DHCPv6 it sets the priority in replies to 255 (the maximum)
              instead of 0 (the minimum).

As we understand, if there is more than one dnsmasq on the network it will violate --dhcp-authoritative expectations.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Yes, unfortunately dhcp-authoritative was added to work around it NAKing clients after it had been restarted or rescheduled.

Revision history for this message
George Shuklin (george-shuklin) wrote :

Well, I've looked around and I must agree, it have not easy solution. May be adding logic to dnsmasq... But otherwise iptables seems to be the single option.

But I have concern: should we apply them on the compute host? I think it should be limited only to outgoing traffic from neutron-dhcp-agent. Reason: Any additional iptables rules on compute will slow down neutron near 'fast path' with tenant traffic.

As far as I understand fix, it applied to tenant ports, not to dhcp-agent ports.

tags: added: l3-dvr-backlog
Revision history for this message
Brian Haley (brian-haley) wrote :

As a data point, does this happen with other images besides Cirros? We've run configs with >2 dhcp agents and can't say we've seen this problem, but I think we're probably booting more Ubuntu images than anything. I realize the NAK is still arriving, but if the client ignores it that would be good to know.

tags: removed: l3-dvr-backlog
tags: added: l3-ipam-dhcp
Revision history for this message
Kevin Benton (kevinbenton) wrote :

I'm partial to this fix now: https://review.openstack.org/#/c/185486/2

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/185332
Reason: Preferential fix is Idc91602bf8c474467e596cbd5cbaa8898952c841

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/186298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/186300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/186308

Revision history for this message
George Shuklin (george-shuklin) wrote :

@Brian Haley, sorry for delay.

This problem was caught on Ubuntu/Centos images, but debugged on Cirros. Ubuntu/Centos do not report NACKs but still follow NACKs and repeat requests. Cirros makes problem more visible (and easier to debug because of the password access).

Changed in neutron:
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/185486
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=98d8ad911d07a20af18edb0cac4bcf141a83d969
Submitter: Jenkins
Branch: master

commit 98d8ad911d07a20af18edb0cac4bcf141a83d969
Author: Kevin Benton <email address hidden>
Date: Mon May 25 18:55:44 2015 -0700

    Persist DHCP leases to a local database

    Due to issues caused by dnsmasq restarts sending DHCPNAKs,
    change Ieff0236670c1403b5d79ad8e50d7574c1b694e34 passed the
    'dhcp-authoritative' option to dnsmasq. While this solved the
    restart issue, it broke the multi-DHCP server scenario because
    the dnsmasq instances will NAK requests to a server ID that
    isn't their own.

    Problem DHCP Request Lifecycle:

    Client: DHCPDISCOVER(broadcast)
    Server1: DHCPOFFER
    Server2: DHCPOFFER
    Client: DHCPREQUEST(broadcast with Server-ID=Server1)
    Server1: DHCPACK
    Server2: DHCPNAK(in response to observed DHCPREQUEST with other Server-ID)
              ^---Causes issues

    This change removes the authoritative option so NAKs are not
    send in response to DHCPREQUEST's to other servers. To handle
    the original issue that Ieff0236670c1403b5d79ad8e50d7574c1b694e34
    was inteded to address, this patch also allows changes to be persisted
    to a local lease file.

    In order to handle the issue where a DHCP server may be scheduled
    to another agent, a fake lease file is generated for dnsmasq to start
    with. The contents are populated based on all of the known ports for
    a network. This should prevent dnsmasq from NAKing clients renewing
    leases issued before it was restarted/rescheduled.

    Closes-Bug: #1457900
    Change-Id: Idc91602bf8c474467e596cbd5cbaa8898952c841

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/186003
Reason: The Kevin's patch is merged, so we don't need that one anymore.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/186300
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c07687c985703609ccc80f0151a2a40e2036a3f4
Submitter: Jenkins
Branch: stable/kilo

commit c07687c985703609ccc80f0151a2a40e2036a3f4
Author: Kevin Benton <email address hidden>
Date: Mon May 25 18:55:44 2015 -0700

    Persist DHCP leases to a local database

    Due to issues caused by dnsmasq restarts sending DHCPNAKs,
    change Ieff0236670c1403b5d79ad8e50d7574c1b694e34 passed the
    'dhcp-authoritative' option to dnsmasq. While this solved the
    restart issue, it broke the multi-DHCP server scenario because
    the dnsmasq instances will NAK requests to a server ID that
    isn't their own.

    Problem DHCP Request Lifecycle:

    Client: DHCPDISCOVER(broadcast)
    Server1: DHCPOFFER
    Server2: DHCPOFFER
    Client: DHCPREQUEST(broadcast with Server-ID=Server1)
    Server1: DHCPACK
    Server2: DHCPNAK(in response to observed DHCPREQUEST with other Server-ID)
              ^---Causes issues

    This change removes the authoritative option so NAKs are not
    send in response to DHCPREQUEST's to other servers. To handle
    the original issue that Ieff0236670c1403b5d79ad8e50d7574c1b694e34
    was inteded to address, this patch also allows changes to be persisted
    to a local lease file.

    In order to handle the issue where a DHCP server may be scheduled
    to another agent, a fake lease file is generated for dnsmasq to start
    with. The contents are populated based on all of the known ports for
    a network. This should prevent dnsmasq from NAKing clients renewing
    leases issued before it was restarted/rescheduled.

    Closes-Bug: #1457900
    Change-Id: Idc91602bf8c474467e596cbd5cbaa8898952c841
    (cherry picked from commit 98d8ad911d07a20af18edb0cac4bcf141a83d969)

tags: added: in-stable-kilo
tags: added: in-stable-juno
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/186298
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=11e41664c2f13c2fb9dab9b42b79adac31e779a1
Submitter: Jenkins
Branch: stable/juno

commit 11e41664c2f13c2fb9dab9b42b79adac31e779a1
Author: Kevin Benton <email address hidden>
Date: Mon May 25 18:55:44 2015 -0700

    Persist DHCP leases to a local database

    Due to issues caused by dnsmasq restarts sending DHCPNAKs,
    change Ieff0236670c1403b5d79ad8e50d7574c1b694e34 passed the
    'dhcp-authoritative' option to dnsmasq. While this solved the
    restart issue, it broke the multi-DHCP server scenario because
    the dnsmasq instances will NAK requests to a server ID that
    isn't their own.

    Problem DHCP Request Lifecycle:

    Client: DHCPDISCOVER(broadcast)
    Server1: DHCPOFFER
    Server2: DHCPOFFER
    Client: DHCPREQUEST(broadcast with Server-ID=Server1)
    Server1: DHCPACK
    Server2: DHCPNAK(in response to observed DHCPREQUEST with other Server-ID)
              ^---Causes issues

    This change removes the authoritative option so NAKs are not
    send in response to DHCPREQUEST's to other servers. To handle
    the original issue that Ieff0236670c1403b5d79ad8e50d7574c1b694e34
    was inteded to address, this patch also allows changes to be persisted
    to a local lease file.

    In order to handle the issue where a DHCP server may be scheduled
    to another agent, a fake lease file is generated for dnsmasq to start
    with. The contents are populated based on all of the known ports for
    a network. This should prevent dnsmasq from NAKing clients renewing
    leases issued before it was restarted/rescheduled.

    Closes-Bug: #1457900
    Change-Id: Idc91602bf8c474467e596cbd5cbaa8898952c841
    (cherry picked from commit 4aba83fd22c8880a2d24f27e43a32cc164abe3e0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/juno)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/185996
Reason: Kevin's patch is merged, now it's not needed to revert.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/186308
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=61571b47c6f2c08509ae22a0587cec8056407b00
Submitter: Jenkins
Branch: stable/icehouse

commit 61571b47c6f2c08509ae22a0587cec8056407b00
Author: Kevin Benton <email address hidden>
Date: Mon May 25 18:55:44 2015 -0700

    Persist DHCP leases to a local database

    Due to issues caused by dnsmasq restarts sending DHCPNAKs,
    change Ieff0236670c1403b5d79ad8e50d7574c1b694e34 passed the
    'dhcp-authoritative' option to dnsmasq. While this solved the
    restart issue, it broke the multi-DHCP server scenario because
    the dnsmasq instances will NAK requests to a server ID that
    isn't their own.

    Problem DHCP Request Lifecycle:

    Client: DHCPDISCOVER(broadcast)
    Server1: DHCPOFFER
    Server2: DHCPOFFER
    Client: DHCPREQUEST(broadcast with Server-ID=Server1)
    Server1: DHCPACK
    Server2: DHCPNAK(in response to observed DHCPREQUEST with other Server-ID)
              ^---Causes issues

    This change removes the authoritative option so NAKs are not
    send in response to DHCPREQUEST's to other servers. To handle
    the original issue that Ieff0236670c1403b5d79ad8e50d7574c1b694e34
    was inteded to address, this patch also allows changes to be persisted
    to a local lease file.

    In order to handle the issue where a DHCP server may be scheduled
    to another agent, a fake lease file is generated for dnsmasq to start
    with. The contents are populated based on all of the known ports for
    a network. This should prevent dnsmasq from NAKing clients renewing
    leases issued before it was restarted/rescheduled.

    Conflicts:
        neutron/tests/unit/test_linux_dhcp.py - Subnet ID had to be added
        to the fake ports so they weren't skipped

    Closes-Bug: #1457900
    Change-Id: Idc91602bf8c474467e596cbd5cbaa8898952c841
    (cherry picked from commit 4aba83fd22c8880a2d24f27e43a32cc164abe3e0)

tags: added: in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/icehouse)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/185971
Reason: Not needed anymore, Kevin's patch is merged.

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196701

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/196701
Reason: This is lacking the functional fix [1], so I'll propose a new merge commit which includes that one.

[1] https://review.openstack.org/#/c/196711/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196920

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/pecan)
Download full text (171.5 KiB)

Reviewed: https://review.openstack.org/196920
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f759c077f8f860c13db92d2ea6b353ef6b70900
Submitter: Jenkins
Branch: feature/pecan

commit 8123144fadd7c5d5e6e56a76ea860512619a2cf6
Author: Moshe Levi <email address hidden>
Date: Sun Jun 28 14:37:14 2015 +0300

    Fix Consolidate sriov agent and driver code

    This patch add mising __init to mech_sriov/mech_driver/
    and update the setup.cfg to the new agent entrypoint

    Trivial Fix

    Change-Id: I53a527081feb78472f496675bbb3c5121d38a14a

commit 8942fccf02e6e179d47582fdb2792a1ca972da21
Author: Assaf Muller <email address hidden>
Date: Mon Jun 29 11:38:51 2015 -0400

    Remove failing SafeFixture tests

    The fixtures 1.3 release attempted to fix the fixtures resource
    leak issue, but failed to do so completely. Our own SafeFixture
    is still needed: The 1.3 release broke our SafeFixture tests,
    but not the usage of SafeFixture itself. This patch removes
    those failing tests for now to unbreak the gate. Jakub reported
    a bug on fixtures 1.3:
    https://bugs.launchpad.net/python-fixtures/+bug/1469759

    We will continue to use SafeFixture until that bug is fixed
    in fixtures, at which point we will be able to require
    fixtures > 1.3.

    Change-Id: I59457c3bb198ff86d5ad55a1e623d008f0034b8f
    Closes-Bug: #1469734

commit 71dffb0a2c1720cd8233a329d32958a0160dd6f5
Author: Kevin Benton <email address hidden>
Date: Mon Jun 29 08:27:41 2015 +0000

    Revert "Removed test_lib module"

    This reverts commit 9a6536de6e1a7fe9b2552adc142e254426b82b6f.

    We pulled all of the plugins out of the tree, many of which still inherit
    from neutron test classes. This change then stated that we no longer
    support testing other plugins. I think this is a bit premature and should
    have been discussed under the subject
    "Neutron plugins can't use neutron plugin unit tests" or something
    similar.

    Change-Id: I68318589f010b731574ea3bfa8df98492bab31fc

commit b20fd81dbd497e058384a0af065dd0f1fdc4c728
Author: Jakub Libosvar <email address hidden>
Date: Fri Jun 5 14:32:51 2015 +0000

    Refactor NetcatTester class

    Following capabilities were added:
       - used transport protocol is passed as a constant instead of bool
       - src port for testing was added
       - connection can be established explicitly
       - change constructor parameters of NetcatTester

    As a part of removing bool for protocol definition
    get_free_namespace_port() was also modified to match the behavior.

    Change-Id: Id2ec322e7f731c05a3754a65411c9a5d8b258126

commit 83e37980dcd0b2bad6d64dd2cb23bcd2891cafca
Author: jingliuqing <email address hidden>
Date: Sat Jun 27 13:41:54 2015 +0800

    Use REST rather than ReST

    Change-Id: I06c9deaab58c5ec13bfeec39fb8fd4b1fe21f42d

commit 1b60df85ba3ad442c2e4e7e52538e1b9a1bf9378
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:34:38 2015 -0700

    Add a double-mock guard to the base test case

    Use mock to patch mock with a check to prevent multiple active
    patches to the...

tags: added: in-feature-pecan
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-1 → 7.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.