DHCP agent scheduler can schedule dnsmasq to an agent without reachability to the network it's supposed to serve

Bug #1478100 reported by Assaf Muller
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Cedric Brandily
Milestone: newton-1

Bug Description

While overlay networks are typically available on every host, flat or VLAN provider networks often are not. Each rack may have access to only a subset of the networks defined in Neutron (determined by the network's physical_network tag). In such a deployment you would install a DHCP agent in every rack, but the DHCP scheduler can schedule a network to the wrong agent, leaving the dnsmasq instance in the wrong rack with no reachability to its VMs.

Here's a diagram to explain the use case:
http://i.imgur.com/NTBxRxk.png
Both networks on physical_network 1 should be served only by DHCP agent 1.
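
As a concrete illustration (hypothetical physnet and bridge names, following the diagram), each rack's OVS agents would only map the physical networks actually cabled into that rack, so a network tagged with physical_network "physnet1" is only reachable from rack 1:

    # Rack 1 nodes -- OVS agent configuration (file name varies by release)
    [ovs]
    bridge_mappings = physnet1:br-physnet1

    # Rack 2 nodes only map the other physical network
    [ovs]
    bridge_mappings = physnet2:br-physnet2

The unpatched scheduler ignores these mappings and may still bind the network to DHCP agent 2.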

More information may be found here:
https://etherpad.openstack.org/p/Network_Segmentation_Usecases
Specifically: "DHCP agents and metadata services are run on nodes within each L2. When the neutron network is created we specifically assign the dhcp agent in that segment to that network".

Tags: l3-ipam-dhcp
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/205631

Changed in neutron:
assignee: nobody → Assaf Muller (amuller)
status: New → In Progress
Assaf Muller (amuller)
description: updated
Kyle Mestery (mestery)
Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: Assaf Muller (amuller) → Cedric Brandily (cbrandily)
tags: added: rfe
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Why the RFE?

tags: added: l3-ipam-dhcp
removed: rfe
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Doesn't this also affect the L3 agent?

  I saw this behaviour in the past when trying to schedule a router: sometimes it was scheduled to hosts where the external network we were trying to connect to was not available, and the admin had to move it manually.

Revision history for this message
Cedric Brandily (cbrandily) wrote :

@armax: this bug seems more a feature request than a real bug (at least to me/Gary/Assaf); that's why I added the rfe tag

@ajo: yes, but I would prefer a specific RFE for the router scheduler as it's a bit more complex: DHCP agents are scheduled once, whereas routers would need to be rescheduled(?) every time an internal/external interface is added in order to find an l3-agent satisfying the associated requirements. And what should we do if we find no valid l3-agent: undeploy the router, or do nothing?

Revision history for this message
Miguel Lavalle (minsel) wrote :

ZZelle and amuller working on fix: https://review.openstack.org/#/c/205631/

Revision history for this message
Cedric Brandily (cbrandily) wrote :

This bug also affects deployments that dedicate dhcp-agents to a specific physical network (typically one providing a specific service) in order to avoid mixing dhcps related to different services (for isolation/configuration purposes).

Changed in neutron:
milestone: none → mitaka-rc1
Changed in neutron:
milestone: mitaka-rc1 → newton-1
tags: added: mitaka-rc-potential
tags: removed: mitaka-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/205631
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0267c6a5acdcb68ea7e83ecb980362c4235ed1d7
Submitter: Jenkins
Branch: master

commit 0267c6a5acdcb68ea7e83ecb980362c4235ed1d7
Author: Assaf Muller <email address hidden>
Date: Thu Jul 23 18:14:35 2015 -0400

    Make DHCP agent scheduler physical_network aware

    Currently the neutron DHCP scheduler assumes that every server running
    a dhcp-agent can reach every network. As a result the scheduler can
    wrongly schedule a VLAN network on a dhcp-agent that has no reachability
    to the network it's supposed to serve (ex: the network's
    physical_network is not supported on that host).

    Such a use case typically arises when:

    * physical_networks are dedicated to a specific service and we don't
      want to mix dnsmasqs related to different services (for
      isolation/configuration purposes),
    * physical_networks are dedicated to a specific rack (see the example
      diagram http://i.imgur.com/NTBxRxk.png); the rack interconnection can
      be handled outside of neutron, or inside it once routed networks are
      supported.

    This change makes the DHCP scheduler network-reachability aware by
    querying the plugin's filter_hosts_with_network_access method.

    This change provides an implementation for the ML2 plugin delegating
    host filtering to its mechanism drivers: it aggregates the filtering
    done by each mechanism, or disables filtering if any mechanism doesn't
    overload the default mechanism implementation[1] (for backward
    compatibility with out-of-tree mechanisms). Every in-tree mechanism
    overloads the default implementation: the OVS/LB/SRIOV mechanisms use
    their agent mappings to filter hosts, while the l2pop/test/logger ones
    return an empty set (they provide no "L2 capability").

    This change also provides a default implementation[2] for other
    plugins which filters nothing (for backward compatibility); plugins
    can overload it to provide their own implementation.

    Such host filtering has a limitation when a dhcp-agent runs on a host
    handled by multiple L2 mechanisms, with one mechanism claiming network
    reachability but a different one handling the dhcp-agent's ports: the
    host can reach the network, but the dhcp-agent's ports cannot. This
    limitation will be handled in a follow-up change using host+vif_type
    filtering.

    [1] neutron.plugin.ml2.driver_api.MechanismDriver.\
          filter_hosts_with_network_access
    [2] neutron.db.agents_db.AgentDbMixin.filter_hosts_with_network_access

    Closes-Bug: #1478100
    Co-Authored-By: Cedric Brandily <email address hidden>
    Change-Id: I0501d47404c8adbec4bccb84ac5980e045da68b3
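
For readers who want the shape of the mechanism without digging into the review, here is a minimal sketch of the aggregation behaviour the commit message describes. The classes below are simplified stand-ins, not the actual neutron code:

    # Minimal sketch of the host filtering described above; all classes
    # here are hypothetical stand-ins, not neutron's implementation.
    class MechanismDriver(object):
        def filter_hosts_with_network_access(self, context, candidate_hosts):
            # Default implementation [1]: drivers that don't overload it
            # can't express reachability (out-of-tree compatibility).
            return candidate_hosts

    class AgentBasedDriver(MechanismDriver):
        """OVS/LB/SRIOV-style: filter using the driver's agent mapping."""
        def __init__(self, hosts_with_agents):
            self.hosts_with_agents = set(hosts_with_agents)

        def filter_hosts_with_network_access(self, context, candidate_hosts):
            return self.hosts_with_agents & set(candidate_hosts)

    class NoL2CapabilityDriver(MechanismDriver):
        """l2pop/test/logger-style: provides no L2 capability."""
        def filter_hosts_with_network_access(self, context, candidate_hosts):
            return set()

    def filter_hosts_with_network_access(drivers, context, candidate_hosts):
        # Aggregate per-mechanism filtering: a host is kept if any
        # mechanism can reach the network. If any driver still uses the
        # default implementation, disable filtering entirely for backward
        # compatibility with out-of-tree mechanisms.
        default = MechanismDriver.filter_hosts_with_network_access
        if any(type(d).filter_hosts_with_network_access is default
               for d in drivers):
            return set(candidate_hosts)
        reachable = set()
        for d in drivers:
            reachable |= set(
                d.filter_hosts_with_network_access(context, candidate_hosts))
        return reachable

The DHCP scheduler then restricts its candidate agents to those whose host survives this filter before binding the network.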

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential