Wrong behavior when a DHCP agent goes down

Bug #1135948 reported by Roman Prykhodchenko
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
neutron   Fix Released   Medium       Roman Podoliaka

Bug Description

Steps to reproduce:
1. Set up a DHCP agent
2. Create a network with a subnet and ensure the subnet is hosted by the DHCP agent mentioned above
3. Optional: create an instance attached to the created network.
4. Stop the DHCP agent.
5. If an instance was launched in step 3, terminate it.
6. Delete the network.

What happens at this point:
1. Because the agent was down, the dnsmasq instance that was serving the network is still up and running, and the configuration files for the deleted network still exist.
2. No error was reported, either in the web UI or in the quantum log.

After restarting the DHCP agent the garbage won't get wiped because its message queue was destroyed.
There's no way of being informed about the error from the user side other than setting up an external monitoring environment.

There's also no way of cleaning up the garbage other than doing some manual labor on the server side.

Tags: dhcp scheduler
description: updated
Changed in quantum:
status: New → Triaged
Salvatore Orlando (salvatore-orlando) wrote :

It seems the initial synchronization after the dhcp agent restart is not properly detecting deleted networks.
Detection of added networks works fine.

This is not really problematic when namespaces are enabled, but it will cause errors when another subnet tries to reuse the same CIDR as the deleted one:

Stderr: '\ndnsmasq: failed to create listening socket for 10.0.11.2: Address already in use\n' execute /usr/local/lib/python2.7/dist-packages/quantum-2013.1.a67.gc507251-py2.7.egg/quantum/agent/linux/utils.py:59
2013-02-28 18:37:38.993 16352 ERROR quantum.agent.dhcp_agent [-] Unable to enable dhcp.
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent Traceback (most recent call last):
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent File "/usr/local/lib/python2.7/dist-packages/quantum-2013.1.a67.gc507251-py2.7.egg/quantum/agent/dhcp_agent.py", line 109, in call_driver
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent getattr(driver, action)()
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent File "/usr/local/lib/python2.7/dist-packages/quantum-2013.1.a67.gc507251-py2.7.egg/quantum/agent/linux/dhcp.py", line 114, in enable
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent self.spawn_process()
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent File "/usr/local/lib/python2.7/dist-packages/quantum-2013.1.a67.gc507251-py2.7.egg/quantum/agent/linux/dhcp.py", line 264, in spawn_process
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent utils.execute(cmd, self.root_helper)
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent File "/usr/local/lib/python2.7/dist-packages/quantum-2013.1.a67.gc507251-py2.7.egg/quantum/agent/linux/utils.py", line 61, in execute
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent raise RuntimeError(m)
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent RuntimeError:
2013-02-28 18:37:38.993 16352 TRACE quantum.agent.dhcp_agent Command: ['sudo', '/usr/local/bin/quantum-rootwrap', '/etc/quantum/rootwrap.conf', 'QUANTUM_RELAY_SOCKET_PATH=/opt/stack/data/quantum/dhcp/lease_relay', 'QUANTUM_NETWORK_ID=496ded90-e6c2-4a1c-be07-d6aeb5108927', 'dnsmasq', '--no-hosts', '--no-resolv', '--strict-order', '--bind-interfaces', '--interface=tap0406faf9-27', '--except-interface=lo', '--pid-file=/opt/stack/data/quantum/dhcp/496ded90-e6c2-4a1c-be07-d6aeb5108927/pid', '--dhcp-hostsfile=/opt/stack/data/quantum/dhcp/496ded90-e6c2-4a1c-be07-d6aeb5108927/host', '--dhcp-optsfile=/opt/stack/data/quantum/dhcp/496ded90-e6c2-4a1c-be07-d6aeb5108927/opts', '--dhcp-script=/opt/stack/quantum/bin/quantum-dhcp-agent-dnsmasq-lease-update', '--leasefile-ro', '--dhcp-range=set:tag0,10.0.11.0,static,120s', '--conf-file=', '--domain=openstacklocal']

And of course, it will waste resources, which is never good.
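
The "Address already in use" failure in the trace above is what the operating system reports when a second process tries to bind an address that a leftover process still holds. The following is a minimal sketch of that mechanism only, not of dnsmasq itself: it assumes a loopback UDP socket as a stand-in for the stale listener, and the names `stale`, `fresh` and `bind_errno` are illustrative.

```python
import errno
import socket

# First socket plays the role of the stale dnsmasq still holding the address.
stale = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
stale.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
addr = stale.getsockname()

# Second socket plays the role of the new dnsmasq started for the reused CIDR.
fresh = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    fresh.bind(addr)  # same address and port, no SO_REUSEADDR set
    bind_errno = None
except OSError as exc:
    bind_errno = exc.errno
finally:
    fresh.close()
    stale.close()

# True when the second bind was rejected because the address is taken.
print(bind_errno == errno.EADDRINUSE)
```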

Changed in quantum:
importance: Undecided → Medium
Phani Achanta (phani-achanta) wrote :

The DHCP agent has no memory across restarts.
dhcpagent.NetworkCache is an in-memory, agent-side cache.
sync_state deletes networks that are in the agent's known networks (NetworkCache) but not in the networks known to the server (fetched over RPC).
On a cold start the set of known networks is empty, so nothing is deleted.
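
The comparison described above can be sketched as a set difference. This is only an illustration of the reasoning, not the agent's actual code: the function name `networks_to_disable` and the sample ids are hypothetical.

```python
def networks_to_disable(cached_ids, server_ids):
    """Return ids the agent knows about that the server no longer reports."""
    return set(cached_ids) - set(server_ids)


# Ids reported by the server after "net-b" was deleted (sample data).
server_ids = {"net-a"}

# Warm agent: the in-memory cache still remembers net-b, so it is cleaned up.
warm = networks_to_disable({"net-a", "net-b"}, server_ids)

# Cold start: the cache is empty, so nothing is selected for cleanup.
cold = networks_to_disable(set(), server_ids)

print(sorted(warm), sorted(cold))
```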

Options:
1. Make NetworkCache persistent on disk.
2. On restart, have sync_state also query the DHCP driver for all the networks it knows about (every directory under the DHCP state directory is named after a network UUID and contains a host/interface file). The known networks will then not be empty if networks were up before the restart.
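
Option 2 could look roughly like the sketch below, which scans a state directory for UUID-named subdirectories that contain a `host` file. The function name `existing_network_ids` is hypothetical, the layout is assumed from the paths in the trace above, and the demo runs against a temporary directory rather than a real DHCP state path.

```python
import os
import tempfile
import uuid


def existing_network_ids(state_dir):
    """List UUID-named subdirectories of state_dir that contain a 'host' file."""
    found = []
    for name in sorted(os.listdir(state_dir)):
        path = os.path.join(state_dir, name)
        try:
            uuid.UUID(name)
        except ValueError:
            continue  # not named after a network id, e.g. 'lease_relay'
        if os.path.isdir(path) and os.path.isfile(os.path.join(path, "host")):
            found.append(name)
    return found


# Demo against a throwaway directory instead of a real DHCP state path.
with tempfile.TemporaryDirectory() as state_dir:
    net_id = "496ded90-e6c2-4a1c-be07-d6aeb5108927"
    os.mkdir(os.path.join(state_dir, net_id))
    open(os.path.join(state_dir, net_id, "host"), "w").close()
    open(os.path.join(state_dir, "lease_relay"), "w").close()
    discovered = existing_network_ids(state_dir)

print(discovered)
```

The returned ids would then be used to fill the known-networks set before the first comparison with the server.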

Changed in quantum:
assignee: nobody → Roman Podolyaka (rpodolyaka)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/23704

Changed in quantum:
status: Triaged → In Progress
dan wendlandt (danwent)
Changed in quantum:
milestone: none → grizzly-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/23704
Committed: http://github.com/openstack/quantum/commit/bb43a075f79fde8e57019671aab62e14d0bb2741
Submitter: Jenkins
Branch: master

commit bb43a075f79fde8e57019671aab62e14d0bb2741
Author: Roman Podolyaka <email address hidden>
Date: Tue Mar 5 18:53:51 2013 +0200

    Fix detection of deleted networks in DHCP agent.

    The DHCP-agent uses an in-memory networks cache to find out which networks must
    be deleted and which ones must be updated. In a case of agent restart the networks
    cache is empty and it's not possible to cleanup DHCP-processes serving networks
    which were deleted while the DHCP-agent was down. The proposed fix fills the networks
    cache when the agent starts using a list of networks which have existing config files.

    Fixes: bug #1135948
    Change-Id: I27758389755cd19bca9fdbeda9cc1a123370f527
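
The effect of the fix described in the commit message can be modeled with a toy class. This is a sketch under stated assumptions, not the merged change: `AgentSketch`, its attributes and the sample ids are hypothetical, and appending to a list stands in for stopping a DHCP process.

```python
class AgentSketch:
    """Toy model of an agent whose cache is seeded before the first sync."""

    def __init__(self, ids_with_config_files):
        # Seed the cache from networks that still have config files on disk,
        # rather than starting empty after a restart.
        self.cache = set(ids_with_config_files)
        self.disabled = []

    def sync_state(self, active_ids):
        active = set(active_ids)
        # Anything cached but no longer active was deleted while the agent was down.
        for net_id in sorted(self.cache - active):
            self.disabled.append(net_id)  # stand-in for stopping the DHCP process
        self.cache = active


agent = AgentSketch(["net-a", "net-b"])  # both left config files behind
agent.sync_state(["net-a"])  # the server no longer reports net-b
print(agent.disabled)
```

With an empty seed list, as in the pre-fix cold start, `agent.disabled` would stay empty and the leftover process for the deleted network would keep running.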

Changed in quantum:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in quantum:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in quantum:
milestone: grizzly-rc1 → 2013.1