neutron

If neutron spawned dnsmasq dies, neutron-dhcp-agent will be totally unaware

Bug #1257524 reported by Clint Byrum on 2013-12-03

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	Low	Miguel Angel Ajo

Bug Description

I recently hit a dnsmasq bug that made it segfault in certain situations. The crash itself was no doubt dnsmasq's fault, but it was quite troubling that Neutron never noticed that dnsmasq had stopped working. This is because dnsmasq is spawned as a daemon, even though it is most definitely "owned" by neutron-dhcp-agent. Also, if neutron-dhcp-agent dies, dnsmasq keeps running because it is a daemon, and it becomes "stale", requiring manual intervention to clean up. If dnsmasq ran in the foreground instead, it would stay in neutron-dhcp-agent's process group, so it should die along with the agent and, if need be, get cleaned up by init.

I did some analysis but will not be able to dig into the actual implementation. However, my analysis suggests that this would work:

* use utils.create_process instead of execute and remember returned Popen object.
* spawn a greenthread to wait() on the process or create a SIGCHLD handler
* if it dies, restart it and log the error code
* pass the -k option so dnsmasq stays in foreground
* kill the process using child signals

Not sure how, or if, SIGCHLD factors in.
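
The steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not neutron code: it calls the standard library's subprocess.Popen directly instead of neutron's utils.create_process, uses a plain thread where the proposal mentions a greenthread, and the SupervisedProcess class and its max_restarts limit are hypothetical names. For dnsmasq, the command line would also need -k so the process stays in the foreground.

```python
import logging
import subprocess
import threading

LOG = logging.getLogger(__name__)


class SupervisedProcess:
    """Keep the Popen handle of a foreground child and restart it on exit.

    Hypothetical helper for illustration only.
    """

    def __init__(self, cmd, max_restarts=3):
        self.cmd = list(cmd)
        self.max_restarts = max_restarts  # hypothetical retry limit
        self.exit_codes = []              # exit codes seen so far
        self.proc = None
        self._stopping = False
        self._lock = threading.Lock()
        self._watcher = None

    def start(self):
        # Remember the Popen object so the child can be waited on and killed.
        self.proc = subprocess.Popen(self.cmd)
        self._watcher = threading.Thread(target=self._watch, daemon=True)
        self._watcher.start()

    def _watch(self):
        while True:
            code = self.proc.wait()  # blocks until the child exits
            with self._lock:
                if self._stopping:
                    return
                self.exit_codes.append(code)
                LOG.error("%s exited with code %s", self.cmd[0], code)
                if len(self.exit_codes) > self.max_restarts:
                    LOG.error("restart limit reached, giving up")
                    return
                self.proc = subprocess.Popen(self.cmd)

    def stop(self):
        # Signal the child directly through the stored handle.
        with self._lock:
            self._stopping = True
            if self.proc.poll() is None:
                self.proc.terminate()
        self._watcher.join()
```

Blocking in wait() on the stored handle avoids installing a SIGCHLD handler at all, which is one possible answer to the open question about SIGCHLD.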

Jian Wen (wenjianhn) wrote on 2013-12-04:

related to bug 1244783

Clint Byrum (clint-fewbar) on 2013-12-04

description:	updated

Jian Wen (wenjianhn) on 2013-12-04

Changed in neutron:
assignee:	nobody → Jian Wen (wenjianhn)
status:	New → In Progress

yong sheng gong (gongysh) wrote on 2013-12-04:

I think we don't want to kill dnsmasq if the dhcp-agent dies by design.

Clint Byrum (clint-fewbar) wrote on 2013-12-04: Re: [Bug 1257524] Re: If neutron spawned dnsmasq dies, neutron-dhcp-agent will be totally unaware

Excerpts from yong sheng gong's message of 2013-12-04 02:32:13 UTC:
> I think we don't want to kill dnsmasq if the dhcp-agent dies by design.
>

I'd be interested in hearing the reasons for that and making sure they are
encoded somewhere in the documentation. I would prefer that the dnsmasq
go away with the agent, as a dnsmasq without an agent is a dnsmasq that
is disseminating outdated information.

Jian Wen (wenjianhn) wrote on 2013-12-04:

Generally a network is hosted by multiple DHCP agents, so it's OK if
one of the dnsmasq processes is killed.

If a user updates a port's fixed IP, the instance may get the old IP
address from the stale dnsmasq process.

Miguel Angel Ajo (mangelajo) wrote on 2014-01-27:

I hit the same problem. We need some kind of solution for HA environments, and to make neutron agents more robust.

We have several different things here:

1) If a dnsmasq (or neutron-metadata-proxy) dies, we might want to log it and try restarting it (with a retry limit):
a) because, if dnsmasq dies (error in dnsmasq or system problem), neutron needs to be aware that this tenant network has no DHCP;
b) because we want the process up again serving DHCP.

2) If we hit the respawn-retry limit, we might want to force the agent to die (not everybody will want this, so I propose making it a setting). Why?

c) If respawning proves impossible, this neutron-dhcp-agent becomes useless for at least one tenant network. If we are using some kind of HA manager on top of neutron (pacemaker, etc.), we want that tool to become aware of the situation and respawn the agent somewhere else, or even reboot the offending host.

3) If we force neutron-dhcp-agent to stop (as opposed to restart), we might want dnsmasq to be killed and the network namespaces cleaned up.
This could be done by netns_cleanup_util.py, but at the moment that tool cannot distinguish between qrouter- (l3 agent) and qdhcp- (dhcp agent) namespaces.

https://github.com/openstack/neutron/blob/master/neutron/agent/netns_cleanup_util.py
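
To illustrate the distinction point 3 asks for, here is a minimal sketch that sorts namespace names by the two prefixes mentioned above. It is not how netns_cleanup_util.py works; the function names and the "dhcp"/"l3" labels are hypothetical, and the input is assumed to be a plain list of namespace names such as the output of `ip netns list`.

```python
# Prefixes taken from the comment above; the agent labels are hypothetical.
NAMESPACE_PREFIXES = {
    "dhcp": "qdhcp-",
    "l3": "qrouter-",
}


def namespace_owner(name):
    """Return 'dhcp', 'l3', or None depending on the namespace prefix."""
    for agent, prefix in NAMESPACE_PREFIXES.items():
        if name.startswith(prefix):
            return agent
    return None


def namespaces_for(agent, names):
    """Keep only the namespaces that belong to the given agent type."""
    return [n for n in names if namespace_owner(n) == agent]
```

A cleanup run limited to the DHCP agent would then only touch the names returned by namespaces_for("dhcp", names) and leave the router namespaces alone.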

Miguel Angel Ajo (mangelajo) wrote on 2014-02-07:

This blueprint is related:

https://blueprints.launchpad.net/neutron/+spec/agent-service-status

Jian Wen (wenjianhn) on 2014-02-16

Changed in neutron:
status:	In Progress → Confirmed
assignee:	Jian Wen (wenjianhn) → nobody

Eugene Nikanorov (enikanorov) on 2014-06-19

Changed in neutron:
assignee:	nobody → Eugene Nikanorov (enikanorov)
importance:	Undecided → Low
tags:	added: l3-ipam-dhcp

Jian Xu (jianxu1) wrote on 2014-08-22:

Is there any solution for recovering when dnsmasq is killed by the OS because of a system OOM? When we killall dnsmasq and restart the neutron DHCP agent, not a single dnsmasq is respawned.

Jian Xu (jianxu1) wrote on 2014-08-22:

Actually, restarting the DHCP agent should be able to bring dnsmasq back. We found the root cause: our disk was full, which prevented the dnsmasq processes from being brought back.

goocher (farmerworking) on 2015-03-26

Changed in neutron:
status:	Confirmed → In Progress

Livnat Peer (lpeer) wrote on 2015-06-13:

This issue was addressed using ProcessMonitor in the DHCP agent, in Kilo:
https://review.openstack.org/#/c/115935/

I think the process monitor is not active by default, so to activate it you need to change 'check_child_processes_period' from '0' to a value greater than 0, for example:

check_child_processes_period = 60
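
The idea behind a periodic check like this can be sketched as below. This is only an illustration of polling child processes on a fixed period, not neutron's ProcessMonitor; check_children, run_monitor, and the respawn callback are hypothetical names, and the sketch assumes that a period of 0 means the check is disabled, as described above.

```python
import threading


def check_children(children, respawn):
    """Poll each tracked child once and respawn the ones that have exited.

    children maps a name to a subprocess.Popen object; respawn(name) must
    return a new Popen for that name. Returns the names that were restarted.
    """
    restarted = []
    for name, proc in list(children.items()):
        if proc.poll() is not None:  # the child has exited
            children[name] = respawn(name)
            restarted.append(name)
    return restarted


def run_monitor(children, respawn, period, stop_event):
    """Run check_children every `period` seconds until stop_event is set.

    Returns False immediately when the period is 0 or less (disabled).
    """
    if period <= 0:
        return False
    while not stop_event.wait(period):
        check_children(children, respawn)
    return True
```

With a period of 60, a dead child would be noticed within about a minute; this polling approach trades detection latency for not needing a waiting thread per child.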

Miguel Angel Ajo (mangelajo) wrote on 2015-06-13:

#10

True, this is fixed. Thanks, Livnat!

Changed in neutron:
assignee:	Eugene Nikanorov (enikanorov) → Miguel Angel Ajo (mangelajo)
status:	In Progress → Fix Released


Duplicates of this bug

Bug #1364712
