If neutron spawned dnsmasq dies, neutron-dhcp-agent will be totally unaware

Bug #1257524 reported by Clint Byrum
This bug affects 5 people
Affects   Status         Importance   Assigned to        Milestone
neutron   Fix Released   Low          Miguel Angel Ajo

Bug Description

I recently had some trouble with dnsmasq segfaulting in certain situations. No doubt this was a bug in dnsmasq. However, it was quite troubling that Neutron never noticed that dnsmasq had stopped working. This is because dnsmasq is spawned as a daemon, even though it is most definitely "owned" by neutron-dhcp-agent. Also, if neutron-dhcp-agent should die, dnsmasq, being a daemon, will continue to run and become "stale", requiring manual intervention to clean up. If it ran in the foreground instead, it would stay in neutron-dhcp-agent's process group, so it should die along with the agent and, if need be, get cleaned up by init.

I did some analysis and will not be able to dig into the actual implementation. However my analysis shows that this would work:

* use utils.create_process instead of execute and remember returned Popen object.
* spawn a greenthread to wait() on the process or create a SIGCHLD handler
* if it dies, restart it and log the error code
* pass the -k option so dnsmasq stays in foreground
* kill the process using child signals

Not sure how or whether SIGCHLD plays a role.
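
The steps listed above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not Neutron's implementation: it uses subprocess.Popen in place of utils.create_process, a plain thread in place of a greenthread, and the class name SupervisedProcess and the max_restarts parameter are hypothetical.

```python
import logging
import subprocess
import threading

LOG = logging.getLogger(__name__)


class SupervisedProcess:
    """Keep one foreground child running and restart it when it exits.

    Hypothetical helper for illustration; not part of Neutron.
    """

    def __init__(self, cmd, max_restarts=3):
        self.cmd = list(cmd)
        self.max_restarts = max_restarts  # assumed limit, not from the report
        self.exit_codes = []              # return codes observed so far
        self._stopping = threading.Event()
        self._lock = threading.Lock()
        self._proc = None
        self._thread = threading.Thread(target=self._watch, daemon=True)

    def start(self):
        self._thread.start()

    def _watch(self):
        while not self._stopping.is_set():
            with self._lock:
                if self._stopping.is_set():
                    break
                # Remember the Popen object so the child can be waited on
                # and signalled later.
                self._proc = subprocess.Popen(self.cmd)
            rc = self._proc.wait()  # blocks until the child exits
            if self._stopping.is_set():
                break  # we asked it to stop; nothing to report
            self.exit_codes.append(rc)
            LOG.error("child %s exited with code %s", self.cmd[0], rc)
            if len(self.exit_codes) > self.max_restarts:
                LOG.error("giving up after %d restarts", self.max_restarts)
                break

    def stop(self):
        self._stopping.set()
        with self._lock:
            proc = self._proc
        if proc is not None and proc.poll() is None:
            proc.terminate()  # sends SIGTERM to the child
        if self._thread.is_alive():
            self._thread.join()
```

For dnsmasq, the command list would begin with something like ["dnsmasq", "--keep-in-foreground", ...] (the long form of -k), with the remaining arguments being whatever the agent already passes. Because wait() runs in its own thread, no SIGCHLD handler is needed in this variant.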

Tags: l3-ipam-dhcp
Revision history for this message
Jian Wen (wenjianhn) wrote :

related to bug 1244783

description: updated
Jian Wen (wenjianhn)
Changed in neutron:
assignee: nobody → Jian Wen (wenjianhn)
status: New → In Progress
Revision history for this message
yong sheng gong (gongysh) wrote :

I think we don't want to kill dnsmasq if the dhcp-agent dies by design.

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 1257524] Re: If neutron spawned dnsmasq dies, neutron-dhcp-agent will be totally unaware

Excerpts from yong sheng gong's message of 2013-12-04 02:32:13 UTC:
> I think we don't want to kill dnsmasq if the dhcp-agent dies by design.
>

I'd be interested in hearing the reasons for that and making sure they are
encoded somewhere in the documentation. I would prefer that the dnsmasq
go away with the agent, as a dnsmasq without an agent is a dnsmasq that
is disseminating outdated information.

Revision history for this message
Jian Wen (wenjianhn) wrote :

Generally a network is hosted by multiple DHCP agents, so it's OK if
one of the dnsmasq processes is killed.

If a user updates a port's fixed IP, the instance may get the stale IP
address from the stale dnsmasq process.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

I found the same problem. We need some kind of solution for HA environments, and to make neutron agents more robust.

We have several different things here:

1) If a dnsmasq (or neutron-metadata-proxy) dies, we might want to log it and try restarting it (with a retry limit),
     a) because, if dnsmasq dies (error in dnsmasq or system problem), neutron needs to be aware that this tenant network has no DHCP
     b) because we want the process up again serving DHCP.

2) If we hit the respawn retry limit, we might want to force the agent to die (not everybody will want this, so I propose making it a setting). Why?

     c) If respawning is impossible, this neutron-dhcp-agent becomes useless for at least one tenant network. If we are using some kind of HA manager on top of neutron (pacemaker, etc.), we want that tool to become aware of the situation and respawn this agent somewhere else, or even reboot the offending host.

3) If we force neutron-dhcp-agent to stop (as opposed to restart), we might want dnsmasq to be killed and the network namespaces cleaned up.
      This could be done by netns_cleanup_util.py, but at this moment that tool cannot distinguish between qrouter- (L3 agent) and qdhcp- (DHCP agent) namespaces.

     https://github.com/openstack/neutron/blob/master/neutron/agent/netns_cleanup_util.py
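
Points 1) and 2) above amount to a small decision rule: respawn until a retry limit is hit, then either keep the agent alive or make it exit so that an HA manager can react. A minimal sketch of that rule follows; the names RespawnPolicy and GiveUpAction and the per-network failure counter are hypothetical and do not come from Neutron.

```python
import enum


class GiveUpAction(enum.Enum):
    """What to do once the retry limit is exceeded (assumed setting)."""
    KEEP_RUNNING = "keep_running"  # log it and leave the network without DHCP
    EXIT_AGENT = "exit_agent"      # let an HA manager notice and fail over


class RespawnPolicy:
    """Decide whether a dead child process should be respawned."""

    def __init__(self, retry_limit, on_give_up=GiveUpAction.KEEP_RUNNING):
        self.retry_limit = retry_limit
        self.on_give_up = on_give_up
        self._failures = {}  # network id -> consecutive failures

    def record_exit(self, network_id):
        """Return "respawn", or the configured give-up action's value."""
        count = self._failures.get(network_id, 0) + 1
        self._failures[network_id] = count
        if count <= self.retry_limit:
            return "respawn"
        return self.on_give_up.value

    def record_healthy(self, network_id):
        """Reset the counter once the child is seen running again."""
        self._failures.pop(network_id, None)
```

Keeping the counter per network means that one tenant network with a persistently crashing dnsmasq does not use up the retry budget of the others.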

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :
Jian Wen (wenjianhn)
Changed in neutron:
status: In Progress → Confirmed
assignee: Jian Wen (wenjianhn) → nobody
Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
importance: Undecided → Low
tags: added: l3-ipam-dhcp
Revision history for this message
Jian Xu (jianxu1) wrote :

Is there any solution to recover when dnsmasq is killed by the OS because of a system OOM? When we killall dnsmasq and restart the neutron DHCP agent, not a single dnsmasq is respawned.

Revision history for this message
Jian Xu (jianxu1) wrote :

Actually, restarting the DHCP agent should be able to bring dnsmasq back. We found the root cause: our disk was full, which caused the dnsmasq processes to fail to come back.

goocher (farmerworking)
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Livnat Peer (lpeer) wrote :

This issue was addressed in Kilo using ProcessMonitor in the DHCP agent:
https://review.openstack.org/#/c/115935/

I think that the process monitor is not active by default, so if you want to activate it you need to change 'check_child_processes_period' from '0' to a value greater than 0, for example:

check_child_processes_period = 60
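
To make the "0 means disabled" reading concrete, here is a small sketch that parses such a setting with Python's configparser. The option name is taken as quoted in this comment; the [AGENT] section name and the helper functions are assumptions for illustration, and a given Neutron release may name or place the option differently.

```python
import configparser

# Sample fragment; the [AGENT] section name is an assumption.
SAMPLE = """
[AGENT]
check_child_processes_period = 60
"""


def monitor_period(ini_text, section="AGENT"):
    """Return the configured period in seconds, or 0 if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint(section, "check_child_processes_period", fallback=0)


def monitor_enabled(ini_text):
    """The monitor counts as active only for a period greater than 0."""
    return monitor_period(ini_text) > 0
```

With the sample fragment the monitor is treated as enabled; with the option missing or set to 0 it is treated as disabled.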

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

True, this is fixed. Thanks, Livnat!

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Miguel Angel Ajo (mangelajo)
status: In Progress → Fix Released