when neutron-dhcp-agent dies / moves dnsmasq is left running

Bug #1285929 reported by Andrew Woodward
This bug affects 1 person
Affects             Status        Importance  Assigned to       Milestone
Fuel for OpenStack  Fix Released  High        Sergey Vasilenko
4.1.x               Fix Released  High        Sergey Vasilenko
5.0.x               Fix Released  High        Sergey Vasilenko

Bug Description

When neutron-dhcp-agent is killed by SIGTERM or SIGKILL, all of the dnsmasq processes are left running. When neutron-dhcp-agent restarts on the same node, it removes the old dnsmasq instances; however, when neutron-dhcp-agent is moved to another controller, dnsmasq is permanently orphaned. According to a report in IRC, this caused failures to receive DHCP addresses until crm was ordered to restart neutron-dhcp-agent again.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

You should not kill agent processes unless you also perform all the additional cleanup steps yourself (rescheduling of messages, port cleanup, killing dnsmasq instances, and so on). We can provide a more reliable OCF resource agent for Neutron services, but this can be left for the next release.

Changed in fuel:
milestone: none → 5.0
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/78178

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

An experiment shows that dnsmasq does not die even if the interface it uses is removed.

I see two ways to resolve this issue:
* BAD: pgrep all dnsmasq processes and kill each one that has the namespace name in its configuration options.
* GOOD: find and kill all processes running in the given network namespace.

The good way requires updating the iproute2 package on both OSes.
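
The GOOD option could look roughly like the sketch below. It assumes an iproute2 release that provides `ip netns pids`; the function name `kill_netns_processes` is hypothetical, and this is not the code from the proposed fix.

```shell
# Sketch only: signal every process that lives in one network namespace.
# Assumes an iproute2 new enough to support "ip netns pids".
kill_netns_processes() {
    target_ns="$1"
    target_sig="${2:-TERM}"
    # "ip netns pids NAME" prints one PID per line for the named namespace.
    for target_pid in $(ip netns pids "$target_ns" 2>/dev/null); do
        # Ignore processes that exited between listing and signalling.
        kill -s "$target_sig" "$target_pid" 2>/dev/null || true
    done
}
```

Called as `kill_netns_processes qdhcp-<network-id>`, this sends SIGTERM to everything in that namespace, so it should only be pointed at namespaces owned by the agent being stopped.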

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/78178
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=70e813d5b6b26dba0cd763ce24eab27747f4b573
Submitter: Jenkins
Branch: master

commit 70e813d5b6b26dba0cd763ce24eab27747f4b573
Author: Sergey Vasilenko <email address hidden>
Date: Wed Mar 5 15:40:04 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant to mysql and keystone temporary fails.

    In this implementation cleanup-script does not get information from Neutron API.
    Script inspects network namespaces on this node for given agent type and removes
    found ports from integration bridge.

    Closes-bug: #1287716
    Partial-bug: #1285929
    Change-Id: I2dfb31f240dca652341c4623f237f6a143414448

Andrew Woodward (xarses)
tags: added: backports-4.1.1
Revision history for this message
Brad Durrow (l-brad) wrote :

Here is another way to find the IDs of processes that have network ports open (both servers and clients) in a namespace:

[root@fuelpxe01 ~]# ssh node-16 ip netns exec qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i
Warning: Permanently added 'node-16' (RSA) to the list of known hosts.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq 9928 nobody 3u IPv4 2507813106 0t0 UDP *:bootps
dnsmasq 9928 nobody 5u IPv4 2507813112 0t0 UDP 10.29.8.2:domain
dnsmasq 9928 nobody 6u IPv4 2507813113 0t0 TCP 10.29.8.2:domain (LISTEN)
dnsmasq 9928 nobody 7u IPv6 2507813114 0t0 UDP [fe80::f816:3eff:fe61:1b64]:domain
dnsmasq 9928 nobody 8u IPv6 2507813115 0t0 TCP [fe80::f816:3eff:fe61:1b64]:domain (LISTEN)

If you only want the process ID, add -t:
[root@fuelpxe01 ~]# ssh node-16 ip netns exec qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i -t
Warning: Permanently added 'node-16' (RSA) to the list of known hosts.
9928

I believe you could also use fuser to find the processes and kill them in one step.

This is not appropriate for the l3-agent resource (in the qrouter-* namespace), because /usr/bin/neutron-ns-metadata-proxy is also returned when you run a similar command in the qrouter namespace.
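
The lsof approach in this comment could be wrapped in a small helper, sketched here under assumptions: the function name `dhcp_ns_socket_pids` is hypothetical, and the `qdhcp-` prefix guard is one possible way to honour the caveat about qrouter namespaces.

```shell
# Hypothetical helper: print PIDs that hold IP sockets in a DHCP namespace.
dhcp_ns_socket_pids() {
    ns_name="$1"
    case "$ns_name" in
        qdhcp-*) ;;
        *)
            # Guard against qrouter-* and other namespaces (see caveat above).
            echo "refusing non-DHCP namespace: $ns_name" >&2
            return 1
            ;;
    esac
    # lsof -i selects Internet sockets; -t prints bare PIDs, one per line.
    ip netns exec "$ns_name" lsof -i -t 2>/dev/null | sort -un
}
```

The output can be fed to `kill`, for example `kill $(dhcp_ns_socket_pids qdhcp-<network-id>)`, after checking that the list is not empty.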

Revision history for this message
Brad Durrow (l-brad) wrote :

I applied this patch and ultimately restarted each of the controllers in turn. It appears that the namespaces were not being created as expected. This led to an L2 loop on my network. I reverted the change.

Revision history for this message
Sergey Vasilenko (xenolog) wrote : Re: [Bug 1285929] Re: when neutron-dhcp-agent dies / moves dnsmasq is left running

On Sat, Mar 15, 2014 at 3:56 PM, Brad Durrow <email address hidden> wrote:

> Here is another way to calculate the processes id with network ports
> open (both servers and clients) in a namespace:
>
> [root@fuelpxe01 ~]# ssh node-16 ip netns exec
> qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i
> Warning: Permanently added 'node-16' (RSA) to the list of known hosts.
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> dnsmasq 9928 nobody 3u IPv4 2507813106 0t0 UDP *:bootps
> dnsmasq 9928 nobody 5u IPv4 2507813112 0t0 UDP 10.29.8.2:domain
> dnsmasq 9928 nobody 6u IPv4 2507813113 0t0 TCP 10.29.8.2:domain
> (LISTEN)
> dnsmasq 9928 nobody 7u IPv6 2507813114 0t0 UDP
> [fe80::f816:3eff:fe61:1b64]:domain
> dnsmasq 9928 nobody 8u IPv6 2507813115 0t0 TCP
> [fe80::f816:3eff:fe61:1b64]:domain (LISTEN)
>
> If you only wanted the process id you would add a -t
> [root@fuelpxe01 ~]# ssh node-16 ip netns exec
> qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i -t
> Warning: Permanently added 'node-16' (RSA) to the list of known hosts.
> 9928
>

Brad, big thanks for this approach. I am going to implement it.

/sv

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/89557

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/89557
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=07cf85f983da00390f4ff994e66ece8c20186e24
Submitter: Jenkins
Branch: master

commit 07cf85f983da00390f4ff994e66ece8c20186e24
Author: Sergey Vasilenko <email address hidden>
Date: Tue Apr 22 16:59:49 2014 +0400

    kill all proceses inside all dhcp-agent's net.namespaces,

    that using ip protocol. When dhcp agent stops.

    Change-Id: Ie84fdc70edaad3ab898bdb577f2dae2aeb9462d3
    Closes-Bug: #1285929
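
A rough sketch of the behaviour this commit message describes, for illustration only. It is not the fuel-library OCF script; the function name `cleanup_dhcp_namespaces` and the choice of SIGTERM are assumptions.

```shell
# Hypothetical stop-time cleanup, not the actual fuel-library OCF code.
cleanup_dhcp_namespaces() {
    # Newer iproute2 may append "(id: N)" to each name, so keep field 1 only.
    for dhcp_ns in $(ip netns list 2>/dev/null | awk '$1 ~ /^qdhcp-/ {print $1}'); do
        # Every process with an IP socket open inside this DHCP namespace.
        for ns_pid in $(ip netns exec "$dhcp_ns" lsof -i -t 2>/dev/null); do
            kill -s TERM "$ns_pid" 2>/dev/null || true
        done
    done
}
```

Filtering on the `qdhcp-` prefix keeps the loop away from qrouter-* namespaces, where unrelated processes also hold IP sockets.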

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/90220

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/90220
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a4dc785d4a9796aef6cd406bb4cd764f17b0bce3
Submitter: Jenkins
Branch: stable/4.1

commit a4dc785d4a9796aef6cd406bb4cd764f17b0bce3
Author: Sergey Vasilenko <email address hidden>
Date: Tue Apr 22 16:59:49 2014 +0400

    kill all proceses inside all dhcp-agent's net.namespaces,

    that using ip protocol. When dhcp agent stops.

    Change-Id: Ie84fdc70edaad3ab898bdb577f2dae2aeb9462d3
    Closes-Bug: #1285929

Revision history for this message
Andrew Woodward (xarses) wrote :
Andrew Woodward (xarses)
tags: added: ha
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/96840

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/96840
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=7f3a94c1df5564dd7117b9f5c702da8bcf0c03fe
Submitter: Jenkins
Branch: stable/4.1

commit 7f3a94c1df5564dd7117b9f5c702da8bcf0c03fe
Author: Sergey Vasilenko <email address hidden>
Date: Wed Mar 5 15:40:04 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant to mysql and keystone temporary fails.

    In this implementation cleanup-script does not get information from Neutron API.
    Script inspects network namespaces on this node for given agent type and removes
    found ports from integration bridge.

    Closes-bug: #1287716
    Partial-bug: #1285929
    Change-Id: I2dfb31f240dca652341c4623f237f6a143414448

Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Documented in 4.1.1 Release Notes

Changed in fuel:
status: Fix Committed → Fix Released