when neutron-dhcp-agent dies / moves dnsmasq is left running

Bug #1285929 reported by Andrew Woodward on 2014-02-28
Affects             Importance  Assigned to
Fuel for OpenStack  High        Sergey Vasilenko
4.1.x               High        Sergey Vasilenko
5.0.x               High        Sergey Vasilenko

Bug Description

When neutron-dhcp-agent is killed by SIGTERM or SIGKILL, all of the dnsmasq processes are left running. When neutron-dhcp-agent restarts on the same node, it removes the stale dnsmasq instances; however, when neutron-dhcp-agent is moved to another controller, the dnsmasq processes on the original node are permanently orphaned. According to a report in IRC, this caused failures to receive DHCP addresses until crm was ordered to restart neutron-dhcp-agent again.

Vladimir Kuklin (vkuklin) wrote :

You should not kill agent processes unless you perform all the additional cleanup steps (rescheduling of messages, port cleanup, killing dnsmasq instances and so on) yourself. We can provide a more reliable OCF resource agent for neutron services, but this can be left for the next release.

Changed in fuel:
milestone: none → 5.0
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)

Fix proposed to branch: master
Review: https://review.openstack.org/78178

Changed in fuel:
status: Triaged → In Progress

Sergey Vasilenko (xenolog) wrote :

An experiment shows that dnsmasq does not die even if the interface it uses is removed.

I see two ways to resolve this issue:
* BAD: pgrep all dnsmasq processes and kill each one that contains the namespace name in its configuration options.
* GOOD: find and kill all processes running in the given network namespace.

The good way requires updating the iproute2 package on both OSes.
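
The "GOOD" option above could look roughly like the following shell sketch. It is only an illustration under stated assumptions, not the code from the eventual fix: it assumes an iproute2 recent enough to provide `ip netns pids`, and the function names `list_dhcp_namespaces` and `kill_namespace_processes` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; not taken from the fuel-library OCF script.

# Read `ip netns list` output on stdin and print only the DHCP agent
# namespaces (names starting with "qdhcp-"). Some iproute2 versions append
# " (id: N)" to each name, so only the first field is used.
list_dhcp_namespaces() {
    awk '$1 ~ /^qdhcp-/ { print $1 }'
}

# Send SIGTERM to every process running in the given network namespace.
# Assumes `ip netns pids` is available (needs a recent iproute2).
kill_namespace_processes() {
    ns="$1"
    pids=$(ip netns pids "$ns") || return 1
    # Nothing to do if the namespace holds no processes.
    [ -n "$pids" ] || return 0
    # Word splitting is intended: $pids is a whitespace-separated list.
    kill $pids
}
```

Run as root on a controller, the two helpers might be combined as `ip netns list | list_dhcp_namespaces | while read -r ns; do kill_namespace_processes "$ns"; done`. Note that this kills everything in the namespace, not only dnsmasq.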

Reviewed: https://review.openstack.org/78178
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=70e813d5b6b26dba0cd763ce24eab27747f4b573
Submitter: Jenkins
Branch: master

commit 70e813d5b6b26dba0cd763ce24eab27747f4b573
Author: Sergey Vasilenko <email address hidden>
Date: Wed Mar 5 15:40:04 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant to mysql and keystone temporary fails.

    In this implementation cleanup-script does not get information from Neutron API.
    Script inspects network namespaces on this node for given agent type and removes
    found ports from integration bridge.

    Closes-bug: #1287716
    Partial-bug: #1285929
    Change-Id: I2dfb31f240dca652341c4623f237f6a143414448

Andrew Woodward (xarses) on 2014-03-14
tags: added: backports-4.1.1

Brad Durrow (l-brad) wrote :

Here is another way to find the IDs of processes that have network ports open (both servers and clients) in a namespace:

[root@fuelpxe01 ~]# ssh node-16 ip netns exec qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i
Warning: Permanently added 'node-16' (RSA) to the list of known hosts.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq 9928 nobody 3u IPv4 2507813106 0t0 UDP *:bootps
dnsmasq 9928 nobody 5u IPv4 2507813112 0t0 UDP 10.29.8.2:domain
dnsmasq 9928 nobody 6u IPv4 2507813113 0t0 TCP 10.29.8.2:domain (LISTEN)
dnsmasq 9928 nobody 7u IPv6 2507813114 0t0 UDP [fe80::f816:3eff:fe61:1b64]:domain
dnsmasq 9928 nobody 8u IPv6 2507813115 0t0 TCP [fe80::f816:3eff:fe61:1b64]:domain (LISTEN)

If you only want the process ID, add -t:
[root@fuelpxe01 ~]# ssh node-16 ip netns exec qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i -t
Warning: Permanently added 'node-16' (RSA) to the list of known hosts.
9928

I believe you could also use fuser to find the processes and kill them in one step.

This is not appropriate for the l3-agent resource (in the qrouter-* namespace), because /usr/bin/neutron-ns-metadata-proxy is also returned when you run a similar command in the qrouter namespace.
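
One way to avoid hitting unrelated processes such as neutron-ns-metadata-proxy would be to filter the lsof listing by command name before killing anything. A minimal sketch, assuming the `lsof -i` column layout shown above; the function name `pids_for_command` is hypothetical and this is not the code that was eventually merged:

```shell
#!/bin/sh
# Hypothetical helper; not part of the original patch.
# Read `lsof -i` output on stdin and print the unique PIDs whose COMMAND
# column equals the given name. The header line is skipped.
# Caveat: lsof may truncate long command names in the COMMAND column,
# so the name passed in must match what lsof actually prints.
pids_for_command() {
    awk -v cmd="$1" 'NR > 1 && $1 == cmd { print $2 }' | sort -un
}
```

It could be used as `ip netns exec qdhcp-50644057-b518-4e85-843a-3321c9a4073f lsof -i | pids_for_command dnsmasq`, with the resulting PIDs passed to kill.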

Brad Durrow (l-brad) wrote :

I applied this patch and ultimately restarted each of the controllers in turn. It appears that the namespaces were not being created as expected. This led to an L2 loop on my network. I reverted the change.

Brad, many thanks for this approach. I am going to implement it.

/sv

Reviewed: https://review.openstack.org/89557
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=07cf85f983da00390f4ff994e66ece8c20186e24
Submitter: Jenkins
Branch: master

commit 07cf85f983da00390f4ff994e66ece8c20186e24
Author: Sergey Vasilenko <email address hidden>
Date: Tue Apr 22 16:59:49 2014 +0400

    kill all proceses inside all dhcp-agent's net.namespaces,

    that using ip protocol. When dhcp agent stops.

    Change-Id: Ie84fdc70edaad3ab898bdb577f2dae2aeb9462d3
    Closes-Bug: #1285929

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/90220
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a4dc785d4a9796aef6cd406bb4cd764f17b0bce3
Submitter: Jenkins
Branch: stable/4.1

commit a4dc785d4a9796aef6cd406bb4cd764f17b0bce3
Author: Sergey Vasilenko <email address hidden>
Date: Tue Apr 22 16:59:49 2014 +0400

    kill all proceses inside all dhcp-agent's net.namespaces,

    that using ip protocol. When dhcp agent stops.

    Change-Id: Ie84fdc70edaad3ab898bdb577f2dae2aeb9462d3
    Closes-Bug: #1285929

Andrew Woodward (xarses) on 2014-05-08
tags: added: ha

Reviewed: https://review.openstack.org/96840
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=7f3a94c1df5564dd7117b9f5c702da8bcf0c03fe
Submitter: Jenkins
Branch: stable/4.1

commit 7f3a94c1df5564dd7117b9f5c702da8bcf0c03fe
Author: Sergey Vasilenko <email address hidden>
Date: Wed Mar 5 15:40:04 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant to mysql and keystone temporary fails.

    In this implementation cleanup-script does not get information from Neutron API.
    Script inspects network namespaces on this node for given agent type and removes
    found ports from integration bridge.

    Closes-bug: #1287716
    Partial-bug: #1285929
    Change-Id: I2dfb31f240dca652341c4623f237f6a143414448

Meg McRoberts (dreidellhasa) wrote :

Documented in 4.1.1 Release Notes

Changed in fuel:
status: Fix Committed → Fix Released