[DHCP] Check the dnsmasq process status after enabling the process

Bug #1849502 reported by Rodolfo Alonso
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Undecided
Brian Haley

Bug Description

Hello:

Version: Train (but also reproducible in master)
Environment: 1 network with 3 subnets, and one host for the DHCP agent.

We found a race condition between the DHCP agent driver call and the dnsmasq process status.

When a deployment process starts, a network is created with three subnets. The subnet creation timestamp is:
```
server.log.3:1828:2019-10-21 22:29:44.648 24 DEBUG neutron_lib.callbacks.manager [req-91242e64-648f-45e8-a5eb-de18c9301ea0 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
server.log.3:1875:2019-10-21 22:29:45.929 24 DEBUG neutron_lib.callbacks.manager [req-27ed89df-cbb0-4a3e-90db-208401560889 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
server.log.3:1974:2019-10-21 22:29:47.846 24 DEBUG neutron_lib.callbacks.manager [req-113792ac-9649-443e-9ee3-102ea02f5bd5 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
```

As we can see in [1], the DHCP is informed about the different changes in the network and reacts depending on the current status. The first time, the DHCP process is enabled (no other process is running at this point). This happens at 22:29:45.340, between first subnet and second subnet creation. The network information at this point has only one subnet, retrieved by the DHCP agent in:
```
2019-10-21 22:29:45.152 27 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-535e0231-b69e-4136-8ce4-aad14c6fc9ac - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
```

The next driver call is a "restart". This happens when 2nd and 3rd subnets are created. When the DHCP agent calls the server to retrieve the network information, this network has all three subnets:
```
2019-10-21 22:29:49.135 26 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
```

What is happening in the "restart" process is a bit weird [2]. The process should be stopped first and then restarted with the new cmd line (including the new subnet dhcp-ranges --> the main problem detected in this bug). But the dnsmasq process is not running yet:
```
dhcp-agent.log.2:442:2019-10-21 22:29:49.417 57623 DEBUG neutron.agent.linux.external_process [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] No dnsmasq process started for 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:121
```

The DHCP agent started the dnsmasq process in:
```
2019-10-21 22:29:48.496 57623 DEBUG neutron.agent.linux.utils [req-9de39ee9-2b4f-44c1-be47-46922943dd17 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/host', '--addn-hosts=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-dynamic', '--dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1500', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
```

That means one second before the DHCP agent tried to stop it again.

To avoid this, when the DHCP agent starts the dnsmasq process, we need to ensure that ProcessManager() check after enabling the process that this process is active. If we ensure this, in the next agent call (and driver call - dnsmasq), the process for a specific network is up and running, and ProcessManager can detect the process is active.

Regards.

[1] http://paste.openstack.org/show/785519/
[2] http://paste.openstack.org/show/785520/

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690700

Changed in neutron:
status: New → In Progress
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

Isn't this also root cause of issue described in https://bugs.launchpad.net/neutron/+bug/1848738?

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Slawek:

The problem is very similar (if not the same), but the root cause described and the patches submitted are completely different.

In 1848738, the problem you are describing is due to the subnet (dhcp-range) tag assignation. In this bug description, the problem I see is a race condition between the moment the server is spawned and the moment this same dnsmas process is restarted again because there is a subnet change. Actually the problem is between the moment the DHCP agent tries to stop it and when the agent tries to start it again:
- Agent tries to stop it, but does not find the PID or the process running. The process then is not stopped.
- Agent tries to start the process (350 ms later), but at this moment, the process is active and the agent (using ProcessManager) does nothing because considers that the dnsmasq process is running and there is no need to execute the command

===> this will lead us to the problem we have: the dnsmasq process does not restart with the new subnet configuration (new added subnets) and the --dhcp-range parameters won't be updated.

Regards.

Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → Brian Haley (brian-haley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/690700
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee
Submitter: Zuul
Branch: master

commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 23 14:47:54 2019 +0000

    Check dnsmasq process is active when spawned

    After spawning the "dnsmasq" process in the method
    "Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
    process is running and could be detected by the ProcessManager instance
    controlling it.

    ProcessManager determines if a process is "active":
    - If the network ID is in the cmdline used to execute the process.
    - If the process is detected by psutil.Process(pid), returning the
      cmdline needed in the first condition.
    - If the PID file exists; this is written by the dnsmasq process
      once is started and is needed in the second condition.

    To make this feature available for any other process using
    ProcessManager, the implementation is done in this class.

    Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
    Closes-Bug: #1849502

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/691858

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/691858
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ce3f2f7d2690eec0dcab4073cbf357b683c6394b
Submitter: Zuul
Branch: stable/train

commit ce3f2f7d2690eec0dcab4073cbf357b683c6394b
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 23 14:47:54 2019 +0000

    Check dnsmasq process is active when spawned

    After spawning the "dnsmasq" process in the method
    "Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
    process is running and could be detected by the ProcessManager instance
    controlling it.

    ProcessManager determines if a process is "active":
    - If the network ID is in the cmdline used to execute the process.
    - If the process is detected by psutil.Process(pid), returning the
      cmdline needed in the first condition.
    - If the PID file exists; this is written by the dnsmasq process
      once is started and is needed in the second condition.

    To make this feature available for any other process using
    ProcessManager, the implementation is done in this class.

    Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
    Closes-Bug: #1849502
    (cherry picked from commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/694250

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/694250
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2d8613e3c4d29a4bb5be81bd78d749e0f448cffe
Submitter: Zuul
Branch: stable/stein

commit 2d8613e3c4d29a4bb5be81bd78d749e0f448cffe
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 23 14:47:54 2019 +0000

    Check dnsmasq process is active when spawned

    After spawning the "dnsmasq" process in the method
    "Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
    process is running and could be detected by the ProcessManager instance
    controlling it.

    ProcessManager determines if a process is "active":
    - If the network ID is in the cmdline used to execute the process.
    - If the process is detected by psutil.Process(pid), returning the
      cmdline needed in the first condition.
    - If the PID file exists; this is written by the dnsmasq process
      once is started and is needed in the second condition.

    To make this feature available for any other process using
    ProcessManager, the implementation is done in this class.

    Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
    Closes-Bug: #1849502
    (cherry picked from commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee)
    (cherry picked from commit ce3f2f7d2690eec0dcab4073cbf357b683c6394b)

tags: added: in-stable-stein
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.4

This issue was fixed in the openstack/neutron 14.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.1

This issue was fixed in the openstack/neutron 15.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/718206

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/718207

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/718206
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fc365c8c0d876f4cab09ba10c99277cb57714c4c
Submitter: Zuul
Branch: stable/rocky

commit fc365c8c0d876f4cab09ba10c99277cb57714c4c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 23 14:47:54 2019 +0000

    Check dnsmasq process is active when spawned

    After spawning the "dnsmasq" process in the method
    "Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
    process is running and could be detected by the ProcessManager instance
    controlling it.

    ProcessManager determines if a process is "active":
    - If the network ID is in the cmdline used to execute the process.
    - If the process is detected by psutil.Process(pid), returning the
      cmdline needed in the first condition.
    - If the PID file exists; this is written by the dnsmasq process
      once is started and is needed in the second condition.

    To make this feature available for any other process using
    ProcessManager, the implementation is done in this class.

    Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
    Closes-Bug: #1849502
    (cherry picked from commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee)
    (cherry picked from commit ce3f2f7d2690eec0dcab4073cbf357b683c6394b)
    (cherry picked from commit 2d8613e3c4d29a4bb5be81bd78d749e0f448cffe)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/718207
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e28e767bd6d0b3cbe64b929c07d3d82d683a074b
Submitter: Zuul
Branch: stable/queens

commit e28e767bd6d0b3cbe64b929c07d3d82d683a074b
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 23 14:47:54 2019 +0000

    Check dnsmasq process is active when spawned

    After spawning the "dnsmasq" process in the method
    "Dnsmasq._spawn_or_reload_process", we need to check that the "dnsmasq"
    process is running and could be detected by the ProcessManager instance
    controlling it.

    ProcessManager determines if a process is "active":
    - If the network ID is in the cmdline used to execute the process.
    - If the process is detected by psutil.Process(pid), returning the
      cmdline needed in the first condition.
    - If the PID file exists; this is written by the dnsmasq process
      once is started and is needed in the second condition.

    To make this feature available for any other process using
    ProcessManager, the implementation is done in this class.

    Change-Id: I51dc9d342c613afcbcfdc50a1d2811502748f170
    Closes-Bug: #1849502
    (cherry picked from commit 7c5ce50a0c9e09d2729ae1ce79d2623ccafca9ee)
    (cherry picked from commit ce3f2f7d2690eec0dcab4073cbf357b683c6394b)
    (cherry picked from commit 2d8613e3c4d29a4bb5be81bd78d749e0f448cffe)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron queens-eol

This issue was fixed in the openstack/neutron queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron rocky-eol

This issue was fixed in the openstack/neutron rocky-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.