Comment 0 for bug 1849502

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote:

Hello:

Version: Train (but also reproducible in master)
Environment: one network with three subnets, and a single host running the DHCP agent.

We found a race condition between the DHCP agent driver calls and the dnsmasq process status check.

When the deployment process starts, a network is created with three subnets. The subnet creation timestamps are:
```
server.log.3:1828:2019-10-21 22:29:44.648 24 DEBUG neutron_lib.callbacks.manager [req-91242e64-648f-45e8-a5eb-de18c9301ea0 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
server.log.3:1875:2019-10-21 22:29:45.929 24 DEBUG neutron_lib.callbacks.manager [req-27ed89df-cbb0-4a3e-90db-208401560889 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
server.log.3:1974:2019-10-21 22:29:47.846 24 DEBUG neutron_lib.callbacks.manager [req-113792ac-9649-443e-9ee3-102ea02f5bd5 ae0c1e62ea1c4a14952ac2a64ef54690 10198262da2c44f1a9ec9e63084111d2 - default default] Notify callbacks ['neutron.services.segments.plugin.SegmentHostRoutes.host_routes_before_create-930335'] for subnet, before_create _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
```

As we can see in [1], the DHCP agent is informed about the different changes in the network and reacts depending on the current status. The first time, the dnsmasq process is enabled (no other process is running at this point). This happens at 22:29:45.340, between the first and the second subnet creation. The network information at this point contains only one subnet, retrieved by the DHCP agent in:
```
2019-10-21 22:29:45.152 27 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-535e0231-b69e-4136-8ce4-aad14c6fc9ac - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
```
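
For context, the agent reacts to these notifications by dispatching an action name to the DHCP driver; a minimal sketch, simplified from neutron.agent.dhcp.agent.DhcpAgent.call_driver (error handling omitted):
```
def call_driver(self, action, network, **action_kwargs):
    # 'action' is one of 'enable', 'disable', 'restart' or
    # 'reload_allocations'; the driver decides how to (re)configure
    # the dnsmasq process for this network.
    driver = self.dhcp_driver_cls(self.conf, network,
                                  self._process_monitor,
                                  self.dhcp_version, self.plugin)
    getattr(driver, action)(**action_kwargs)
```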

The next driver call is a "restart". This happens when the 2nd and 3rd subnets are created. When the DHCP agent calls the server to retrieve the network information, the network already has all three subnets:
```
2019-10-21 22:29:49.135 26 DEBUG neutron.api.rpc.handlers.dhcp_rpc [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] Network 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb requested from site-undercloud-0.localdomain get_network_info /usr/lib/python3.6/site-packages/neutron/api/rpc/handlers/dhcp_rpc.py:200
```
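
For reference, the driver's "restart" is essentially a disable followed by an enable, roughly as in the Train-era neutron.agent.linux.dhcp code (sketch):
```
def restart(self):
    """Restart the DHCP service for this network.

    This assumes disable() actually finds and stops the running
    dnsmasq before enable() spawns a new one with the updated
    command line (the new dhcp-ranges).
    """
    self.disable(retain_port=True)
    self.enable()
```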

What happens in the "restart" process is a bit weird [2]. The process should be stopped first and then started again with the new command line (including the new subnet dhcp-ranges --> the main problem detected in this bug). But, according to ProcessManager, the dnsmasq process is not running yet:
```
dhcp-agent.log.2:442:2019-10-21 22:29:49.417 57623 DEBUG neutron.agent.linux.external_process [req-29554255-1adb-4ec7-8655-39581e7fa72f - - - - -] No dnsmasq process started for 8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb disable /usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py:121
```
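
The "No dnsmasq process started" message comes from ProcessManager.disable() finding the process inactive. A self-contained approximation of what the "active" check does (the real logic lives in neutron/agent/linux/external_process.py; the helper names here are illustrative):
```
def _read_pid(pid_file):
    """Return the pid dnsmasq wrote to its pid file, or None if the
    file is not there yet (dnsmasq writes it only after forking)."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_active(pid_file, network_uuid):
    """Approximation of ProcessManager.active: the pid file must exist
    and /proc/<pid>/cmdline must mention the network uuid."""
    pid = _read_pid(pid_file)
    if pid is None:
        return False  # <-- the race window: pid file not written yet
    try:
        with open('/proc/%d/cmdline' % pid) as f:
            return network_uuid in f.read()
    except OSError:
        return False
```
If disable() runs inside that window, it logs the message above and returns without stopping anything, so the already-running dnsmasq (built with only one dhcp-range) is never stopped.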

However, the DHCP agent had actually started the dnsmasq process at:
```
2019-10-21 22:29:48.496 57623 DEBUG neutron.agent.linux.utils [req-9de39ee9-2b4f-44c1-be47-46922943dd17 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/host', '--addn-hosts=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/8c9e6c68-86ef-4bb0-b3fa-9a36a71d0ccb/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-dynamic', '--dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1500', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
```

That means dnsmasq had been started (with the single-subnet configuration --> note the single --dhcp-range in the command line above) only one second before the DHCP agent tried to stop it again.

To avoid this, when the DHCP agent starts the dnsmasq process, ProcessManager() must check, after enabling the process, that the process is actually active. If we ensure this, on the next agent call (and driver call --> dnsmasq), the process for a specific network is up and running, and ProcessManager can detect that the process is active.
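
A minimal sketch of that idea (the helper name, timeout and polling interval are hypothetical; something like neutron.common.utils.wait_until_true could do the polling instead):
```
import time

def enable_and_wait_active(process_manager, timeout=10, interval=0.5):
    """Hypothetical helper: after spawning dnsmasq, block until
    ProcessManager reports it as active (pid file written and
    /proc cmdline matching), so that a later restart/disable can
    actually find and stop the process."""
    process_manager.enable()
    deadline = time.time() + timeout
    while not process_manager.active:
        if time.time() >= deadline:
            raise RuntimeError('dnsmasq did not become active within '
                               '%s seconds' % timeout)
        time.sleep(interval)
```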

Regards.

[1] http://pastebin.test.redhat.com/808183
[2] http://pastebin.test.redhat.com/808187