neutron_dhcp side container is racy
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Michele Baldessari |
Bug Description
We now see deployment failures in which the overcloud is unable to PXE/DHCP boot during the initial stages of the deployment. The following errors are seen in the neutron dhcp logs:
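2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277
2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110
2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
+ trap 'exec 2>&4 1>&3' 0 1 2 3
+ exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     self.reload_cfg()
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     self.disable('HUP')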
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.738 54481 ERROR neutron.
2020-03-11 17:58:33.740 54481 DEBUG neutron.
The issue is that the dhcp side containers are spawned with the following processes:
| `-{conmon}(375908)
Now when neutron wants to send a SIGHUP to the dnsmasq it actually invokes the following command:
nsenter --net=/
The problem is that podman kill sends the signal to "dumb-init --single-child" (pid 1 of this container), which forwards it only to bash. This causes dnsmasq to be terminated and later respawned with a different pid (stored in /var/lib/
So if multiple ports are created concurrently, this is racy: one of the reloads can fail with the error above, because one process might use a pid file that is no longer valid.
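The stale-pid window described above can be reproduced in isolation. The following is a minimal simulation on a POSIX system, not neutron's or tripleo's actual code: the helper names, the temporary pid file, and the sleeping Python process standing in for dnsmasq are all assumptions made for illustration.

```python
import os
import subprocess
import sys
import tempfile


def spawn_daemon(pid_file):
    """Start a placeholder long-running process and record its pid in pid_file."""
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    with open(pid_file, "w") as fh:
        fh.write(str(proc.pid))
    return proc


def read_pid(pid_file):
    with open(pid_file) as fh:
        return int(fh.read())


def process_exists(pid):
    """Probe a pid with signal 0: checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def simulate_stale_pid():
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = os.path.join(tmp, "pid")
        first = spawn_daemon(pid_file)
        stale_pid = read_pid(pid_file)  # caller A reads the pid file

        # Caller B's reload terminates the daemon, which is then respawned
        # with a new pid (hypothetical stand-in for the wrapper's behaviour).
        first.terminate()
        first.wait()
        second = spawn_daemon(pid_file)

        fresh_pid = read_pid(pid_file)
        still_there = process_exists(stale_pid)  # caller A acts on its old pid

        second.terminate()
        second.wait()
        return stale_pid, fresh_pid, still_there
```

Calling `simulate_stale_pid()` returns the pid that caller A read, the pid now in the file, and whether caller A's pid still refers to a live process. In this sketch the two pids differ and the old one is gone, which mirrors a reload that targets a pid that is no longer valid.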
TLDR: this all works if SIGHUP to the dnsmasq process does not change pids under the hood all of a sudden.
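One way to keep the pid stable, and the approach named in the title of the fix, is for the wrapper shell to `exec` the daemon instead of running it as a child. The sketch below only illustrates that general shell behaviour; the two script strings are hypothetical stand-ins, with the inner `sh -c 'echo $$'` playing the role of dnsmasq, and are not the actual tripleo wrapper.

```python
import subprocess

# Wrapper runs the "daemon" as a child process; the trailing `true` keeps the
# shell from optimising the last command into an implicit exec.
WITHOUT_EXEC = "echo $$; sh -c 'echo $$'; true"

# Wrapper replaces itself with the "daemon", so the pid is preserved.
WITH_EXEC = "echo $$; exec sh -c 'echo $$'"


def wrapper_and_child_pids(script):
    """Run a wrapper script and return (wrapper pid, pid seen by the inner command)."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    wrapper_pid, child_pid = (int(token) for token in result.stdout.split())
    return wrapper_pid, child_pid
```

With `WITHOUT_EXEC` the two pids differ, so a signal aimed at the wrapper reaches a different process than the one doing the work. With `WITH_EXEC` they are equal, so the process that receives the signal is the daemon itself.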
Changed in tripleo:
status: Triaged → In Progress
tags: added: queens-backport-potential stein-backport-potential
Reviewed: https://review.opendev.org/712685
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3ca7e8f03faab0ee599a18bce4217430583e2050
Submitter: Zuul
Branch: master
commit 3ca7e8f03faab0ee599a18bce4217430583e2050
Author: Michele Baldessari <email address hidden>
Date: Thu Mar 12 15:01:25 2020 +0100
Use exec when spawning dnsmasq inside sidecar container
We see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs:
2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277
2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110
2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
+ trap 'exec 2>&4 1>&3' 0 1 2 3
+ exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     self.reload_cfg()
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent     self.disable('HUP')
2020-03-11 17:58:33.738 54481 ERROR neu...