neutron_dhcp side container is racy

Bug #1867192 reported by Michele Baldessari
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Michele Baldessari

Bug Description

We now see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs:
2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277
2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110
2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
+ trap 'exec 2>&4 1>&3' 0 1 2 3
+ exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.reload_cfg()
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.disable('HUP')
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent utils.execute(cmd, run_as_root=self.run_as_root)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent returncode=returncode)
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + trap 'exec 2>&4 1>&3' 0 1 2 3
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent + exec
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent
2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent
2020-03-11 17:58:33.740 54481 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/a187b137-b68c-476e-bd37-39253158e762.pid.haproxy get_value_from_file /usr/lib/pyt

The issue is that the dhcp side containers are spawned with the following processes:
           |-conmon(375906)-+-dumb-init(375918)---bash(375935)---dnsmasq(375938)
           | `-{conmon}(375908)

Now when neutron wants to send a SIGHUP to the dnsmasq it actually invokes the following command:
nsenter --net=/run/netns/qdhcp-e11ac152-745f-4292-b423-d282fbf97f13 --preserve-credentials -m -t 1 podman kill --signal HUP 0b7371f6de52cfe377858...

now the problem is that podman kill will send the signal to "dumb-init --single-child" (pid1 for this container) which will then forward it only to bash, which will cause dnsmasq to be terminated and will eventually be later respawned with a different pid (stored in /var/lib/neutron/dhcp/86884abc-f7d7-4118-923f-38b247fee8e9/pid).

So if multiple ports are created concurrently this is racy and one of them will fail to reload dnsmasq with the error above, because one process might use a pid file that is no longer valid.
TLDR: this all works if SIGHUP to the dnsmasq process does not change pids under the hood all of a sudden.

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)
Download full text (5.8 KiB)

Reviewed: https://review.opendev.org/712685
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3ca7e8f03faab0ee599a18bce4217430583e2050
Submitter: Zuul
Branch: master

commit 3ca7e8f03faab0ee599a18bce4217430583e2050
Author: Michele Baldessari <email address hidden>
Date: Thu Mar 12 15:01:25 2020 +0100

    Use exec when spawning dnsmasq inside sidecar container

    We see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs:

    2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277
    2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110
    2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
    + trap 'exec 2>&4 1>&3' 0 1 2 3
    + exec
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True)
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.reload_cfg()
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.disable('HUP')
    2020-03-11 17:58:33.738 54481 ERROR neu...

Read more...

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/712948

tags: added: queens-backport-potential stein-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)
Download full text (5.9 KiB)

Reviewed: https://review.opendev.org/712948
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=76de3fd931ca2e1e162fc2f5f14fcdd76cbbc851
Submitter: Zuul
Branch: stable/train

commit 76de3fd931ca2e1e162fc2f5f14fcdd76cbbc851
Author: Michele Baldessari <email address hidden>
Date: Thu Mar 12 15:01:25 2020 +0100

    Use exec when spawning dnsmasq inside sidecar container

    We see some deployment failures where the overcloud is unable to PXE/DHCP boot during the initial bits of the deployments. The following errors are seen in neutron dhcp logs:

    2020-03-11 17:58:33.737 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Resync event has been scheduled _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:277
    2020-03-11 17:58:33.737 54481 DEBUG neutron.common.utils [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] Calling throttled function clear wrapper /usr/lib/python3.6/site-packages/neutron/common/utils.py:110
    2020-03-11 17:58:33.738 54481 DEBUG neutron.agent.dhcp.agent [req-6caace19-095f-4115-be85-644f7a8baa7f - - - - -] resync (a187b137-b68c-476e-bd37-39253158e762): [ProcessExecutionError("Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec\n+ trap 'exec 2>&4 1>&3' 0 1 2 3\n+ exec\n",)] _periodic_resync_helper /usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:294
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent [-] Unable to reload_allocations dhcp for a187b137-b68c-476e-bd37-39253158e762.: neutron_lib.exceptions.ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: ; Stderr: + exec
    + trap 'exec 2>&4 1>&3' 0 1 2 3
    + exec
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 160, in call_driver
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 528, in reload_allocations
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=True)
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/dhcp.py", line 470, in _spawn_or_reload_process
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP, ensure_active=True)
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 92, in enable
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.reload_cfg()
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 100, in reload_cfg
    2020-03-11 17:58:33.738 54481 ERROR neutron.agent.dhcp.agent self.disable('HUP')
    2020-03-11 17:58:33.738 54481 ERR...

Read more...

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/716763

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/716763
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5b799136facc15d4e69bcede52b60d39a4a02464
Submitter: Zuul
Branch: master

commit 5b799136facc15d4e69bcede52b60d39a4a02464
Author: Bernard Cafarelli <email address hidden>
Date: Wed Apr 1 22:06:59 2020 +0200

    Use exec when spawning any neutron sidecar container

    Similarly to dnsmasq [0], other processes can receive SIGHUP. This is
    allowed by rootwrap filters for all processes [1], and I found some
    examples when running neutron-tempest-plugin tests checking
    l3-agent.log files:
    Running command (rootwrap daemon): ['radvd-kill', 'HUP', '712810'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
    Running command (rootwrap daemon): ['keepalived-kill', 'HUP', '402009'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103

    To prevent additional similar issues, apply similar fix for these
    sidecar containers too

    Related-Bug: #1867192

    [0] I1af2ecd9e3996de4f43224f66a8bdb81eab07022
    [1] https://opendev.org/openstack/neutron/src/branch/master/etc/neutron/rootwrap.d

    Change-Id: I31237d21527a2909a1669cb6c80cc0fa9be798a6

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/717403

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/717403
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=cd29180c34f59c62e53704ff10e5de4158f1d5d2
Submitter: Zuul
Branch: stable/train

commit cd29180c34f59c62e53704ff10e5de4158f1d5d2
Author: Bernard Cafarelli <email address hidden>
Date: Wed Apr 1 22:06:59 2020 +0200

    Use exec when spawning any neutron sidecar container

    Similarly to dnsmasq [0], other processes can receive SIGHUP. This is
    allowed by rootwrap filters for all processes [1], and I found some
    examples when running neutron-tempest-plugin tests checking
    l3-agent.log files:
    Running command (rootwrap daemon): ['radvd-kill', 'HUP', '712810'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
    Running command (rootwrap daemon): ['keepalived-kill', 'HUP', '402009'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103

    To prevent additional similar issues, apply similar fix for these
    sidecar containers too

    Related-Bug: #1867192

    [0] I1af2ecd9e3996de4f43224f66a8bdb81eab07022
    [1] https://opendev.org/openstack/neutron/src/branch/master/etc/neutron/rootwrap.d

    Change-Id: I31237d21527a2909a1669cb6c80cc0fa9be798a6
    (cherry picked from commit 5b799136facc15d4e69bcede52b60d39a4a02464)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.2.0

This issue was fixed in the openstack/tripleo-heat-templates 12.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/713395
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0c88c8898d44f9608ca33b8699b5c70bb5124f3b
Submitter: Zuul
Branch: master

commit 0c88c8898d44f9608ca33b8699b5c70bb5124f3b
Author: Michele Baldessari <email address hidden>
Date: Tue Mar 17 11:17:48 2020 +0100

    Use exec for ovn metadata agent

    Let's use exec also for the ovn metadata agent.
    This was already done in the same vein via
    I31237d21527a2909a1669cb6c80cc0fa9be798a6
    I1af2ecd9e3996de4f43224f66a8bdb81eab07022

    for dnsmasq and other neutron sidecar containers.
    In this case it does not fix anything specifically
    bit it avoids a spurious bash process on the system
    so it makes sense nonetheless.

    Related-Bug: #1867192

    Change-Id: If4b9c38e8cd2c531cbc984f6f79fb79111204b25

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.