DHCP agents time out during startup at 60s when there is enough agents

Bug #1849676 reported by Olli Tuovinen on 2019-10-24
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Medium
Rodolfo Alonso

Bug Description

The following introduces a 60s timeout to DHCP agent startups:

https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a#diff-3fcbcfeebb7de79a1cb36faed9b8b091

The value is not adjustable from conf.

When there's enough network elements (ie. ~1200 DHCP enabled subnets in our case), nearly all DHCP startups fail with:

2019-10-09 13:21:27.826 694156 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network 8b4b5496-8b35-482e-a2a3-7c352f1e343a: WaitTimeout: Timed out after 60 seconds

Timeout happens due to operations happening in sequence with 100-300ms between each operation, and too many agents tried at the same time:

2019-10-09 12:24:17.836 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-13648243-5659-4094-9dd7-cee58e4d46ac', 'ip', '-o', 'link', 'show', 'tap3af1ee05-2f'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.075 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1407a320-8fca-4dfd-a011-96a2ad41779f', 'ip', '-o', 'link', 'show', 'tap21c443e1-89'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.266 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1412976d-4cd0-452a-91b7-7f8c3003c722', 'ip', '-o', 'link', 'show', 'tap1cb0995e-bd'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.541 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1433d8a6-fa06-4544-994a-d38b01302490', 'ip', '-o', 'link', 'show', 'tap00fcd6f1-b1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.735 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1447a8cf-e94c-4250-b9a8-2b13c0cf60c6', 'ip', '-o', 'link', 'show', 'tap91076dbc-83'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.930 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-14b396e1-d561-4087-990d-9b993cc08619', 'ip', '-o', 'link', 'show', 'tapdf767a28-af'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103

The following allows the agents to start:

- common_utils.wait_until_true(self._enable)
+ common_utils.wait_until_true(self._enable, timeout=300)

Few ways to solve this issue:
- Increase default timeout from 60s to a bigger number
- Make the timeout dhcp conf adjustable
- Figure out more optimal batch sizes of DHCP agents to be started at a time, and increase the startup performance

Brian Haley (brian-haley) wrote :

There was a somewhat related change proposed recently that could better handle a "bulk" operation in the dhcp-agent, https://review.opendev.org/#/c/682241/ - I wonder if that would help here?

tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Medium

Hello:

First of all, let me say that this problem is happening in Queens (but also reproducible in newer branches) and using PY2.

The patch introducing this regression is [1], because now we are checking with a timeout when the dnsmasq process is enabled (actually when the dnsmasq process finished the _enabled function correctly, because this won't check the status of the process, but this another issue).

In PY2, due to the amount of "root" commands executed (in Queens, the pyroute2 migration was starting) and the problem this Python version has with process.Popen() (15x slower than PY3), the DHCP start operation in a heavy loaded server takes more than this time. process.Popen() is used to communicate to rootwrap daemon to execute the "root" commands.

As a quick fix for this regression, we need to increase the timeout time as described in the bug description. This won't have the same impact when using PY3.

Regards.

[1] https://review.opendev.org/#/q/I18d73b787fa3ab8803e12d5e5eb2bb7109205aba

Olli Tuovinen (otuovinen) wrote :
Download full text (9.5 KiB)

Hello,

I managed to reproduce this in a test environment with 1000 dhcp enabled subnets.
Python 3.6 is faster (or at least chattier), but the same can still be seen.

Same 100-300ms exec delays with rootwrap visible.

RH with collections+pip for deps, so paths look interesting.

2019-10-28 15:51:02.510 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0ac39b4a-4517-466c-8857-54e4c9e8d416', 'ip', '-o', 'link', 'show', 'tapd1a110f8-06'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:02.807 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-01f0c21c-af3f-483f-b44c-77fa39862430', 'ip', '-o', 'link', 'show', 'tap04086413-55'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.088 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0ad0ae7b-3632-48be-9322-1177a82c17a6', 'ip', '-o', 'link', 'show', 'tapef523a14-5e'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.383 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-05063b4c-a030-4b92-af53-f47b19b64b9b', 'ip', '-o', 'link', 'show', 'tap9889d494-86'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.688 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-070c6f88-5000-4a01-a100-36c914a6e219', 'ip', '-o', 'link', 'show', 'tap0e6dc9db-49'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.981 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-04b2ae20-eecf-481b-bf63-a8809b454e97', 'ip', '-o', 'link', 'show', 'tapcc06a136-82'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:04.270 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0972f344-e157-48ec-a94d-25e5b63fc517', 'ip', '-o', 'link', 'show', 'tap2b239e1c-a7'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:04.555 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-010bf420-3113-4a5d-9598-7726777468f2', 'ip', 'link', 'set', 'tapde888fdc-0e', 'up'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:04.857 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0840e054-f880-4404-a0e6-5c96a3088626', 'ip', '-o', 'link', 'show', 'tap49ac8943-28'] execute_rootwrap_daemon /opt/rh/rh-python36/r...

Read more...

Fix proposed to branch: master
Review: https://review.opendev.org/692160

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
status: New → In Progress

Reviewed: https://review.opendev.org/692160
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aedc09917680bc77501f4b57142e73768e3d06ad
Submitter: Zuul
Branch: master

commit aedc09917680bc77501f4b57142e73768e3d06ad
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/692376
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=44c838e25f8d85063cca52c9581c423539604f6c
Submitter: Zuul
Branch: stable/train

commit 44c838e25f8d85063cca52c9581c423539604f6c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-train

Reviewed: https://review.opendev.org/692378
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ff0b7b00194d7ee38892c6c08de14f4da056ab2d
Submitter: Zuul
Branch: stable/rocky

commit ff0b7b00194d7ee38892c6c08de14f4da056ab2d
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/692379
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dfbfa8ced3c96ec5e2649f47d8ac9f4d5b0e168e
Submitter: Zuul
Branch: stable/queens

commit dfbfa8ced3c96ec5e2649f47d8ac9f4d5b0e168e
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/692377
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1bf02756f4ae5a6afb75ae171bc7c092a03a7bb6
Submitter: Zuul
Branch: stable/stein

commit 1bf02756f4ae5a6afb75ae171bc7c092a03a7bb6
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-stein

This issue was fixed in the openstack/neutron 13.0.6 release.

This issue was fixed in the openstack/neutron 14.0.4 release.

This issue was fixed in the openstack/neutron 15.0.1 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.