DHCP agents time out during startup at 60s when there is enough agents

Bug #1849676 reported by Olli Tuovinen
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Rodolfo Alonso

Bug Description

The following introduces a 60s timeout to DHCP agent startups:

https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a#diff-3fcbcfeebb7de79a1cb36faed9b8b091

The value is not adjustable from conf.

When there's enough network elements (ie. ~1200 DHCP enabled subnets in our case), nearly all DHCP startups fail with:

2019-10-09 13:21:27.826 694156 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network 8b4b5496-8b35-482e-a2a3-7c352f1e343a: WaitTimeout: Timed out after 60 seconds

Timeout happens due to operations happening in sequence with 100-300ms between each operation, and too many agents tried at the same time:

2019-10-09 12:24:17.836 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-13648243-5659-4094-9dd7-cee58e4d46ac', 'ip', '-o', 'link', 'show', 'tap3af1ee05-2f'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.075 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1407a320-8fca-4dfd-a011-96a2ad41779f', 'ip', '-o', 'link', 'show', 'tap21c443e1-89'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.266 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1412976d-4cd0-452a-91b7-7f8c3003c722', 'ip', '-o', 'link', 'show', 'tap1cb0995e-bd'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.541 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1433d8a6-fa06-4544-994a-d38b01302490', 'ip', '-o', 'link', 'show', 'tap00fcd6f1-b1'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.735 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-1447a8cf-e94c-4250-b9a8-2b13c0cf60c6', 'ip', '-o', 'link', 'show', 'tap91076dbc-83'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-10-09 12:24:18.930 239392 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-14b396e1-d561-4087-990d-9b993cc08619', 'ip', '-o', 'link', 'show', 'tapdf767a28-af'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103

The following allows the agents to start:

- common_utils.wait_until_true(self._enable)
+ common_utils.wait_until_true(self._enable, timeout=300)

Few ways to solve this issue:
- Increase default timeout from 60s to a bigger number
- Make the timeout dhcp conf adjustable
- Figure out more optimal batch sizes of DHCP agents to be started at a time, and increase the startup performance

Revision history for this message
Brian Haley (brian-haley) wrote :

There was a somewhat related change proposed recently that could better handle a "bulk" operation in the dhcp-agent, https://review.opendev.org/#/c/682241/ - I wonder if that would help here?

tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

First of all, let me say that this problem is happening in Queens (but also reproducible in newer branches) and using PY2.

The patch introducing this regression is [1], because now we are checking with a timeout when the dnsmasq process is enabled (actually when the dnsmasq process finished the _enabled function correctly, because this won't check the status of the process, but this another issue).

In PY2, due to the amount of "root" commands executed (in Queens, the pyroute2 migration was starting) and the problem this Python version has with process.Popen() (15x slower than PY3), the DHCP start operation in a heavy loaded server takes more than this time. process.Popen() is used to communicate to rootwrap daemon to execute the "root" commands.

As a quick fix for this regression, we need to increase the timeout time as described in the bug description. This won't have the same impact when using PY3.

Regards.

[1] https://review.opendev.org/#/q/I18d73b787fa3ab8803e12d5e5eb2bb7109205aba

Revision history for this message
Olli Tuovinen (otuovinen) wrote :
Download full text (9.5 KiB)

Hello,

I managed to reproduce this in a test environment with 1000 dhcp enabled subnets.
Python 3.6 is faster (or at least chattier), but the same can still be seen.

Same 100-300ms exec delays with rootwrap visible.

RH with collections+pip for deps, so paths look interesting.

2019-10-28 15:51:02.510 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0ac39b4a-4517-466c-8857-54e4c9e8d416', 'ip', '-o', 'link', 'show', 'tapd1a110f8-06'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:02.807 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-01f0c21c-af3f-483f-b44c-77fa39862430', 'ip', '-o', 'link', 'show', 'tap04086413-55'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.088 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0ad0ae7b-3632-48be-9322-1177a82c17a6', 'ip', '-o', 'link', 'show', 'tapef523a14-5e'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.383 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-05063b4c-a030-4b92-af53-f47b19b64b9b', 'ip', '-o', 'link', 'show', 'tap9889d494-86'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.688 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-070c6f88-5000-4a01-a100-36c914a6e219', 'ip', '-o', 'link', 'show', 'tap0e6dc9db-49'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:03.981 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-04b2ae20-eecf-481b-bf63-a8809b454e97', 'ip', '-o', 'link', 'show', 'tapcc06a136-82'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:04.270 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0972f344-e157-48ec-a94d-25e5b63fc517', 'ip', '-o', 'link', 'show', 'tap2b239e1c-a7'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:04.555 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-010bf420-3113-4a5d-9598-7726777468f2', 'ip', 'link', 'set', 'tapde888fdc-0e', 'up'] execute_rootwrap_daemon /opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2019-10-28 15:51:04.857 680502 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-0840e054-f880-4404-a0e6-5c96a3088626', 'ip', '-o', 'link', 'show', 'tap49ac8943-28'] execute_rootwrap_daemon /opt/rh/rh-python36/r...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/692160

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/692376

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/692377

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/692378

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/692379

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/692160
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aedc09917680bc77501f4b57142e73768e3d06ad
Submitter: Zuul
Branch: master

commit aedc09917680bc77501f4b57142e73768e3d06ad
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/692376
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=44c838e25f8d85063cca52c9581c423539604f6c
Submitter: Zuul
Branch: stable/train

commit 44c838e25f8d85063cca52c9581c423539604f6c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/692378
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ff0b7b00194d7ee38892c6c08de14f4da056ab2d
Submitter: Zuul
Branch: stable/rocky

commit ff0b7b00194d7ee38892c6c08de14f4da056ab2d
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/692379
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dfbfa8ced3c96ec5e2649f47d8ac9f4d5b0e168e
Submitter: Zuul
Branch: stable/queens

commit dfbfa8ced3c96ec5e2649f47d8ac9f4d5b0e168e
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/692377
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1bf02756f4ae5a6afb75ae171bc7c092a03a7bb6
Submitter: Zuul
Branch: stable/stein

commit 1bf02756f4ae5a6afb75ae171bc7c092a03a7bb6
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Oct 30 14:35:37 2019 +0000

    Increase timeout when waiting for dnsmasq enablement

    In the reported bug, a regression was introduced in [1] when limiting
    the time to have a "dnsmasq" process enabled. It has been reported, as
    documented in the bug, that in older versions (Queens), using Python 2
    [2] and older versions of "ip_lib" (not implementing most of the
    commands using Pyroute2), that the time needed to spawn a "dnsmasq"
    process exceeds the default 60 seconds defined in
    "common_utils.wait_until_true".

    This patch increases this time by a reasonable 300 seconds.

    [1] https://review.opendev.org/#/c/643732
    [2] https://bugs.python.org/issue35757

    Change-Id: I2d8693145da72825876b951f2d10afe9ca28ff6e
    Closes-Bug: #1849676
    (cherry picked from commit aedc09917680bc77501f4b57142e73768e3d06ad)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.6

This issue was fixed in the openstack/neutron 13.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.4

This issue was fixed in the openstack/neutron 14.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.1

This issue was fixed in the openstack/neutron 15.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.