DHCP agent timing out spawning dnsmasq

Bug #1848187 reported by Will Szumski
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Won't Fix
Medium
Will Szumski

Bug Description

This is using `stable/stein` deployed with kolla. I'm seeing timeouts when spawning the DHCP process which seems to result in an inconsistent state.

  2019-10-14 18:33:04.896 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'sysctl', '-w', 'net.ipv4.conf.all.promote_secondaries=1'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:08.981 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'sysctl', '-w', 'net.ipv4.ip_nonlocal_bind=1'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:13.924 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'sysctl', '-w', 'net.ipv6.conf.default.accept_ra=0'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:35.364 6 DEBUG oslo_concurrency.lockutils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Acquired lock "iptables-qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:265
  2019-10-14 18:33:35.366 6 DEBUG oslo_concurrency.lockutils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Acquired external semaphore "iptables-qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:272
  2019-10-14 18:33:35.366 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'iptables-save'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:39.431 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'iptables-restore', '-n', '-w', '10', '-W', '200000'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:48.335 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'ip6tables-save'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:50.865 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'ip6tables-restore', '-n', '-w', '10', '-W', '200000'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:52.677 6 DEBUG neutron.agent.linux.iptables_manager [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] IPTablesManager.apply completed with success. 94 iptables commands were issued _apply_synchronized /usr/lib/python2.7/site-packages/neutron/agent/linux/iptables_manager.py:620
  2019-10-14 18:33:52.678 6 DEBUG oslo_concurrency.lockutils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Releasing lock "iptables-qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:281
  2019-10-14 18:33:56.488 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'ip', '-4', 'route', 'list', 'dev', 'tapc29a1605-1a'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:33:59.880 6 DEBUG neutron.agent.linux.dhcp [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Setting IPv4 gateway for dhcp netns on net 432513ea-ddb7-4400-af94-b0017bb50df7 to 10.100.0.1 _set_default_route_ip_version /usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py:1262
  2019-10-14 18:33:59.881 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'ip', '-4', 'route', 'replace', 'default', 'via', '10.100.0.1', 'dev', 'tapc29a1605-1a'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:34:01.536 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'ip', '-6', 'route', 'list', 'dev', 'tapc29a1605-1a'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:34:02.369 6 ERROR neutron.agent.linux.dhcp [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Failed to start DHCP process for network 432513ea-ddb7-4400-af94-b0017bb50df7: WaitTimeout: Timed out after 60 seconds
  2019-10-14 18:34:02.370 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Unable to access /var/lib/neutron/external/pids/432513ea-ddb7-4400-af94-b0017bb50df7.pid.haproxy get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
  2019-10-14 18:34:02.370 6 DEBUG neutron.agent.metadata.driver [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] haproxy_cfg =
  2019-10-14 18:34:02.371 6 DEBUG neutron.agent.linux.utils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-432513ea-ddb7-4400-af94-b0017bb50df7', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/432513ea-ddb7-4400-af94-b0017bb50df7.conf'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87
  2019-10-14 18:34:04.003 6 DEBUG oslo_concurrency.lockutils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Lock "resize_greenpool" acquired by "neutron.agent.dhcp.agent._resize_process_pool" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:327
  2019-10-14 18:34:04.003 6 DEBUG oslo_concurrency.lockutils [req-14bec9b8-4264-401d-bcfd-fb588458a0a8 - - - - -] Lock "resize_greenpool" released by "neutron.agent.dhcp.agent._resize_process_pool" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339
  clush: centos@bm-controller[01-02] (2): exited with exit code 1'

Notice that the calls to setup the network namespace are taking a longer than 60 seconds. It appears we never get past: https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L228, to actually spawn the process and that no retries are attemped.

Tags: l3-ipam-dhcp
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/688693

Changed in neutron:
assignee: nobody → Will Szumski (willjs)
status: New → In Progress
Hongbin Lu (hongbin.lu)
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Medium
summary: - openvswitch agent timing out spawning dnsmasq
+ DHCP agent timing out spawning dnsmasq
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

It sounds for me like issue with many opened files in rootwrap daemon.
Please check old patches in TripleO heat templates which were proposed as workaround to similar problem (IIRC): https://review.opendev.org/#/c/560991/ - this one set ulimit in containers to 1024 but it appeared that it's too low, so patch https://review.opendev.org/#/c/651137/ increased it to 16384.
Can You check how this is set in Your containers?

Revision history for this message
Will Szumski (willjs) wrote :

ulimit in neutron_dhcp_agent container:

(neutron-dhcp-agent)[neutron@bm-controller01 /]$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1029556
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Will Szumski (<email address hidden>) on branch: master
Review: https://review.opendev.org/688693

Revision history for this message
Will Szumski (willjs) wrote :

I've made a kolla-ansible bug for the rootwrap slowness:

https://bugs.launchpad.net/kolla-ansible/+bug/1848737

I still think this represents a bug as the failure doesn't seem to be handled gracefully i.e why don't we try an spawn dnsmasq on the second request. It seems to get into a broken state.

Revision history for this message
Brian Haley (brian-haley) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: In Progress → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.