neutron dhcp agent sometimes fails when restarting dnsmasq sidecar w/podman

Bug #1823377 reported by Brent Eagles
Affects: tripleo
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

The permissions on the pid files seem to be wrong. From the dhcp agent logs:

2019-04-05 13:57:41.908 41253 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
2019-04-05 13:57:41.909 41253 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-6445c6c0-76a4-46da-95df-b58aafca742e', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/host', '--addn-hosts=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-interfaces', '--dhcp-range=set:tag0,192.168.40.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1450', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-04-05 13:57:43.130 41253 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/6445c6c0-76a4-46da-95df-b58aafca742e.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
2019-04-05 13:57:43.131 41253 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/6445c6c0-76a4-46da-95df-b58aafca742e.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
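
For context, the "Unable to access ..." lines come from a small helper in neutron.agent.linux.utils (get_value_from_file) that simply tries to read a pid file and logs at DEBUG when the read fails, e.g. because the file does not exist yet or is unreadable by the neutron user. A paraphrased sketch, not the verbatim neutron source:

import logging

LOG = logging.getLogger(__name__)

def get_value_from_file(filename, converter=None):
    # Returns the (optionally converted) file contents, or None if the file
    # cannot be read; an unreadable or missing pid file only produces the
    # DEBUG message seen in the logs above.
    try:
        with open(filename) as f:
            try:
                return converter(f.read()) if converter else f.read()
            except ValueError:
                LOG.error('Unable to convert value in %s', filename)
    except IOError:
        LOG.debug('Unable to access %s', filename)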

Brent Eagles (beagles)
Changed in tripleo:
milestone: none → stein-rc1
status: New → Triaged
Changed in tripleo:
importance: Undecided → High
tags: added: stein-backport-potential
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Sorry, was that already a Train job deploying that undercloud? Anyway, I believe the issue has been present since Stein.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Confirmed with a standalone deployed with docker CLI from current-tripleo: http://paste.openstack.org/show/37u6IjtoFfzZHwguBmhq/

The pid files are owned by root, so perhaps that is why they cannot be accessed by neutron.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

See how the wrapper containers got restarted (not immediately!) upon restarting the main dhcp agent container: http://paste.openstack.org/show/ZvfLhVtPOmPu6lkMwd8W/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

See how the file stats change upon restarting the containers: http://paste.openstack.org/show/UgFCuReb5FEjlw7kTDvU/

(disregard lines 92 onward; the file paths do not mismatch, only the owners differ)

Revision history for this message
Brent Eagles (beagles) wrote :

From what I can tell so far, the restarting of the sidecar is deliberate on neutron's part. If it is doing something new and unexpected with dnsmasq, it might also be causing some kind of race condition, which would explain some CI failures we are seeing where the dnsmasq container is already running and the agent tries to start it again.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

On the wrappers side, we should use rootwrap and avoid running the containers as the root user. That should resolve the permissions issue and fix at least the part of this bug where sidecars restart unexpectedly because of mundane access errors rather than tricky race conditions in Neutron.
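
A minimal sketch of that idea, assuming a podman-based wrapper script. The image name, user, and volume options are illustrative only, and the network-namespace and privilege handling the real tripleo wrappers need is glossed over:

import subprocess

def start_dnsmasq_container(name, dnsmasq_cmd):
    # Launch dnsmasq in a sidecar container as the neutron user so that the
    # pid/lease files it writes under /var/lib/neutron stay readable by the
    # dhcp agent instead of ending up owned by root.
    cmd = [
        'podman', 'run', '--detach', '--name', name,
        '--user', 'neutron',                               # avoid root-owned pid files
        '--volume', '/var/lib/neutron:/var/lib/neutron:z',
        'neutron-dnsmasq:latest',                          # hypothetical image name
    ] + list(dnsmasq_cmd)
    return subprocess.check_output(cmd).decode().strip()  # container id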

Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651505

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/651524

Revision history for this message
Brent Eagles (beagles) wrote : Re: restarting the neutron dhcp agent container restarts dhcp sidecars

Data point: The dhcp agent restarts dnsmasq on agent restart in rocky as well and there are no permissions issues, so that in itself does not appear to be the problem.

summary: - restarting the neutron dhcp agent container restarts dhcp sidecars
+ neutron dhcp agent sometimes fails when restarting dnsmasq sidecar
+ w/podman
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/651524

Revision history for this message
Brent Eagles (beagles) wrote :

tl;dr it looks to me like neutron is simply not waiting long enough for dnsmasq to start and write its pid file (a rough polling sketch follows the observations below).

Observations:
* Lines such as
2019-04-11 13:20:35.538 503738 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/6445c6c0-76a4-46da-95df-b58aafca742e.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261

are probably bogus: force_metadata is false in my standalone environment, so I don't know why it's checking for an haproxy pid here. The haproxy instance belongs to the l3 agent.

* "Unable to access" under the /var/lib/neutron/dhcp *might* be valid if it is a new network (I need to double check this).

* With debug enabled we can see that the dhcp agent is able to access the pid file, otherwise it would not be able to kill the dnsmasq process, e.g.:

2019-04-11 12:40:34.753 466422 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['kill', '-9', '130075'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103

What I find misleading is that it is immediately followed by:
2019-04-11 12:40:35.191 466422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
2019-04-11 12:40:35.193 466422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261

Eventually things settle down, so I'm left wondering if a lot of this is just noise.

* Looking at working systems on Rocky, it looks like there are times when dnsmasq is started, killed, and started again - which the sidecar wrapper would do if it found an unexpected container running.
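
A minimal sketch of the "wait longer for the pid file" idea from the tl;dr above - this is not neutron code, and the timeout and poll interval are arbitrary placeholders:

import time

def wait_for_pid_file(pid_path, timeout=10.0, interval=0.5):
    # Poll until the pid file exists and contains a pid, instead of reading it
    # once right after launching dnsmasq; returns the pid or None on timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(pid_path) as f:
                content = f.read().strip()
            if content:
                return int(content)
        except (IOError, ValueError):
            pass  # not created yet, or only partially written
        time.sleep(interval)
    return None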

Changed in tripleo:
status: In Progress → Triaged
assignee: Bogdan Dobrelya (bogdando) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/651505
Reason: With debug enabled we can see that the dhcp agent is able to access
the pid file otherwise it would not be able to kill the dnsmasq process.
e.g.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Brent, perhaps we need another bug for the unexpected restarts of the dnsmasq (et al.) wrappers when their associated agent containers are restarted. That issue can still be reproduced easily.

Revision history for this message
Brent Eagles (beagles) wrote :

@Bogdan - it appears that neutron restarts the dnsmasq container "on purpose", so if it is a bug, it isn't ours.

I've filed these bugs:

One for the apparent confusing log spamming:
https://bugs.launchpad.net/neutron/+bug/1824799 neutron dhcp agent looks for external pid unexpectedly

One for the behavior we are seeing in the scenario 7 logs (dhcp agent is relaunching a dnsmasq instance even though there is one already loaded):
https://bugs.launchpad.net/neutron/+bug/1824802 dhcp agent dnsmasq process mgmt race condition between launch and operations

In the latter bug, the logs seem to indicate that neutron makes 3 attempts to run dnsmasq - in one case seemingly 1 second apart. It could be that we are adding a bit of extra "lag" or async behavior because of the container launch but I'm not sure how to guard against that ... unless we put some kind of sleep or polling on the container status at the end of the wrapper and hope for the best (ugh)
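
For what it's worth, a minimal sketch of that "poll the container status at the end of the wrapper" idea, assuming a podman-based wrapper. The container name and timeouts are placeholders; podman inspect --format '{{.State.Running}}' is standard podman, but the rest is illustration rather than the actual tripleo wrapper:

import subprocess
import time

def wait_until_running(container, timeout=15.0, interval=1.0):
    # Block at the end of the wrapper until podman reports the sidecar as
    # running, so the agent does not race ahead of the container launch.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            out = subprocess.check_output(
                ['podman', 'inspect', '--format', '{{.State.Running}}', container])
            if out.strip() == b'true':
                return True
        except subprocess.CalledProcessError:
            pass  # container not created yet
        time.sleep(interval)
    return False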

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Thank you Brent, so then we have nothing to fix here in tripleo, right?

Changed in tripleo:
status: Triaged → Won't Fix
Revision history for this message
Brent Eagles (beagles) wrote :

@Bogdan - correct!
