neutron dhcp agent sometimes fails when restarting dnsmasq sidecar w/podman

Bug #1823377 reported by Brent Eagles
Affects: tripleo
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

The permissions on the pid files seem to be wrong. From the dhcp agent logs:

2019-04-05 13:57:41.908 41253 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
2019-04-05 13:57:41.909 41253 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-6445c6c0-76a4-46da-95df-b58aafca742e', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/host', '--addn-hosts=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-interfaces', '--dhcp-range=set:tag0,192.168.40.0,static,255.255.255.0,86400s', '--dhcp-option-force=option:mtu,1450', '--dhcp-lease-max=256', '--conf-file=', '--domain=localdomain'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-04-05 13:57:43.130 41253 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/6445c6c0-76a4-46da-95df-b58aafca742e.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
2019-04-05 13:57:43.131 41253 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/6445c6c0-76a4-46da-95df-b58aafca742e.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
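
For context, the "Unable to access ..." lines come from a small helper in neutron.agent.linux.utils (get_value_from_file) that simply tries to read a pid file and logs at DEBUG when the read fails, e.g. because the file does not exist yet or is unreadable by the neutron user. A paraphrased sketch, not the verbatim neutron source:

import logging

LOG = logging.getLogger(__name__)

def get_value_from_file(filename, converter=None):
    # Returns the (optionally converted) file contents, or None if the file
    # cannot be read; an unreadable or missing pid file only produces the
    # DEBUG message seen in the logs above.
    try:
        with open(filename) as f:
            try:
                return converter(f.read()) if converter else f.read()
            except ValueError:
                LOG.error('Unable to convert value in %s', filename)
    except IOError:
        LOG.debug('Unable to access %s', filename)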

Brent Eagles (beagles)
Changed in tripleo:
milestone: none → stein-rc1
status: New → Triaged
Changed in tripleo:
importance: Undecided → High
tags: added: stein-backport-potential
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Sorry, was that already a Train job deploying that undercloud? Anyway, I believe the issue has been present since Stein.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Confirmed with a standalone deployed with docker CLI from current-tripleo: http://paste.openstack.org/show/37u6IjtoFfzZHwguBmhq/

The pid files are owned by root, so perhaps that is why they cannot be accessed by neutron.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

See how the wrapper containers got restarted (not immediately!) upon restarting the main dhcp agent container: http://paste.openstack.org/show/ZvfLhVtPOmPu6lkMwd8W/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

See how the file stats change upon restarting the containers: http://paste.openstack.org/show/UgFCuReb5FEjlw7kTDvU/

(disregard lines 92 onward; the file paths do not mismatch, only the owners differ)

Revision history for this message
Brent Eagles (beagles) wrote :

From what I can tell so far, the restarting of the sidecar is deliberate on neutron's part. If it is doing something new and unexpected with dnsmasq, it might also be causing some kind of race condition, which would explain some CI failures we are seeing where the dnsmasq container is already running and the agent tries to start it again.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

On the wrappers side, we should use rootwrap and avoid running the containers as the root user. That should resolve the permissions issue and fix at least the part of this bug where sidecars restart unexpectedly because of mundane access errors rather than tricky race conditions in Neutron.
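
A minimal sketch of that idea, assuming a podman-based wrapper script. The image name, user, and volume options are illustrative only, and the network-namespace and privilege handling the real tripleo wrappers need is glossed over:

import subprocess

def start_dnsmasq_container(name, dnsmasq_cmd):
    # Launch dnsmasq in a sidecar container as the neutron user so that the
    # pid/lease files it writes under /var/lib/neutron stay readable by the
    # dhcp agent instead of ending up owned by root.
    cmd = [
        'podman', 'run', '--detach', '--name', name,
        '--user', 'neutron',                               # avoid root-owned pid files
        '--volume', '/var/lib/neutron:/var/lib/neutron:z',
        'neutron-dnsmasq:latest',                          # hypothetical image name
    ] + list(dnsmasq_cmd)
    return subprocess.check_output(cmd).decode().strip()  # container id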

Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651505

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/651524

Revision history for this message
Brent Eagles (beagles) wrote : Re: restarting the neutron dhcp agent container restarts dhcp sidecars

Data point: The dhcp agent restarts dnsmasq on agent restart in rocky as well and there are no permissions issues, so that in itself does not appear to be the problem.

summary: - restarting the neutron dhcp agent container restarts dhcp sidecars
+ neutron dhcp agent sometimes fails when restarting dnsmasq sidecar
+ w/podman
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/651524

Revision history for this message
Brent Eagles (beagles) wrote :

tl;dr it looks to me like neutron is simply not waiting long enough for dnsmasq to start and write its pid file (a rough polling sketch follows the observations below).

Observations:
* Lines such as
2019-04-11 13:20:35.538 503738 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/6445c6c0-76a4-46da-95df-b58aafca742e.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261

are probably bogus: force_metadata is false in my standalone environment, so I don't know why it's checking for an haproxy pid here. The haproxy instance belongs to the l3 agent.

* "Unable to access" under the /var/lib/neutron/dhcp *might* be valid if it is a new network (I need to double check this).

* With debug enabled we can see that the dhcp agent is able to access the pid file, otherwise it would not be able to kill the dnsmasq process, e.g.:

2019-04-11 12:40:34.753 466422 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['kill', '-9', '130075'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103

What I find misleading is that it is immediately followed by:
2019-04-11 12:40:35.191 466422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261
2019-04-11 12:40:35.193 466422 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/6445c6c0-76a4-46da-95df-b58aafca742e/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:261

Eventually things settle down, so I'm left wondering if a lot of this is just noise.

* Looking at working systems on Rocky, it looks like there are times when dnsmasq is started, killed, and started again - which the sidecar wrapper would do if it found an unexpected container running.
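
A minimal sketch of the "wait longer for the pid file" idea from the tl;dr above - this is not neutron code, and the timeout and poll interval are arbitrary placeholders:

import time

def wait_for_pid_file(pid_path, timeout=10.0, interval=0.5):
    # Poll until the pid file exists and contains a pid, instead of reading it
    # once right after launching dnsmasq; returns the pid or None on timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(pid_path) as f:
                content = f.read().strip()
            if content:
                return int(content)
        except (IOError, ValueError):
            pass  # not created yet, or only partially written
        time.sleep(interval)
    return None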

Changed in tripleo:
status: In Progress → Triaged
assignee: Bogdan Dobrelya (bogdando) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/651505
Reason: With debug enabled we can see that the dhcp agent is able to access
the pid file otherwise it would not be able to kill the dnsmasq process.
e.g.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Brent, perhaps we need another bug for the unexpected restarts of the dnsmasq (et al.) wrappers when their associated agent containers are restarted. That issue can still be reproduced easily.

Revision history for this message
Brent Eagles (beagles) wrote :

@Bogdan - it appears that neutron restarts the dnsmasq container "on purpose", so if it is a bug, it isn't ours.

I've filed these bugs:

One for the apparent confusing log spamming:
https://bugs.launchpad.net/neutron/+bug/1824799 neutron dhcp agent looks for external pid unexpectedly

One for the behavior we are seeing in the scenario 7 logs (dhcp agent is relaunching a dnsmasq instance even though there is one already loaded):
https://bugs.launchpad.net/neutron/+bug/1824802 dhcp agent dnsmasq process mgmt race condition between launch and operations

In the latter bug, the logs seem to indicate that neutron makes 3 attempts to run dnsmasq - in one case seemingly 1 second apart. It could be that we are adding a bit of extra "lag" or async behavior because of the container launch but I'm not sure how to guard against that ... unless we put some kind of sleep or polling on the container status at the end of the wrapper and hope for the best (ugh)
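
For what it's worth, a minimal sketch of that "poll the container status at the end of the wrapper" idea, assuming a podman-based wrapper. The container name and timeouts are placeholders; podman inspect --format '{{.State.Running}}' is standard podman, but the rest is illustration rather than the actual tripleo wrapper:

import subprocess
import time

def wait_until_running(container, timeout=15.0, interval=1.0):
    # Block at the end of the wrapper until podman reports the sidecar as
    # running, so the agent does not race ahead of the container launch.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            out = subprocess.check_output(
                ['podman', 'inspect', '--format', '{{.State.Running}}', container])
            if out.strip() == b'true':
                return True
        except subprocess.CalledProcessError:
            pass  # container not created yet
        time.sleep(interval)
    return False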

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Thank you Brent, so then we have nothing to fix here in tripleo, right?

Changed in tripleo:
status: Triaged → Won't Fix
Revision history for this message
Brent Eagles (beagles) wrote :

@Bogdan - correct!
