dhcp agent dnsmasq process mgmt race condition between launch and operations
Bug #1824802 reported by
Brent Eagles
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | High | Slawek Kaplonski |
Bug Description
There may be a race condition involving dnsmasq startup and port operations. What appears to happen is that dnsmasq is started but the pid file isn't available yet when a port change occurs. The DHCP agent then attempts to start a new dnsmasq instance even though the previous one is still in the process of being loaded. This is tricky to reproduce manually but does seem to occur in tempest tests.
Note: this is currently being observed in TripleO ML2/OVS tests. dnsmasq is run in a container with a well-defined name, so the second container launch fails because of a naming collision.
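A minimal sketch of the suspected race, heavily simplified and not the actual neutron code (function names here are illustrative): the agent decides whether dnsmasq is "active" by reading a pid file, so there is a window between spawning dnsmasq and the pid file appearing during which a second spawn can be attempted.

```python
import os


def get_pid(pid_file):
    """Return the pid from the pid file, or None if it isn't there yet."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_active(pid_file):
    """Liveness check based solely on the pid file."""
    pid = get_pid(pid_file)
    if pid is None:
        # pid file not written yet -> process *looks* inactive,
        # even if a launch is already in flight.
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, sends nothing
        return True
    except OSError:
        return False


def enable(pid_file, spawn, reload_cfg):
    """Reload if the process looks active, otherwise spawn it."""
    if is_active(pid_file):
        reload_cfg()  # process up: reload config (e.g. via SIGHUP)
    else:
        spawn()       # RACE: a first spawn may still be starting up
```

If a port event calls `enable()` in that window, `is_active()` returns False and a second dnsmasq launch is attempted, which is exactly the container-name collision described above.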
summary: |
- dhcp agent dnsmasq race condition
+ dhcp agent dnsmasq process mgmt race condition between launch and
+ operations
Changed in neutron: | |
importance: | Undecided → High |
status: | New → Confirmed |
Changed in neutron: | |
assignee: | nobody → Slawek Kaplonski (slaweq) |
Changed in neutron: | |
status: | Confirmed → In Progress |
tags: | added: neutron-proactive-backport-potential |
tags: | removed: neutron-proactive-backport-potential |
Today I was once again investigating logs from http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_934
It looks like the dnsmasq process for network cbc2d3df-fcae-42b3-9d9b-248526a1a2f1 was first started properly at 14:18:34.803: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_34_803
Then a "Trigger reload_allocations for port" message was logged at 14:18:36.890: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_890
That leads to a reload of the dnsmasq process, which is done by sending SIGHUP. It went that way because external_process.ProcessManager.enable() was called while the process was active, so it called the reload_cfg() method. See https://github.com/openstack/neutron/blob/master/neutron/agent/linux/external_process.py#L80
It happened at 14:18:36.895: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_895
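The two code paths in this sequence can be sketched as follows (a hedged illustration, not the exact neutron API; the function names are assumptions): a reload delivers SIGHUP to the running pid, while a restart triggered by a full sync sends SIGKILL and then spawns a fresh process. If that spawn races with a still-launching container, the second launch fails on the name clash.

```python
import os
import signal


def reload_cfg(pid):
    """Reload path: dnsmasq re-reads host/opts files on SIGHUP."""
    os.kill(pid, signal.SIGHUP)


def disable(pid):
    """Teardown path used by a full sync: hard kill of the process."""
    os.kill(pid, signal.SIGKILL)


def restart(pid, spawn):
    """Full-sync restart: kill, then launch a new instance."""
    disable(pid)
    spawn()  # may collide with an in-flight launch of the same name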
But then, for some reason, a full sync was triggered and SIGKILL was quickly sent to the same process. That was at 14:18:38.436: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_38_436
and the next attempt to start the process was at 14:18:38.883: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_38_883
And this one failed.
It looks to me very similar to bug https://bugs.launchpad.net/neutron/+bug/1811126, which was fixed recently by https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a
I'm not sure exactly what version of Neutron TripleO is using in this kind of job, but could you maybe check whether it was run with this patch or still without it?