dhcp agent dnsmasq process mgmt race condition between launch and operations

Bug #1824802 reported by Brent Eagles on 2019-04-15
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
High
Slawek Kaplonski

Bug Description

There may be a race condition involving dnsmasq startup and port operations. What appears to happen is that dnsmasq is started but the pid file isn't available when a port change occurs. The dhcp agent then attempts to start a new dnsmasq instance even though the previous one is the process of being loaded. Tricky to reproduce manually but does seem to occur in tempest tests.

Note: this is currently being observed in TripleO ML2/OVS tests. As dnsmasq is run in a container with a well-defined-name so when the second container launch fails to because of a naming collision.

Brent Eagles (beagles) on 2019-04-15
summary: - dhcp agent dnsmasq race condition
+ dhcp agent dnsmasq process mgmt race condition between launch and
+ operations
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Slawek Kaplonski (slaweq) wrote :

I was today investigating logs from http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_934 once again.

It looks that dnsmasq process for network cbc2d3df-fcae-42b3-9d9b-248526a1a2f1 was first started properly at 14:18:34.803: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_34_803

Than some "Trigger reload_allocations for port" was logged at 14:18:36.890: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_890

That leads to reload of dnsmasq process which is done by sending SIGHUP. It was like that because external_process.ProcessManager.enable() was called but process was active so it called reload_cfg() method. See https://github.com/openstack/neutron/blob/master/neutron/agent/linux/external_process.py#L80
It happend at 14:18:36.895: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_895

But than, for some reason full sync was triggered and there was quickly send SIGKILL to the same process. It was at 14:18:38.436 : http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_38_436

and next attempt to start process at 14:18:38.883: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_38_883
And this one failed.

It looks for me very similar to bug https://bugs.launchpad.net/neutron/+bug/1811126 which was fixed recently by https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a

I'm not sure exactly what version of Neutron TripleO is using in this kind of job but can You maybe check if it was run with this patch or without it still?

Slawek Kaplonski (slaweq) wrote :

Please also see comment https://bugs.launchpad.net/neutron/+bug/1760047/comments/24 - it may be that patch https://review.openstack.org/626830 is triggering this issue now more often.

Changed in neutron:
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/658307
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5c433a027daab2419d974820589205253871885e
Submitter: Zuul
Branch: master

commit 5c433a027daab2419d974820589205253871885e
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/661012
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=55055172958bd278020d08cd80d7f7641a3f7520
Submitter: Zuul
Branch: stable/queens

commit 55055172958bd278020d08cd80d7f7641a3f7520
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802
    (cherry picked from commit 5c433a027daab2419d974820589205253871885e)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/661010
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=89e9c45e866190eb5526c51d14c234da7e2ec041
Submitter: Zuul
Branch: stable/stein

commit 89e9c45e866190eb5526c51d14c234da7e2ec041
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802
    (cherry picked from commit 5c433a027daab2419d974820589205253871885e)

tags: added: in-stable-stein
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers