dhcp agent dnsmasq process mgmt race condition between launch and operations

Bug #1824802 reported by Brent Eagles
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Slawek Kaplonski

Bug Description

There may be a race condition involving dnsmasq startup and port operations. What appears to happen is that dnsmasq is started but the pid file isn't available when a port change occurs. The dhcp agent then attempts to start a new dnsmasq instance even though the previous one is the process of being loaded. Tricky to reproduce manually but does seem to occur in tempest tests.

Note: this is currently being observed in TripleO ML2/OVS tests. As dnsmasq is run in a container with a well-defined-name so when the second container launch fails to because of a naming collision.

Brent Eagles (beagles)
summary: - dhcp agent dnsmasq race condition
+ dhcp agent dnsmasq process mgmt race condition between launch and
+ operations
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I was today investigating logs from http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_934 once again.

It looks that dnsmasq process for network cbc2d3df-fcae-42b3-9d9b-248526a1a2f1 was first started properly at 14:18:34.803: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_34_803

Than some "Trigger reload_allocations for port" was logged at 14:18:36.890: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_890

That leads to reload of dnsmasq process which is done by sending SIGHUP. It was like that because external_process.ProcessManager.enable() was called but process was active so it called reload_cfg() method. See https://github.com/openstack/neutron/blob/master/neutron/agent/linux/external_process.py#L80
It happend at 14:18:36.895: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_36_895

But than, for some reason full sync was triggered and there was quickly send SIGKILL to the same process. It was at 14:18:38.436 : http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_38_436

and next attempt to start process at 14:18:38.883: http://logs.openstack.org/97/631497/7/check/tripleo-ci-centos-7-scenario007-standalone/42068d9/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz#_2019-04-10_14_18_38_883
And this one failed.

It looks for me very similar to bug https://bugs.launchpad.net/neutron/+bug/1811126 which was fixed recently by https://github.com/openstack/neutron/commit/157e09e6af758b7669fbe5a8cdb0b1969f04661a

I'm not sure exactly what version of Neutron TripleO is using in this kind of job but can You maybe check if it was run with this patch or without it still?

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Please also see comment https://bugs.launchpad.net/neutron/+bug/1760047/comments/24 - it may be that patch https://review.openstack.org/626830 is triggering this issue now more often.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/658307

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/658307
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5c433a027daab2419d974820589205253871885e
Submitter: Zuul
Branch: master

commit 5c433a027daab2419d974820589205253871885e
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/661010

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/661011

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/661012

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/661012
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=55055172958bd278020d08cd80d7f7641a3f7520
Submitter: Zuul
Branch: stable/queens

commit 55055172958bd278020d08cd80d7f7641a3f7520
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802
    (cherry picked from commit 5c433a027daab2419d974820589205253871885e)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/661010
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=89e9c45e866190eb5526c51d14c234da7e2ec041
Submitter: Zuul
Branch: stable/stein

commit 89e9c45e866190eb5526c51d14c234da7e2ec041
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802
    (cherry picked from commit 5c433a027daab2419d974820589205253871885e)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/661011
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=946faaf361003a0a67081be28565d3815c354510
Submitter: Zuul
Branch: stable/rocky

commit 946faaf361003a0a67081be28565d3815c354510
Author: Slawek Kaplonski <email address hidden>
Date: Fri May 10 10:47:00 2019 +0200

    [DHCP] Don't resync network if same port is alredy in cache

    Sometimes when port is created on dhcp agent's side, it may happend
    that same port is already in network cache.
    Before this patch if port with same IP address was already in cache,
    resync was rescheduled because of duplicate IPs found in cache.

    Now resync will be scheduled only if duplicate IP address belongs to
    port with different MAC address or different id.

    Change-Id: I23afbc10725f5dc78e3c63e6e505ef89ba8dc4a5
    Closes-Bug: #1824802
    (cherry picked from commit 5c433a027daab2419d974820589205253871885e)

tags: added: in-stable-rocky
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.4

This issue was fixed in the openstack/neutron 13.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.2

This issue was fixed in the openstack/neutron 14.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.1.0

This issue was fixed in the openstack/neutron 12.1.0 release.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.