Sometimes VMs can't get IP when spawned concurrently

Bug #1862315 reported by Oleg Bondarev
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev

Bug Description

Version: Stein
Scenario description:
Rally creates 60 VMs with 6 threads. Each thread:
 - creates a VM
 - pings it
 - if the ping succeeds, tries to reach the VM via ssh and execute a command, retrying for up to 2 minutes
 - if ssh succeeds, deletes the VM
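
For readers without Rally at hand, the per-thread steps above can be approximated with a thread pool. This is only a sketch of the load pattern, not the Rally scenario itself: the `cloud` object and its `create_vm`, `ping`, `ssh`, and `delete_vm` methods are hypothetical placeholders for whatever client is used.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SSH_TIMEOUT = 120  # seconds; the "2 minutes" from the scenario


def run_iteration(cloud, index, retry_delay=1.0):
    """One iteration: create a VM, ping it, ssh into it, delete it."""
    vm = cloud.create_vm(f"vm-{index}")        # hypothetical client call
    if not cloud.ping(vm):
        return "ping_failed"                   # the failure this bug is about
    deadline = time.monotonic() + SSH_TIMEOUT
    while time.monotonic() < deadline:
        if cloud.ssh(vm, "true"):              # any trivial remote command
            cloud.delete_vm(vm)                # deleted only after ssh works
            return "ok"
        time.sleep(retry_delay)
    return "ssh_failed"


def run_scenario(cloud, times=60, concurrency=6):
    """Run `times` iterations spread over `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda i: run_iteration(cloud, i), range(times)))
```

Because VMs are deleted as soon as ssh succeeds, ports (and their IPs) are created and released continuously, which is the churn that matters for the rest of this report.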

For some VMs the ping fails. The console log shows that the VM failed to get an IP address from DHCP.

tcpdump on the corresponding DHCP port shows the VM's DHCP requests, but dnsmasq does not reply.
From the dnsmasq logs:

Feb 6 00:15:43 dnsmasq[4175]: read /var/lib/neutron/dhcp/da73026e-09b9-4f8d-bbdd-84d89c2487b2/addn_hosts - 28 addresses
Feb 6 00:15:43 dnsmasq[4175]: duplicate dhcp-host IP address 10.2.0.194 at line 28 of /var/lib/neutron/dhcp/da73026e-09b9-4f8d-bbdd-84d89c2487b2/host
...
Feb 6 00:15:48 dnsmasq-dhcp[4175]: 1436802562 DHCPDISCOVER(tap7216a777-13) fa:16:3e:b1:a7:f2 no address available
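
To confirm what dnsmasq is complaining about, the host file can be scanned for IPs that appear on more than one line. The helper below is a rough sketch: it assumes each line is a comma-separated dhcp-host entry (MAC, hostname, IP, possibly extra fields) and simply treats any field that parses as an IP address as the entry's address. The function name is made up for this example.

```python
import ipaddress
from collections import defaultdict


def find_duplicate_ips(host_file_text):
    """Return {ip: [line numbers]} for IPs listed on more than one line."""
    lines_by_ip = defaultdict(list)
    for lineno, line in enumerate(host_file_text.splitlines(), start=1):
        for field in line.strip().split(","):
            candidate = field.strip("[]")      # IPv6 entries may be bracketed
            try:
                ip = ipaddress.ip_address(candidate)
            except ValueError:
                continue                       # MAC, hostname, tag, etc.
            lines_by_ip[str(ip)].append(lineno)
    return {ip: nums for ip, nums in lines_by_ip.items() if len(nums) > 1}
```

Usage, with the path taken from the log above: `find_duplicate_ips(open("/var/lib/neutron/dhcp/da73026e-09b9-4f8d-bbdd-84d89c2487b2/host").read())`.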

So something must be wrong with the neutron-dhcp-agent network cache.

From neutron-dhcp-agent log:

2020-02-06 00:15:20.282 40 DEBUG neutron.agent.dhcp.agent [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] Resync event has been scheduled _periodic_resync_helper /var/lib/openstack/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:276
2020-02-06 00:15:20.282 40 DEBUG neutron.common.utils [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] Calling throttled function clear wrapper /var/lib/openstack/lib/python3.6/site-packages/neutron/common/utils.py:102
2020-02-06 00:15:20.283 40 DEBUG neutron.agent.dhcp.agent [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] resync (da73026e-09b9-4f8d-bbdd-84d89c2487b2): ['Duplicate IP addresses found, DHCP cache is out of sync'] _periodic_resync_helper /var/lib/openstack/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:293

So the agent is aware that its cache for the network is invalid, but for some unknown reason the actual network resync starts only about 8 minutes later:

2020-02-06 00:23:55.297 40 INFO neutron.agent.dhcp.agent [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] Synchronizing state
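
The exact gap between the resync being scheduled and the sync actually starting can be computed from the two log timestamps quoted above. A small sketch:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the neutron-dhcp-agent log lines above.
resync_requested = datetime.strptime("2020-02-06 00:15:20.283", FMT)
sync_started = datetime.strptime("2020-02-06 00:23:55.297", FMT)

delay = sync_started - resync_requested
print(delay)  # 0:08:35.014000
```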

Oleg Bondarev (obondarev) wrote :

The DHCP agent's sync_state() function is supposed to take a write lock and proceed, but it seems to wait a long time for the lock because many regular update events keep arriving in the agent's queue (due to many concurrent port create/update/delete operations).
These update events take read locks.
A write lock is supposed to have higher priority than a read lock, but for some reason that doesn't work here.

Oleg Bondarev (obondarev) wrote :

Looking at ReaderWriterLock from the fasteners library, specifically the read_lock() function (which the DHCP agent eventually uses):
https://github.com/harlowja/fasteners/blob/0.15/fasteners/lock.py#L160-L202

The function is supposed to "wait until no active or pending writers", but from the code it looks like it only checks that the current thread is not among the pending writers and doesn't actually check that there are no other pending writers. I might be missing something, so I'm going to contact @harlowja.
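
For comparison, here is a minimal sketch of the behaviour that docstring describes: new readers wait while any writer is active or pending. This is not the fasteners implementation; the class name and internals are invented for illustration, and the lock is deliberately non-reentrant.

```python
import threading
from contextlib import contextmanager


class WriterPreferringRWLock:
    """Illustrative reader-writer lock where pending writers block new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._pending_writers = 0

    @contextmanager
    def read_lock(self):
        with self._cond:
            # Yield to ANY active or pending writer, not just this thread.
            while self._writer_active or self._pending_writers:
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write_lock(self):
        with self._cond:
            self._pending_writers += 1
            try:
                # Wait for current readers/writer to drain.
                while self._writer_active or self._active_readers:
                    self._cond.wait()
            finally:
                self._pending_writers -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()
```

With a lock like this, a steady stream of readers (port update events) cannot starve a waiting writer (sync_state()): once the writer is pending, later readers queue up behind it.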

description: updated
Oleg Bondarev (obondarev) wrote :

Also, commit https://review.opendev.org/#/c/626830/ introduced priorities for port notifications so that port delete has lower priority than port create. This leads to "Duplicate IP addresses" when an IP is reused quickly. Why not use the same priority for all port events (while still keeping PORT_CREATE_HIGH)?
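
The reordering effect can be shown with a toy model: a priority queue that is FIFO within a priority level, and a cache mapping port to IP. Everything here is a simplified assumption made for illustration (the `replay` function, the event tuples, and the numeric priorities are not taken from the agent code); lower numbers are processed first.

```python
import heapq
import itertools


def replay(events, priorities, initial_cache):
    """Process (kind, port_id, ip) events by priority, FIFO within a level.

    Returns the sorted IPs that two ports held at the same time in the cache.
    """
    counter = itertools.count()            # arrival order breaks priority ties
    queue = [(priorities[kind], next(counter), kind, port_id, ip)
             for kind, port_id, ip in events]
    heapq.heapify(queue)

    cache = dict(initial_cache)            # port_id -> ip
    duplicates = set()
    while queue:
        _, _, kind, port_id, ip = heapq.heappop(queue)
        if kind == "delete":
            cache.pop(port_id, None)
        else:
            # Another port still holding this IP means a duplicate entry.
            if any(p != port_id and i == ip for p, i in cache.items()):
                duplicates.add(ip)
            cache[port_id] = ip
    return sorted(duplicates)


# port-a is deleted and its IP is immediately reused by port-b.
initial = {"port-a": "10.2.0.194"}
events = [("delete", "port-a", "10.2.0.194"),
          ("create", "port-b", "10.2.0.194")]

print(replay(events, {"create": 1, "delete": 2}, initial))  # ['10.2.0.194']
print(replay(events, {"create": 1, "delete": 1}, initial))  # []
```

With delete at a lower priority, the create for port-b is handled while port-a is still cached, so the IP briefly appears twice. With equal priorities, arrival order is preserved and the duplicate never occurs.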

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/707077

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/707077
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a0bb5763b277d7360554fd64813fdce01244d2fa
Submitter: Zuul
Branch: master

commit a0bb5763b277d7360554fd64813fdce01244d2fa
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/708122

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/708123

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/708124

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/708125

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/708125
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=de29b49b132b84f07329780958b503edbd5c77f6
Submitter: Zuul
Branch: stable/queens

commit de29b49b132b84f07329780958b503edbd5c77f6
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/708124
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9499467caedfd2b30d2514f15b41cbde6e8a9fe6
Submitter: Zuul
Branch: stable/rocky

commit 9499467caedfd2b30d2514f15b41cbde6e8a9fe6
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/708122
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e894904a7d8dbb0e01d9c4d18b524de4b436ead3
Submitter: Zuul
Branch: stable/train

commit e894904a7d8dbb0e01d9c4d18b524de4b436ead3
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/708123
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5a0c3e1fddd9863ff4d515e153ec67c0c3bbcb22
Submitter: Zuul
Branch: stable/stein

commit 5a0c3e1fddd9863ff4d515e153ec67c0c3bbcb22
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-stein
tags: added: neutron-proactive-backport-potential
Dan Radez (dradez)
tags: removed: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/807134

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/807134
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Brian Haley (brian-haley) wrote :

Is this still seen? Can it be closed?

Brian Haley (brian-haley) wrote :

Since all the changes seem to have merged, I will close this.

Changed in neutron:
status: New → Fix Released