Sometimes VMs can't get IP when spawned concurrently

Bug #1862315 reported by Oleg Bondarev on 2020-02-07
Affects: neutron
Importance: High
Assigned to: Oleg Bondarev

Bug Description

Version: Stein
Scenario description:
Rally creates 60 VMs with 6 threads. Each thread:
 - creates a VM
 - pings it
 - if the ping succeeds, tries to reach the VM via ssh and execute a command, retrying for up to 2 minutes
 - if ssh succeeds, deletes the VM
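
The per-thread steps above can be sketched roughly as follows. This is not Rally code: `run_iteration`, `run_scenario` and the injected callables (`create_vm`, `ping`, `ssh_exec`, `delete_vm`) are hypothetical stand-ins, and the fake ping result in the demo is made up purely to exercise the flow.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_iteration(create_vm, ping, ssh_exec, delete_vm,
                  ssh_timeout=120.0, retry_interval=1.0):
    """One iteration: create a VM, ping it, retry ssh for up to ssh_timeout."""
    vm = create_vm()
    if not ping(vm):
        return "ping failed"
    deadline = time.monotonic() + ssh_timeout
    while True:
        if ssh_exec(vm):
            delete_vm(vm)  # the VM is deleted only after a successful ssh
            return "ok"
        if time.monotonic() >= deadline:
            return "ssh failed"
        time.sleep(retry_interval)


def run_scenario(iteration, total=60, threads=6):
    """Run `total` iterations on a pool of `threads` workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(iteration, range(total)))


# Demo with fake callables: every 10th VM "fails" to answer ping.
def fake_iteration(index):
    return run_iteration(
        create_vm=lambda: index,
        ping=lambda vm: vm % 10 != 0,
        ssh_exec=lambda vm: True,
        delete_vm=lambda vm: None,
    )


results = run_scenario(fake_iteration)
print(results.count("ok"), results.count("ping failed"))
```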

For some VMs the ping fails. The console log shows that the VM failed to get an IP from DHCP.

tcpdump on the corresponding DHCP port shows the VM's DHCP requests, but dnsmasq does not reply.
From dnsmasq logs:

Feb 6 00:15:43 dnsmasq[4175]: read /var/lib/neutron/dhcp/da73026e-09b9-4f8d-bbdd-84d89c2487b2/addn_hosts - 28 addresses
Feb 6 00:15:43 dnsmasq[4175]: duplicate dhcp-host IP address 10.2.0.194 at line 28 of /var/lib/neutron/dhcp/da73026e-09b9-4f8d-bbdd-84d89c2487b2/host
...
Feb 6 00:15:48 dnsmasq-dhcp[4175]: 1436802562 DHCPDISCOVER(tap7216a777-13) fa:16:3e:b1:a7:f2 no address available

So something must be wrong with the neutron-dhcp-agent network cache.
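
The duplicate that dnsmasq reports can be checked directly against the host file. A minimal sketch, assuming each line has the comma-separated form `mac,hostname,ip` with the IP as the last field; the helper name `find_duplicate_ips` is hypothetical, and the sample entries are made up apart from the MAC and IP that appear in the log above:

```python
from collections import defaultdict


def find_duplicate_ips(host_file_text):
    """Return {ip: [line numbers]} for IPs listed on more than one line."""
    lines_by_ip = defaultdict(list)
    for lineno, line in enumerate(host_file_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        # Assumption: the IP address is the last comma-separated field.
        ip = line.split(",")[-1]
        lines_by_ip[ip].append(lineno)
    return {ip: nums for ip, nums in lines_by_ip.items() if len(nums) > 1}


# Made-up sample content; a real file would be read from the `host` path
# shown in the dnsmasq log.
sample = "\n".join([
    "fa:16:3e:00:00:01,host-10-2-0-193.openstacklocal,10.2.0.193",
    "fa:16:3e:00:00:02,host-10-2-0-194.openstacklocal,10.2.0.194",
    "fa:16:3e:b1:a7:f2,host-10-2-0-194.openstacklocal,10.2.0.194",
])
duplicates = find_duplicate_ips(sample)
print(duplicates)
```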

From neutron-dhcp-agent log:

2020-02-06 00:15:20.282 40 DEBUG neutron.agent.dhcp.agent [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] Resync event has been scheduled _periodic_resync_helper /var/lib/openstack/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:276
2020-02-06 00:15:20.282 40 DEBUG neutron.common.utils [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] Calling throttled function clear wrapper /var/lib/openstack/lib/python3.6/site-packages/neutron/common/utils.py:102
2020-02-06 00:15:20.283 40 DEBUG neutron.agent.dhcp.agent [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] resync (da73026e-09b9-4f8d-bbdd-84d89c2487b2): ['Duplicate IP addresses found, DHCP cache is out of sync'] _periodic_resync_helper /var/lib/openstack/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py:293

So the agent is aware of the invalid cache for the network, but for an unknown reason the actual network resync happens only more than 8 minutes later:

2020-02-06 00:23:55.297 40 INFO neutron.agent.dhcp.agent [req-f5107bdd-d53a-4171-a283-de3d7cf7c708 - - - - -] Synchronizing state
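
For reference, the gap between the "Resync event has been scheduled" line and the "Synchronizing state" line can be computed from the two timestamps quoted above. A small sketch (the variable names are arbitrary):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two neutron-dhcp-agent log lines above.
scheduled = datetime.strptime("2020-02-06 00:15:20.282", FMT)
synced = datetime.strptime("2020-02-06 00:23:55.297", FMT)

delay_seconds = (synced - scheduled).total_seconds()
print(f"resync delay: {delay_seconds:.3f} s ({delay_seconds / 60:.1f} min)")
```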

Oleg Bondarev (obondarev) wrote :

The dhcp agent's sync_state() function is supposed to take a write lock and proceed. It seems to wait a long time for the lock because many regular update events keep arriving in the agent's queue (due to many concurrent port create/update/delete operations).
These update events take read locks.
The write lock is supposed to have higher priority than the read lock, but for some reason that doesn't work.

Oleg Bondarev (obondarev) wrote :

Looking at ReaderWriterLock from the fasteners lib, specifically the read_lock() function (which is eventually used by the dhcp agent):
https://github.com/harlowja/fasteners/blob/0.15/fasteners/lock.py#L160-L202

The function is supposed to "wait until no active or pending writers", but from the code it looks like it only checks that the current thread is not among the pending writers and doesn't actually check that there are no other pending writers. I might be missing something, so I'm going to contact @harlowja.
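
To illustrate what "wait until no active or pending writers" would mean in practice, here is a toy writer-preferring lock. It is not the fasteners implementation; the class `WriterPreferringLock` and its methods are hypothetical and exist only to show new readers yielding to any pending writer.

```python
import threading
import time


class WriterPreferringLock:
    """Toy reader-writer lock: new readers wait while ANY writer is pending."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._pending_writers = 0

    def acquire_read(self):
        with self._cond:
            # The key check: block on active AND pending writers.
            while self._writer_active or self._pending_writers > 0:
                self._cond.wait()
            self._active_readers += 1

    def release_read(self):
        with self._cond:
            self._active_readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._pending_writers += 1
            while self._writer_active or self._active_readers > 0:
                self._cond.wait()
            self._pending_writers -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()


events = []
lock = WriterPreferringLock()
lock.acquire_read()  # an early reader already holds the lock


def writer():
    lock.acquire_write()
    events.append("write")
    lock.release_write()


def late_reader():
    lock.acquire_read()
    events.append("late read")
    lock.release_read()


w = threading.Thread(target=writer)
w.start()
# Wait until the writer is registered as pending before starting the reader.
while True:
    with lock._cond:
        if lock._pending_writers == 1:
            break
    time.sleep(0.01)
r = threading.Thread(target=late_reader)
r.start()
lock.release_read()  # early reader leaves; the pending writer goes first
w.join()
r.join()
print(events)
```

With this check in place, a steady stream of readers cannot starve a writer: the reader that arrives after the writer queues up behind it.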

description: updated
Oleg Bondarev (obondarev) wrote :

Also, commit https://review.opendev.org/#/c/626830/ introduced priorities for port notifications so that port delete has lower priority than port create. This leads to "Duplicate IP addresses" when an IP is reused quickly. Why not use the same priority for all port events (still keeping PORT_CREATE_HIGH)?
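
The effect of the priority ordering can be shown with a small simulation. This is only a sketch, not the agent's queue: the function `process`, the numeric priority values, and the port names are all hypothetical, and a lower number is assumed to mean higher priority.

```python
import heapq
from itertools import count


def process(events, priorities, initial_cache):
    """Drain queued events in (priority, arrival) order.

    Returns the set of IPs that were, at some point, cached for two ports.
    """
    seq = count()  # arrival order breaks ties between equal priorities
    queue = []
    for action, port_id, ip in events:
        heapq.heappush(queue, (priorities[action], next(seq), action, port_id, ip))

    cache = dict(initial_cache)  # port_id -> ip
    duplicated = set()
    while queue:
        _, _, action, port_id, ip = heapq.heappop(queue)
        if action == "delete":
            cache.pop(port_id, None)
        else:
            if any(p != port_id and cached == ip for p, cached in cache.items()):
                duplicated.add(ip)
            cache[port_id] = ip
    return duplicated


# port-a is deleted and its IP is immediately reused by port-b.
backlog = [("delete", "port-a", "10.2.0.194"),
           ("create", "port-b", "10.2.0.194")]
cache = {"port-a": "10.2.0.194"}

with_low_delete = process(backlog, {"create": 1, "delete": 2}, cache)
with_equal = process(backlog, {"create": 1, "delete": 1}, cache)
print(with_low_delete, with_equal)
```

When delete has lower priority, the create for port-b is handled while port-a is still cached, so the IP appears twice; with equal priorities the events are handled in arrival order and no duplicate shows up.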

Reviewed: https://review.opendev.org/707077
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a0bb5763b277d7360554fd64813fdce01244d2fa
Submitter: Zuul
Branch: master

commit a0bb5763b277d7360554fd64813fdce01244d2fa
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423

Reviewed: https://review.opendev.org/708125
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=de29b49b132b84f07329780958b503edbd5c77f6
Submitter: Zuul
Branch: stable/queens

commit de29b49b132b84f07329780958b503edbd5c77f6
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/708124
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9499467caedfd2b30d2514f15b41cbde6e8a9fe6
Submitter: Zuul
Branch: stable/rocky

commit 9499467caedfd2b30d2514f15b41cbde6e8a9fe6
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/708122
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e894904a7d8dbb0e01d9c4d18b524de4b436ead3
Submitter: Zuul
Branch: stable/train

commit e894904a7d8dbb0e01d9c4d18b524de4b436ead3
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-train

Reviewed: https://review.opendev.org/708123
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5a0c3e1fddd9863ff4d515e153ec67c0c3bbcb22
Submitter: Zuul
Branch: stable/stein

commit 5a0c3e1fddd9863ff4d515e153ec67c0c3bbcb22
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-stein
tags: added: neutron-proactive-backport-potential
Dan Radez (dradez) on 2020-09-01
tags: removed: neutron-proactive-backport-potential