Sometimes VMs can't get IP when spawned concurrently
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | High | Oleg Bondarev |
Bug Description
Version: Stein
Scenario description:
Rally creates 60 VMs with 6 threads. Each thread:
- creates a VM
- pings it
- if the ping succeeds, tries to reach the VM via ssh and execute a command, retrying for up to 2 minutes
- if ssh succeeds, deletes the VM
For some VMs the ping fails. The console log shows that the VM failed to get an IP from DHCP.
tcpdump on the corresponding DHCP port shows the VM's DHCP requests, but dnsmasq does not reply.
From dnsmasq logs:
Feb 6 00:15:43 dnsmasq[4175]: read /var/lib/
Feb 6 00:15:43 dnsmasq[4175]: duplicate dhcp-host IP address 10.2.0.194 at line 28 of /var/lib/
...
Feb 6 00:15:48 dnsmasq-dhcp[4175]: 1436802562 DHCPDISCOVER(
So something must be wrong with the neutron-dhcp-agent network cache.
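To confirm which entries collide when dnsmasq reports "duplicate dhcp-host IP address", the hosts file can be scanned for repeated IPs. A minimal sketch, assuming each entry has the form `<mac>,<hostname>,<ip>` (that layout, the helper name `find_duplicate_ips` and the sample MACs and hostnames are assumptions for illustration, not taken from this report):

```python
from collections import defaultdict


def find_duplicate_ips(host_lines):
    """Return {ip: [line numbers]} for IPs that appear on more than one line.

    Assumes each entry looks like "<mac>,<hostname>,<ip>" (an assumed layout);
    blank lines and '#' comments are skipped.
    """
    seen = defaultdict(list)
    for lineno, line in enumerate(host_lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ip = line.split(",")[-1]  # last comma-separated field is the IP
        seen[ip].append(lineno)
    return {ip: nums for ip, nums in seen.items() if len(nums) > 1}


# Made-up sample entries; only the IP 10.2.0.194 comes from the log above.
sample = [
    "fa:16:3e:00:00:01,host-a,10.2.0.194",
    "fa:16:3e:00:00:02,host-b,10.2.0.50",
    "fa:16:3e:00:00:03,host-c,10.2.0.194",
]
duplicates = find_duplicate_ips(sample)
print(duplicates)
```

Two entries with the same IP would be consistent with a stale port left in the agent's cache next to the new port that reused its address.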
From neutron-dhcp-agent log:
2020-02-06 00:15:20.282 40 DEBUG neutron.
2020-02-06 00:15:20.282 40 DEBUG neutron.
2020-02-06 00:15:20.283 40 DEBUG neutron.
So the agent is aware that its cache for the network is invalid, but for an unknown reason the actual network resync happens only 8 minutes later:
2020-02-06 00:23:55.297 40 INFO neutron.
The DHCP agent's sync_state() function is supposed to take a write lock and proceed, but it seems to wait for the lock for a long time because many regular update events keep arriving in the agent's queue (due to many concurrent port create/update/delete operations).
These update events take read locks.
The write lock is supposed to have higher priority than the read locks, but for some reason that does not work.
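For reference, the expected "writer has priority" behaviour can be sketched with a generic reader/writer lock. This is not the lock implementation the agent uses; the class `WriterPriorityRWLock` and the helper functions are hypothetical and only illustrate the intended semantics: once a writer is queued, new readers must wait behind it, so a steady stream of readers cannot starve the writer.

```python
import threading
import time
from contextlib import contextmanager


class WriterPriorityRWLock:
    """Illustrative reader/writer lock in which queued writers block new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    @contextmanager
    def read_lock(self):
        with self._cond:
            # New readers hold back while a writer is active OR merely queued.
            while self._writer_active or self._waiting_writers > 0:
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write_lock(self):
        with self._cond:
            self._waiting_writers += 1
            try:
                # A writer only waits for readers that already hold the lock.
                while self._writer_active or self._active_readers > 0:
                    self._cond.wait()
            finally:
                self._waiting_writers -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()


lock = WriterPriorityRWLock()
order = []


def writer():  # stands in for a full resync
    with lock.write_lock():
        order.append("writer")


def late_reader():  # stands in for a regular update event
    with lock.read_lock():
        order.append("late reader")


with lock.read_lock():  # an earlier reader is still running
    w = threading.Thread(target=writer)
    w.start()
    while lock._waiting_writers == 0:  # wait until the writer is queued
        time.sleep(0.01)
    r = threading.Thread(target=late_reader)
    r.start()
    time.sleep(0.1)  # the late reader must not overtake the queued writer
w.join()
r.join()
print(order)
```

If the `self._waiting_writers > 0` check in `read_lock` were dropped, readers arriving while other readers still hold the lock could keep entering, and the writer could wait indefinitely under a constant flow of read-locked events, which matches the symptom described above.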