Very often in DHCP agent we have ""Duplicate IP addresses found, DHCP cache is out of sync"

Bug #1828423 reported by Slawek Kaplonski on 2019-05-09
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
neutron
Medium
Slawek Kaplonski

Bug Description

As described in comment https://bugs.launchpad.net/neutron/+bug/1760047/comments/24 it looks that after patch https://review.opendev.org/#/c/626830/ we have a lot of messages like "Duplicate IP addresses found, DHCP cache is out of sync" in neutron dhcp agent's logs.

To reproduce this it is enough to create new network, subnet and port in this network. It will then be allocated to dhcp agent and in logs there will be such warning.

Changed in neutron:
importance: Undecided → Low
Download full text (4.4 KiB)

Hi:

I tried to (in this order):
- Create a network
- Create a subnet

I've tested a bit adding more log information. I've seen that, when the event "network create" is processed [1], tries to retrieve the network information, NOT from the cache but from the RPC client [2]. Then in [3], this new information (network, subnet and port) is added to the cache (debug log added at the beginning of [4]):

WARNING neutron.agent.dhcp.agent [-] RAH - put network: _ns_name=qdhcp-a46c2769-fd1c-49f5-889f-6f019b727a48, admin_state_up=True, availability_zone_hints=[], availability_zones=['nova'], created_at=2019-05-20T11:20:38Z, description=, id=a46c2769-fd1c-49f5-889f-6f019b727a48, ipv4_address_scope=None, ipv6_address_scope=None, mtu=1500, name=net9, non_local_subnets=[], port_security_enabled=True, ports=[{'id': '3ba19af8-26fd-481e-bc60-2e294c74c7d4', 'name': '', 'network_id': 'a46c2769-fd1c-49f5-889f-6f019b727a48', 'tenant_id': '159b38ee91dc46a59a40536b26cc4233', 'mac_address': 'fa:16:3e:07:1f:a8', 'admin_state_up': True, 'status': 'ACTIVE', 'fixed_ips': [{'subnet_id': '38695b5e-ad25-4993-a851-8e1e18b053cf', 'ip_address': '10.103.0.2', 'subnet': {'id': '38695b5e-ad25-4993-a851-8e1e18b053cf', 'name': 'subnet3_a', 'tenant_id': '159b38ee91dc46a59a40536b26cc4233', 'network_id': 'a46c2769-fd1c-49f5-889f-6f019b727a48', 'ip_version': 4, 'subnetpool_id': None, 'enable_dhcp': True, 'ipv6_ra_mode': None, 'ipv6_address_mode': None, 'gateway_ip': '10.103.0.1', 'cidr': '10.103.0.0/24', 'allocation_pools': [{'start': '10.103.0.2', 'end': '10.103.0.254'}], 'host_routes': [], 'dns_nameservers': [], 'shared': False, 'description': '', 'service_types': [], 'tags': [], 'created_at': '2019-05-20T11:20:45Z', 'updated_at': '2019-05-20T11:20:45Z', 'revision_number': 0, 'project_id': '159b38ee91dc46a59a40536b26cc4233'}}], 'device_id': 'dhcp95cf1ac9-7fbf-5be6-bc93-e3471f4e5b33-a46c2769-fd1c-49f5-889f-6f019b727a48', 'device_owner': 'network:dhcp', 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'security_groups': [], 'description': '', 'binding:vnic_type': 'normal', 'binding:profile': {}, 'binding:host_id': 'osdev18', 'binding:vif_type': 'ovs', 'binding:vif_details': {'port_filter': True, 'ovs_hybrid_plug': False, 'datapath_type': 'system', 'bridge_name': 'br-int'}, 'port_security_enabled': False, 'qos_policy_id': None, 'resource_request': None, 'tags': [], 'created_at': '2019-05-20T11:20:46Z', 'updated_at': '2019-05-20T11:20:49Z', 'revision_number': 4, 'project_id': '159b38ee91dc46a59a40536b26cc4233'}], project_id=159b38ee91dc46a59a40536b26cc4233, provider:network_type=vlan, provider:physical_network=default, provider:segmentation_id=1072, qos_policy_id=None, revision_number=3, router:external=False, shared=False, status=ACTIVE, subnets=[{'id': '38695b5e-ad25-4993-a851-8e1e18b053cf', 'name': 'subnet3_a', 'tenant_id': '159b38ee91dc46a59a40536b26cc4233', 'network_id': 'a46c2769-fd1c-49f5-889f-6f019b727a48', 'ip_version': 4, 'subnetpool_id': None, 'enable_dhcp': True, 'ipv6_ra_mode': None, 'ipv6_address_mode': None, 'gateway_ip': '10.103.0.1', 'cidr': '10.103.0.0/24', 'allocation_pools': [{'start': '10.103.0.2', 'end': '10.103.0.254'}], 'host_routes': [], 'dns_nameservers': []...

Read more...

Slawek Kaplonski (slaweq) wrote :

Rodolfo, thx for detailed analysis of this issue.
I agree that we should handle correctly ports added to the cache when the network is created" as You said. But isn't it what https://review.opendev.org/#/c/658307/ is doing? What else You would like to do here instead of just better checking if cache is out of sync?

Hi Slaweq:

The matter of [1] is to point out the unneeded number of times we need to resync the DHCP agent. You took the second option I commented in c#2. Maybe I prefer the first one, avoiding to introduce data for other resources (port) when a network is created.

For the records: the present bug can be solved in [2].

[1] https://bugs.launchpad.net/neutron/+bug/1760047/comments/24
[2] https://review.opendev.org/#/c/658307

Farhad Sunavala (fsbiz) wrote :
Download full text (9.1 KiB)

This fix helps and removes the semaphore that was present in the prior code but it is not fixing the root cause of the problem unfortunately.

Multiple port update messages are serialized to a Priority Queue. The messages are then drained serially without the code yielding. The end result still looks similar (somewhat better than without the fix).

The below is an output from my /var/log/dnsmasq.log when 8 baremetal nodes are provisioned simultaneously.
As can be seen dnsmasq could not process DHCP packets from 23:36:00 to 23:37:33 (over 93 seconds). This leads to the PXE boots on some of the baremetals timing out.

Oct 7 23:36:00 stg-sc-control01 dnsmasq-dhcp[2328509]: DHCPACK(ns-7f4ce6ff-2a) 10.33.27.101 6c:b3:11:4f:87:40 host-10-33-27-101
Oct 7 23:36:26 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:26 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:29 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:29 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:32 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:32 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:34 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:34 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:36 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:36 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:38 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:38 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:40 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:40 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:42 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:42 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:44 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:44 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/opts
Oct 7 23:36:46 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/lib/neutron/dhcp/077aa2d1-605c-48ec-842d-7dd6767bfd01/host
Oct 7 23:36:46 stg-sc-control01 dnsmasq-dhcp[2328509]: read /var/...

Read more...

Changed in neutron:
importance: Low → Medium
Oleg Bondarev (obondarev) wrote :

Similar to my comment https://bugs.launchpad.net/neutron/+bug/1862315/comments/3 - if IP is reused quickly, we may get such "Duplicate IP addresses" messages due to lower priority of port delete events than port create events.

Reviewed: https://review.opendev.org/707077
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a0bb5763b277d7360554fd64813fdce01244d2fa
Submitter: Zuul
Branch: master

commit a0bb5763b277d7360554fd64813fdce01244d2fa
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423

Reviewed: https://review.opendev.org/708125
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=de29b49b132b84f07329780958b503edbd5c77f6
Submitter: Zuul
Branch: stable/queens

commit de29b49b132b84f07329780958b503edbd5c77f6
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/708124
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9499467caedfd2b30d2514f15b41cbde6e8a9fe6
Submitter: Zuul
Branch: stable/rocky

commit 9499467caedfd2b30d2514f15b41cbde6e8a9fe6
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/708122
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e894904a7d8dbb0e01d9c4d18b524de4b436ead3
Submitter: Zuul
Branch: stable/train

commit e894904a7d8dbb0e01d9c4d18b524de4b436ead3
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-train

Reviewed: https://review.opendev.org/708123
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5a0c3e1fddd9863ff4d515e153ec67c0c3bbcb22
Submitter: Zuul
Branch: stable/stein

commit 5a0c3e1fddd9863ff4d515e153ec67c0c3bbcb22
Author: Oleg Bondarev <email address hidden>
Date: Tue Feb 11 12:18:09 2020 +0400

    dhcp-agent: equalize port create_low/update/delete priority

    Low port delete priority may lead to duplicate entries in network
    cache if IPs are reused frequently.
    Also can't find a strict reason why it should be of lower priority.

    Change-Id: I55f858d50e636eb9091570b256380330b9ce9cb3
    Related-bug: #1862315
    Related-bug: #1828423
    (cherry picked from commit a0bb5763b277d7360554fd64813fdce01244d2fa)

tags: added: in-stable-stein
Slawek Kaplonski (slaweq) wrote :

Can we close this bug as resolved now? Or do we need anything else regarding to it?

Slawek Kaplonski (slaweq) wrote :

I think that with https://review.opendev.org/c/openstack/neutron/+/773160 we can close this one as it should be fixed now.

Changed in neutron:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers