some ports does not become ACTIVE during provisioning

Bug #1760047 reported by Piotr Misiak on 2018-03-30
50
This bug affects 8 people
Affects Status Importance Assigned to Milestone
neutron
High
yangjianfeng

Bug Description

release: Pike

When I try to spawn a 100 VMs at once almost every time it finishes with a dozen VMs in ERROR state with cause:

Error: Build of instance 30ba80c7-6588-4916-80cb-ed3192b61e36 aborted: Failed to allocate the network(s), not rescheduling

I debugged this a bit and figured out that a neutron port connected to failed VM didn't become active because the DHCP provisioning block entity hadn't been removed for this port.

For this port I see a following logs in the DHCP agent:

2018-03-30 08:05:21.131 7 DEBUG neutron.agent.dhcp.agent [req-b23f4cd3-bdd8-4fbc-bda5-27ad030109b5 aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:29Z, description=, device_id=f101ca2f-f107-4268-9dca-816d4fce7979, device_owner=, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.25'}], id=35c4b052-c7d8-487b-a213-aef3a3f4e9d3, mac_address=fa:16:3e:28:89:81, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=3, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:29Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409
2018-03-30 08:05:27.376 7 DEBUG neutron.agent.dhcp.agent [req-e391a8ef-be5a-4768-b9b7-39398731f2de 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute12, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:29Z, description=, device_id=f101ca2f-f107-4268-9dca-816d4fce7979, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.25'}], id=35c4b052-c7d8-487b-a213-aef3a3f4e9d3, mac_address=fa:16:3e:28:89:81, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=4, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:30Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409
2018-03-30 08:05:27.378 7 DEBUG neutron.agent.dhcp.agent [req-e391a8ef-be5a-4768-b9b7-39398731f2de 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute12, binding:profile=, binding:vif_details=datapath_type=system, ovs_hybrid_plug=True, port_filter=True, binding:vif_type=ovs, binding:vnic_type=normal, created_at=2018-03-30T07:59:29Z, description=, device_id=f101ca2f-f107-4268-9dca-816d4fce7979, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.25'}], id=35c4b052-c7d8-487b-a213-aef3a3f4e9d3, mac_address=fa:16:3e:28:89:81, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=5, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:31Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409
2018-03-30 08:08:50.283 7 DEBUG neutron.agent.dhcp.agent [req-b23f4cd3-bdd8-4fbc-bda5-27ad030109b5 aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:29Z, description=, device_id=f101ca2f-f107-4268-9dca-816d4fce7979, device_owner=, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.25'}], id=35c4b052-c7d8-487b-a213-aef3a3f4e9d3, mac_address=fa:16:3e:28:89:81, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=3, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:29Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409
2018-03-30 08:09:01.605 7 DEBUG neutron.agent.dhcp.agent [req-e391a8ef-be5a-4768-b9b7-39398731f2de 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute12, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:29Z, description=, device_id=f101ca2f-f107-4268-9dca-816d4fce7979, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.25'}], id=35c4b052-c7d8-487b-a213-aef3a3f4e9d3, mac_address=fa:16:3e:28:89:81, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=4, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:30Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409
2018-03-30 08:09:01.608 7 DEBUG neutron.agent.dhcp.agent [req-e391a8ef-be5a-4768-b9b7-39398731f2de 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute12, binding:profile=, binding:vif_details=datapath_type=system, ovs_hybrid_plug=True, port_filter=True, binding:vif_type=ovs, binding:vnic_type=normal, created_at=2018-03-30T07:59:29Z, description=, device_id=f101ca2f-f107-4268-9dca-816d4fce7979, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.25'}], id=35c4b052-c7d8-487b-a213-aef3a3f4e9d3, mac_address=fa:16:3e:28:89:81, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=5, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:31Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409

In comparison on a working port which transitioned to ACTIVE state I have:

2018-03-30 07:59:55.399 7 INFO neutron.agent.dhcp.agent [req-15e628bd-48f3-4fae-b27c-4364483d114a aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:28Z, description=, device_id=2ca93b42-3da0-456b-92ec-f75bdf97165b, device_owner=, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.12'}], id=1176eac0-2fb0-4088-98c8-626af0484b17, mac_address=fa:16:3e:2c:fe:68, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=3, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:28Z
2018-03-30 08:05:23.448 7 INFO neutron.agent.dhcp.agent [req-7fc32b9a-a76a-46e0-82c9-c7a674f7262f 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute13, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:28Z, description=, device_id=2ca93b42-3da0-456b-92ec-f75bdf97165b, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.12'}], id=1176eac0-2fb0-4088-98c8-626af0484b17, mac_address=fa:16:3e:2c:fe:68, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=4, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:29Z
2018-03-30 08:05:26.051 7 INFO neutron.agent.dhcp.agent [req-7fc32b9a-a76a-46e0-82c9-c7a674f7262f 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute13, binding:profile=, binding:vif_details=datapath_type=system, ovs_hybrid_plug=True, port_filter=True, binding:vif_type=ovs, binding:vnic_type=normal, created_at=2018-03-30T07:59:28Z, description=, device_id=2ca93b42-3da0-456b-92ec-f75bdf97165b, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.12'}], id=1176eac0-2fb0-4088-98c8-626af0484b17, mac_address=fa:16:3e:2c:fe:68, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=5, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:30Z
2018-03-30 08:08:43.248 7 INFO neutron.agent.dhcp.agent [req-15e628bd-48f3-4fae-b27c-4364483d114a aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:28Z, description=, device_id=2ca93b42-3da0-456b-92ec-f75bdf97165b, device_owner=, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.12'}], id=1176eac0-2fb0-4088-98c8-626af0484b17, mac_address=fa:16:3e:2c:fe:68, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=3, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:28Z
2018-03-30 08:08:54.850 7 INFO neutron.agent.dhcp.agent [req-7fc32b9a-a76a-46e0-82c9-c7a674f7262f 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute13, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-03-30T07:59:28Z, description=, device_id=2ca93b42-3da0-456b-92ec-f75bdf97165b, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.12'}], id=1176eac0-2fb0-4088-98c8-626af0484b17, mac_address=fa:16:3e:2c:fe:68, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=4, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:29Z
2018-03-30 08:08:59.332 7 INFO neutron.agent.dhcp.agent [req-7fc32b9a-a76a-46e0-82c9-c7a674f7262f 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute13, binding:profile=, binding:vif_details=datapath_type=system, ovs_hybrid_plug=True, port_filter=True, binding:vif_type=ovs, binding:vnic_type=normal, created_at=2018-03-30T07:59:28Z, description=, device_id=2ca93b42-3da0-456b-92ec-f75bdf97165b, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.12'}], id=1176eac0-2fb0-4088-98c8-626af0484b17, mac_address=fa:16:3e:2c:fe:68, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=5, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-03-30T07:59:30Z

I have no idea why for some ports DHCP agent think a port is stale but for others it's not.

Any clues?

Piotr Misiak (piotr-misiak) wrote :

I forgot to mention that we have Neutron 11.0.3

Is it reproducible in small scale or only seen in large scale.
Yes we need to figure out why the dhcp is thinking certain ports are stale ports.

Needs triaging.

Piotr Misiak (piotr-misiak) wrote :

When I spawn one or a few or a dozen VMs at once I can't reproduce the issue.
Seems the issue is only during large provisioning.

I will try to add some additional debug messages to the code to find the reason of "stale port".

Piotr Misiak (piotr-misiak) wrote :
Download full text (9.7 KiB)

So in my case the port is stale in DHCP agent because it is in NetworkCache.deleted_ports set.

The fun fact is that those problematic ports never hit NetworkCache.put_port() function. I added extra debug message in this function but I have no messages with those problematic ports (I have messages with other ports).

For example port dbe7ba5a-7340-428c-a6a6-bda7843dede8.
The port has been created around 07:23:39 by Nova.
The port has been deleted around 07:28:46 by Nova after 300 sec timeout.

Log from neutron-server.log:

2018-04-04 07:23:39.071 33 DEBUG neutron.db.provisioning_blocks [req-de2bd147-5db1-435a-864b-c10b63a79c1f aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - default default] Transition to ACTIVE for port object dbe7ba5a-7340-428c-a6a6-bda7843dede8 will not be triggered until provisioned by entity DHCP. add_provisioning_component /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73

DHCP agent logs on ctrl3:

grep -B1 dbe7ba5a-7340-428c-a6a6-bda7843dede8 neutron-dhcp-agent.log

2018-04-04 07:29:35.504 7 DEBUG neutron.agent.dhcp.agent [req-de2bd147-5db1-435a-864b-c10b63a79c1f aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] PMDEBUG is_port_message_stale: port in deleted_ports is_port_message_stale /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:624
2018-04-04 07:29:35.504 7 DEBUG neutron.agent.dhcp.agent [req-de2bd147-5db1-435a-864b-c10b63a79c1f aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-04-04T07:23:38Z, description=, device_id=74ec5382-110f-4590-a897-2ecdd46bcf4f, device_owner=, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.26'}], id=dbe7ba5a-7340-428c-a6a6-bda7843dede8, mac_address=fa:16:3e:eb:f2:64, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=3, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-04-04T07:23:38Z port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:409
--
2018-04-04 07:29:35.512 7 DEBUG neutron.agent.dhcp.agent [req-fc24211e-33ef-4dcb-8104-8442dcbfe6f1 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] PMDEBUG is_port_message_stale: port in deleted_ports is_port_message_stale /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:624
2018-04-04 07:29:35.513 7 DEBUG neutron.agent.dhcp.agent [req-fc24211e-33ef-4dcb-8104-8442dcbfe6f1 2f1c221e2bfb476bb9949244268b9596 fecb9ee2da694abeb723e2068b29ac3e - - -] Discarding stale port update: admin_state_up=True, allowed_address_pairs=[], binding:host_id=cds-compute11, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-04-04T07...

Read more...

Piotr Misiak (piotr-misiak) wrote :

Finally I found it.. DHCP agent waits a couple of minutes to acquire lock on network in function DhcpAgent.port_update_end():

2018-04-04 13:48:50.457 7 DEBUG oslo_concurrency.lockutils [req-a01c7740-95a0-44ab-8657-5e04ad82d1b5 aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Acquired semaphore "dhcp-agent-network-lock-b86c5078-65fe-480c-b658-300274dba0a2" lock /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212

In meantime DHCP agent receives rpc call DhcpAgent.port_delete_end() and adds the port to the deleted_ports() set.

So... this port will never be configured on DHCP agent side.

Piotr Misiak (piotr-misiak) wrote :
Download full text (9.1 KiB)

Here is code with my debug message:

    @_wait_if_syncing
    def port_update_end(self, context, payload):
        """Handle the port.update.end notification event."""
        updated_port = dhcp.DictModel(payload['port'])
        LOG.debug("PMDEBUG port_update_end: port %s , payload %s",updated_port,payload)
        with _net_lock(updated_port.network_id):
            if self.cache.is_port_message_stale(payload['port']):
                LOG.debug("Discarding stale port update: %s", updated_port)
                return

Here is log:

2018-04-04 13:38:45.697 7 DEBUG neutron.agent.dhcp.agent [req-a01c7740-95a0-44ab-8657-5e04ad82d1b5 aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] PMDEBUG port_update_end: port admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2018-04-04T13:38:43Z, description=, device_id=64869685-5c86-4733-a947-bffbdc113f79, device_owner=, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.5'}], id=2a00efeb-d5e6-4070-b2b8-64b484368808, mac_address=fa:16:3e:24:7e:40, name=, network_id=b86c5078-65fe-480c-b658-300274dba0a2, port_security_enabled=True, project_id=579ef78d275a4019865bca2057af3caf, revision_number=3, security_groups=[u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], status=DOWN, tags=[], tenant_id=579ef78d275a4019865bca2057af3caf, updated_at=2018-04-04T13:38:43Z , payload {u'port': {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2018-04-04T13:38:43Z', u'device_owner': u'', u'revision_number': 3, u'binding:profile': {}, u'port_security_enabled': True, u'fixed_ips': [{u'subnet_id': u'bd145992-d7fe-4fc8-acc7-ba510f8388e6', u'ip_address': u'10.7.0.5'}], u'id': u'2a00efeb-d5e6-4070-b2b8-64b484368808', u'security_groups': [u'b165182d-cd0f-42a5-b5d4-5434b9079aa8'], u'binding:vif_details': {}, u'binding:vif_type': u'unbound', u'mac_address': u'fa:16:3e:24:7e:40', u'project_id': u'579ef78d275a4019865bca2057af3caf', u'status': u'DOWN', u'binding:host_id': u'', u'description': u'', u'tags': [], u'device_id': u'64869685-5c86-4733-a947-bffbdc113f79', u'name': u'', u'admin_state_up': True, u'network_id': u'b86c5078-65fe-480c-b658-300274dba0a2', u'tenant_id': u'579ef78d275a4019865bca2057af3caf', u'created_at': u'2018-04-04T13:38:43Z', u'binding:vnic_type': u'normal'}} port_update_end /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:407
2018-04-04 13:48:50.457 7 DEBUG oslo_concurrency.lockutils [req-a01c7740-95a0-44ab-8657-5e04ad82d1b5 aae866ac58414f81953c57df2494eaf9 579ef78d275a4019865bca2057af3caf - - -] Acquired semaphore "dhcp-agent-network-lock-b86c5078-65fe-480c-b658-300274dba0a2" lock /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-04-04 13:48:50.457 7 DEBUG neutron.agent.dhcp.agent [req-a0...

Read more...

Lujin Luo (luo-lujin) wrote :

I am not sure if it could possibly due to large number of VMs spawn at the same time ran out of threads in your env. It is also difficult for me to reproduce it now, but I am marking this as confirmed. Would you please share more details of your environment? I see you are using Kolla. Do you deploy AIO or multi-nodes? Are you running Kolla in VM or physical machine? etc..

Changed in neutron:
status: New → Confirmed
Piotr Misiak (piotr-misiak) wrote :

The env has 3 controller nodes and 40 compute nodes. But for debugging purposes I spawn VMs on 6 compute nodes with debugging enabled. The env is not busy. I use kolla.

Piotr Misiak (piotr-misiak) wrote :

All servers all physical. We use DVR, L3_HA and L2 population.

dhcp_agent.ini:

[DEFAULT]
dnsmasq_config_file = /etc/neutron/dnsmasq.conf
enable_isolated_metadata = true
force_metadata = true
dnsmasq_dns_servers = 8.8.8.8,8.8.4.4

[ovs]
ovsdb_interface = native
ovsdb_connection = tcp:127.0.0.1:6640

most important settings from neutron.conf:

[DEFAULT]
debug = True
api_workers = 20
metadata_workers = 20
rpc_workers = 12
rpc_state_report_workers = 12
interface_driver = openvswitch
core_plugin = ml2
service_plugins = lbaasv2,firewall,router
dhcp_agents_per_network = 2
max_l3_agents_per_router = 3
router_distributed = True
l3_ha = true

[oslo_concurrency]
lock_path = /var/lib/neutron/tmp

Piotr Misiak (piotr-misiak) wrote :

Maybe I have a hint. I spawn 100 VMs using horizon. So VMs have a name with a numeric suffix from 1 to 100. I think (I'm not sure) each time ERROR state have VMs with suffixes between 10 and 20 one by one, for example: 11,12,13,14,15 and VMs from 1 to 10 and from 16 to 100 are OK.
Maybe it is correlated with rpc_workers? But why does this happens only once during provisioning and not every 12 VMs?

Does semaphore locking need an additional worker?

Weird thing is that when one worker is waiting for the network lock, other workers are locking and releasing the same lock without any problems, many times.

Changed in neutron:
importance: Undecided → High
status: Confirmed → Triaged
LIU Yulong (dragon889) wrote :

We meet such issue too in stable/queens, the root cause was same as Piotr Misiak (piotr-misiak) said in comment #7.
I submitted a new bug here, but now it is marked duplicate to this.
https://bugs.launchpad.net/neutron/+bug/1778616

Fan Zhang (fanzhang) wrote :

Hi, any updates or recommendations from neutron team?

yangjianfeng (yangjianfeng) wrote :

As LIU Yulong described in bug/1778616, The neutron must get the `Provisioning complete` notification for port before it sends 'network-vif-plugge' event to nova.

For current code, we only need one dhcp agent call dhcp_ready_on_ports to remove port's 'provisioning block'. But The 'Provisioning complete' port is still blocked other port's processing in other dhcp agents. So, I suggest that we can optimize the dispatch algorithm to improve the process.

In server side, We can cast two levels message for one port to agent. The high-level message only cast to one agent. The low-level message cast to all agent.
In agent side, We preferentially process the high-level message. The low-level message will be put to cache and processing later.

In this way, the neutron will get `Provisioning complete` notification for per port rapidly.

Changed in neutron:
assignee: nobody → yangjianfeng (yangjianfeng)
LIU Yulong (dragon889) on 2018-12-30
Changed in neutron:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/626830
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=99f4495c940011293e3cabbb590770dc1e7b6900
Submitter: Zuul
Branch: master

commit 99f4495c940011293e3cabbb590770dc1e7b6900
Author: yangjianfeng <email address hidden>
Date: Fri Dec 21 14:08:38 2018 +0800

    Improve port dhcp Provisioning

    Currently, the dhcp Provisioning of ports is the crucial bottleneck
    of that concurrently boot multiple VM.

    The root cause is that these ports will be processed one by one by dhcp
    agent when they belong to the same network, And the 'Provisioning complete'
    port is still blocked other port's processing in other dhcp agents. The
    patch aim to optimize the dispatch strategy of the port cast to agent to
    improve the Provisioning process.

    In server side, I classify messages to multi levels. Especially, I classify
    the port_update_end or port_create_end message to two levels, the high-level
    message only cast to one agent, the low-level message cast to all agent. In
    agent side I put these messages to `resource_processing_queue`, with the queue,
    We can delete `_net_lock` and process these messages in order of priority.

    Additonally, I modified the `resource_processing_queue` for my demand. I update
    `_queue` from LIST to PriorityQueue in `ExclusiveResourceProcessor`, by this
    way, we can sort all message which cached in `ExclusiveResourceProcessor` by
    priority.

    Related-Bug: #1760047
    Change-Id: I255caa0571c42fb012fe882259ef181070beccef

tags: added: neutron-proactive-backport-potential

Hi all, I am also experiencing same kind of issue. I am using latest neutron code

root@ctrl3:~# neutron-dhcp-agent --version
neutron-dhcp-agent 14.0.0.0b1
14.0.0~b1~git2019013137.7484700deb-0ubuntu1~cloud0

But the funny part is mine is not a busy environment. I have 3 ctrls and 5 compute nodes. When i try to launch a vm it may fail. But if i restart the dhcp agents while a vm spawns , then the port gets DHCP provisining completed.

I think there is a problem with network cache in my case.

Can any one explain how does network cache works in the agent case?

May be DHCP agent port cache is out of sync with neutron server.

Gaëtan Trellu (goldyfruit) wrote :

Any chance to have this cherry pricked in Rocky ?

Reviewed: https://review.openstack.org/649573
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=056e049e2b9c7db0802092691d63dcc1b9705a29
Submitter: Zuul
Branch: stable/rocky

commit 056e049e2b9c7db0802092691d63dcc1b9705a29
Author: yangjianfeng <email address hidden>
Date: Fri Dec 21 14:08:38 2018 +0800

    Improve port dhcp Provisioning

    Currently, the dhcp Provisioning of ports is the crucial bottleneck
    of that concurrently boot multiple VM.

    The root cause is that these ports will be processed one by one by dhcp
    agent when they belong to the same network, And the 'Provisioning complete'
    port is still blocked other port's processing in other dhcp agents. The
    patch aim to optimize the dispatch strategy of the port cast to agent to
    improve the Provisioning process.

    In server side, I classify messages to multi levels. Especially, I classify
    the port_update_end or port_create_end message to two levels, the high-level
    message only cast to one agent, the low-level message cast to all agent. In
    agent side I put these messages to `resource_processing_queue`, with the queue,
    We can delete `_net_lock` and process these messages in order of priority.

    Additonally, I modified the `resource_processing_queue` for my demand. I update
    `_queue` from LIST to PriorityQueue in `ExclusiveResourceProcessor`, by this
    way, we can sort all message which cached in `ExclusiveResourceProcessor` by
    priority.

    Conflicts:
        neutron/tests/unit/agent/dhcp/test_agent.py

    Related-Bug: #1760047
    Change-Id: I255caa0571c42fb012fe882259ef181070beccef
    (cherry picked from commit 99f4495c940011293e3cabbb590770dc1e7b6900)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/649580
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dd5e89e717ef0993d1de4c97d1f0d3132f2845e5
Submitter: Zuul
Branch: stable/queens

commit dd5e89e717ef0993d1de4c97d1f0d3132f2845e5
Author: yangjianfeng <email address hidden>
Date: Fri Dec 21 14:08:38 2018 +0800

    Improve port dhcp Provisioning

    Currently, the dhcp Provisioning of ports is the crucial bottleneck
    of that concurrently boot multiple VM.

    The root cause is that these ports will be processed one by one by dhcp
    agent when they belong to the same network, And the 'Provisioning complete'
    port is still blocked other port's processing in other dhcp agents. The
    patch aim to optimize the dispatch strategy of the port cast to agent to
    improve the Provisioning process.

    In server side, I classify messages to multi levels. Especially, I classify
    the port_update_end or port_create_end message to two levels, the high-level
    message only cast to one agent, the low-level message cast to all agent. In
    agent side I put these messages to `resource_processing_queue`, with the queue,
    We can delete `_net_lock` and process these messages in order of priority.

    Additonally, I modified the `resource_processing_queue` for my demand. I update
    `_queue` from LIST to PriorityQueue in `ExclusiveResourceProcessor`, by this
    way, we can sort all message which cached in `ExclusiveResourceProcessor` by
    priority.

    Conflicts:
        neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py

    Related-Bug: #1760047
    Change-Id: I255caa0571c42fb012fe882259ef181070beccef
    (cherry picked from commit 99f4495c940011293e3cabbb590770dc1e7b6900)
    (cherry picked from commit 740295d94dad0063663da3e2db96f9920e813a90)

tags: added: in-stable-queens
Michael Polenchuk (mpolenchuk) wrote :

After this patch we've started to get "Duplicate IP addresses found, DHCP cache is out of sync" messages (but actually it ain't true, see log below) really often, which sometimes causes vm unable to get an ip address from DHCP server. Therefore resync with neutron server is triggered causing unwanted events.

CACHE: [{u'status': u'DOWN', u'binding:host_id': u'', u'description': u'', u'allowed_address_pairs': [], u'tags': [], u'extra_dhcp_opts': [], u'updated_at': u'2019-04-19T09:23:02Z', u'device_owner': u'network:router_interface', u'revision_number': 3, u'binding:profile': {}, u'port_security_enabled': True, u'fixed_ips': [{u'subnet_id': u'fdb78251-4c22-4716-aead-911a3d5651ee', u'ip_address': u'192.168.120.3'}], u'id': u'ea0242a0-36d1-4fce-bc4d-e0b4915ece57', u'security_groups': [u'bbd520f2-37ff-4cd2-91a9-5f613b818d1b'], u'device_id': u'bb304e0f-f497-4133-ad24-9a756ff0495d', u'name': u'', u'admin_state_up': True, u'network_id': u'20cafb8f-5edf-4074-a6ec-0553ab5150bd', u'tenant_id': u'2cf1eb36e2cc4f8991659fbc1b6d39eb', u'binding:vif_details': {}, u'binding:vnic_type': u'normal', u'binding:vif_type': u'unbound', u'mac_address': u'fa:16:3e:5d:a9:ce', u'project_id': u'2cf1eb36e2cc4f8991659fbc1b6d39eb', u'created_at': u'2019-04-19T09:22:58Z'}, {u'status': u'ACTIVE', u'binding:host_id': u'gtw01', u'description': u'', u'allowed_address_pairs': [], u'tags': [], u'extra_dhcp_opts': [], u'updated_at': u'2019-04-19T09:23:03Z', u'device_owner': u'network:dhcp', u'revision_number': 3, u'binding:profile': {}, u'port_security_enabled': False, u'fixed_ips': [{u'subnet_id': u'fdb78251-4c22-4716-aead-911a3d5651ee', u'ip_address': u'192.168.120.2'}], u'id': u'8e27eaa5-92c0-49ff-919d-9808da6ef61e', u'security_groups': [], u'device_id': u'dhcpf3c106f5-c8d0-5c49-8f89-8de3aa25f954-20cafb8f-5edf-4074-a6ec-0553ab5150bd', u'name': u'', u'admin_state_up': True, u'network_id': u'20cafb8f-5edf-4074-a6ec-0553ab5150bd', u'tenant_id': u'2cf1eb36e2cc4f8991659fbc1b6d39eb', u'binding:vif_details': {u'support_vhost_user': False, u'has_datapath_type_netdev': False, u'uuid': u'2f0a948b-d36d-4e46-8e75-82a01a90c6f8', u'host_addresses': [u'10.1.0.2']}, u'binding:vnic_type': u'normal', u'binding:vif_type': u'ovs', u'mac_address': u'fa:16:3e:61:b4:19', u'project_id': u'2cf1eb36e2cc4f8991659fbc1b6d39eb', u'created_at': u'2019-04-19T09:22:58Z'}]

NEW: admin_state_up=True, allowed_address_pairs=[], binding:host_id=, binding:profile=, binding:vif_details=, binding:vif_type=unbound, binding:vnic_type=normal, created_at=2019-04-19T09:22:58Z, description=, device_id=bb304e0f-f497-4133-ad24-9a756ff0495d, device_owner=network:router_interface, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'fdb78251-4c22-4716-aead-911a3d5651ee', u'ip_address': u'192.168.120.3'}], id=ea0242a0-36d1-4fce-bc4d-e0b4915ece57, mac_address=fa:16:3e:5d:a9:ce, name=, network_id=20cafb8f-5edf-4074-a6ec-0553ab5150bd, port_security_enabled=True, project_id=2cf1eb36e2cc4f8991659fbc1b6d39eb, revision_number=2, security_groups=[u'bbd520f2-37ff-4cd2-91a9-5f613b818d1b'], status=DOWN, tags=[], tenant_id=2cf1eb36e2cc4f8991659fbc1b6d39eb, updated_at=2019-04-19T09:23:00Z

Version: 13.0.3 (+stable/rocky)

Slawek Kaplonski (slaweq) wrote :

Thx Michael for this info. I opened new bug for that https://bugs.launchpad.net/neutron/+bug/1828423

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers