DHCP agent blocks network-vif-plugged, causing instance build failures

Bug #1778616 reported by LIU Yulong
This bug affects 2 people
Affects: neutron
Status: New
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

ENV:
Neutron stable/queens (12.0.1)
CentOS 7 (3.10.0-514.26.2.el7.x86_64)
Ceph v10.2.9 Jewel

How to reproduce:
Concurrently create 256 VMs on a single network that is served by 2 DHCP agents.
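
For illustration, a minimal reproduction sketch using python-novaclient (the auth endpoint, credentials, and UUIDs below are placeholders, not values from this environment):

# Reproduction sketch: boot 256 servers in one request on a single network.
# Assumes python-novaclient and keystoneauth1; all identifiers are placeholders.
from keystoneauth1 import loading, session
from novaclient import client as nova_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3',   # placeholder endpoint
    username='demo', password='secret',     # placeholder credentials
    project_name='demo',
    user_domain_name='Default', project_domain_name='Default')
nova = nova_client.Client('2.1', session=session.Session(auth=auth))

# All 256 ports land on the same network, which in the reported setup is
# served by two DHCP agents.
nova.servers.create(
    name='dhcp-scale-test',
    image='IMAGE_UUID',                     # placeholder
    flavor='FLAVOR_ID',                     # placeholder
    min_count=256, max_count=256,
    nics=[{'net-id': 'NETWORK_UUID'}])      # placeholder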

Exception:
nova-compute side:
2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325
2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds

neutron server log:
For a failed nova boot, the neutron port only has one log entry:
2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73

The port never gets the `Provisioning complete` notification.

Furthermore, even for a successfully booted nova instance, it still takes 40 seconds for DHCP to clear the provisioning block:
2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73
2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132
2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138
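
For context, a rough illustrative sketch (not the actual neutron code) of what the provisioning_blocks mechanism in the logs above does: the port only transitions to ACTIVE, and nova's network-vif-plugged event is only sent, once every registered entity (here DHCP) has reported provisioning complete.

_blocks = {}  # port_id -> set of entities still blocking the ACTIVE transition

def add_provisioning_component(port_id, entity):
    """Register 'entity' as a blocker for the port's ACTIVE transition."""
    _blocks.setdefault(port_id, set()).add(entity)

def provisioning_complete(port_id, entity, notify_nova):
    """Clear one blocker; when none remain, trigger the ACTIVE transition."""
    pending = _blocks.get(port_id, set())
    pending.discard(entity)
    if not pending:
        _blocks.pop(port_id, None)
        notify_nova(port_id)  # e.g. send the network-vif-plugged event to nova

In the failing case above, the DHCP agent never reports provisioning complete for the port, so the ACTIVE transition never happens and nova times out after 300 seconds.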

Code:
For [1], it seems that a sync action holding the lock would block all port_update RPC handling; however, in our test there was no DHCP sync in progress while the instances were booting.
For [2], this lock is the essential issue: all the port updates come from the same network, and some of them do not get the lock for more than 300 seconds. (A simplified sketch of this locking pattern follows the references below.)

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428
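
A simplified sketch of the per-network serialization referenced in [1] and [2] (illustrative only; the real agent code differs, and reload_allocations below is a placeholder): every port_update notification for the same network funnels through one oslo.concurrency lock, so with 256 ports on a single network some updates wait behind the others for longer than nova's 300-second vif-plug timeout.

from oslo_concurrency import lockutils

def port_update_end(context, port):
    # One lock per network: port updates for ports on the same network are
    # processed strictly one at a time.
    network_id = port['network_id']
    with lockutils.lock('dhcp-agent-network-lock-%s' % network_id):
        reload_allocations(network_id)

def reload_allocations(network_id):
    """Placeholder for the per-network dnsmasq host/opts rebuild."""
    pass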

Potential solution:
Add more DHCP agents for the network; with our test settings, setting dhcp_agents_per_network = 10 made it work.
However, such a configuration is really not a good fit for a production environment.

tags: added: loa
tags: added: loadimpact
removed: loa
tags: added: l3-ipam-dhcp
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Thanks for your report, this is a tough one! Can you provide a bit more detail on your setup and the nature of your network (a single network with a single subnet)? Thanks for your analysis of the potential root cause. Is 256 the absolute maximum number of VMs being concurrently spawned?

Changed in neutron:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Marking this incomplete while waiting to hear about the scalability requirements. Also, you point out that scaling to 10 agents does help; how did you come up with that number? Does it mean that it still fails with 9 agents?

Revision history for this message
LIU Yulong (dragon889) wrote :

dhcp_agents_per_network = 10 does not seem to work either; we still got some VMs that failed to boot with the error "Failed to allocate the network(s) ... ".

@Armando
Q: Can you provide a bit more detail on your setup and the nature of your network (a single network with a single subnet)?
A: Yes, one network with one subnet.

Q: Is 256 the absolute maximum number of VMs being concurrently spawned?
A: No. We concurrently create 256 VMs, and the result is 250 ACTIVE VMs and 6 ERROR VMs with the fault "Failed to allocate the network(s) ... ".
The exception log in nova is:
"for instance with vm_state building and task_state spawning.: Timeout: 300 seconds."
Nova never gets the "network-vif-plugged" notification, so this bug looks like it belongs to neutron.

Q: Also, you point out that scaling to 10 agents does help; how did you come up with that number? Does it mean that it still fails with 9 agents?
A: We've tested with 2, 5, and 10 agents; 10 gives the fewest ERROR VMs.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I wonder if it is possible to add a test that reproduces this issue to rally and run it in the gate. IMO it would be nice, but it might not be possible to run such a "big" test on the gate infra.

LIU Yulong (dragon889)
Changed in neutron:
status: Incomplete → New
Revision history for this message
LIU Yulong (dragon889) wrote :

I think this bug is a duplicate of:
https://bugs.launchpad.net/neutron/+bug/1760047
