Virtual interface creation failed

Bug #1951720 reported by Michal Arbet
This bug affects 6 people
Affects                    Status    Importance   Assigned to   Milestone
OpenStack Compute (nova)   Opinion   Undecided    Unassigned
neutron                    New       Undecided    Unassigned

Bug Description

Hi,

I have a testing stack of OpenStack Wallaby (deployed with kolla-ansible using kolla source images)
and have found what looks like an odd bug in nova/neutron.

I have a test Heat template that starts about 6 instances with a bunch of network interfaces and a security group - nothing special. The test OpenStack environment is clean, freshly installed, and tempest passes. So what is going on?

1. Sometimes the Heat stack is created successfully without errors.
2. Sometimes the Heat stack is created successfully - BUT with errors in nova-compute - so the retry
   mechanism works.

Errors in nova-compute:

7d241eaae5bb137aedc6fcc] [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] Took 0.10 seconds to destroy the instance on the hypervisor.
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [req-5b21ebad-ffb7-46b0-8c37-fd665d01013e 64fe2842ff8c6302c0450bee25600a10e54f2b9793e9c8776f956c993a7a7ee8 0960461696f64f82ba108f8397bf508c - e01e19b257d241eaae5bb137aedc6fcc e01e19b2
57d241eaae5bb137aedc6fcc] [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] Traceback (most recent call last):
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6930, in _create_guest_with_network
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] guest = self._create_guest(
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3.9/contextlib.py", line 124, in __exit__
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] next(self.gen)
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] actual_event = event.wait()
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/eventlet/event.py", line 125, in wait
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] result = hub.switch()
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 313, in switch
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] return self.greenlet.switch()
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] eventlet.timeout.Timeout: 300 seconds
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e]
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] During handling of the above exception, another exception occurred:
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e]
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] Traceback (most recent call last):
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2366, in _build_and_run_instance
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] self.driver.spawn(context, instance, image_meta,
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3885, in spawn
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] self._create_guest_with_network(
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6956, in _create_guest_with_network
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] raise exception.VirtualInterfaceCreateException()
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e] nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2021-11-03 14:56:42.453 7 ERROR nova.compute.manager [instance: 69b8dd7e-a9db-44fb-ab41-832942cb9e7e]

So, it looks like nova did not receive the message from neutron that the VIF was plugged within 300 seconds.

Then I patched the nova code with this review so that I could see the timing of this event.

https://review.opendev.org/c/openstack/nova/+/817030
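
The patch essentially just measures and logs how long each wait in wait_for_instance_event() takes. As a rough, simplified sketch of that pattern (not the actual change; the function name and log message here are illustrative):

    import time

    import eventlet.timeout

    def wait_for_vif_plugged_event(event, timeout=300):
        # Simplified model of nova's wait_for_instance_event(): 'event' is an
        # eventlet.event.Event() that the handler for neutron's
        # network-vif-plugged external event send()s when it arrives.
        start = time.monotonic()
        with eventlet.timeout.Timeout(timeout):  # raises Timeout ("300 seconds") if late
            result = event.wait()
        print('VIF plugging completed in %d seconds' % (time.monotonic() - start))
        return result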

Logs here :

2021-11-09 14:07:07.251 9 DEBUG nova.compute.manager [req-ff47242a-ec1e-4f3e-8989-dc17b9b66702 612fb894e80a454881a7ffd6ff3565e7 3ea2f30b9a7445ccb6151d27d65e9c8b - default default] VIF plugging completed in 280 seconds wait_for_instance_event /usr/lib/python3/dist-packages/nova/compute/manager.py:490
2021-11-09 14:08:49.726 9 DEBUG nova.compute.manager [req-6f3db32a-0c79-4d70-b61f-018e5bb6047e 612fb894e80a454881a7ffd6ff3565e7 3ea2f30b9a7445ccb6151d27d65e9c8b - default default] VIF plugging completed in 8 seconds wait_for_instance_event /usr/lib/python3/dist-packages/nova/compute/manager.py:490
2021-11-09 14:08:53.898 9 DEBUG nova.compute.manager [req-10eb5d9e-e534-4f41-b7b8-019455cd7df2 612fb894e80a454881a7ffd6ff3565e7 3ea2f30b9a7445ccb6151d27d65e9c8b - default default] VIF plugging completed in 10 seconds wait_for_instance_event /usr/lib/python3/dist-packages/nova/compute/manager.py:490
2021-11-09 14:09:29.747 9 DEBUG nova.compute.manager [req-6f6b3769-46a9-4188-8db3-fdc7a1f33c5d 612fb894e80a454881a7ffd6ff3565e7 3ea2f30b9a7445ccb6151d27d65e9c8b - default default] VIF plugging completed in 46 seconds wait_for_instance_event /usr/lib/python3/dist-packages/nova/compute/manager.py:490
2021-11-09 14:09:31.846 9 DEBUG nova.compute.manager [req-dd2d6dd5-5197-4c21-8f7f-48c58b2aa397 612fb894e80a454881a7ffd6ff3565e7 3ea2f30b9a7445ccb6151d27d65e9c8b - default default] VIF plugging completed in 43 seconds wait_for_instance_event /usr/lib/python3/dist-packages/nova/compute/manager.py:490
2021-11-09 14:09:36.003 9 DEBUG nova.compute.manager [req-2e836d62-d368-45a9-bcbd-6162b126d556 612fb894e80a454881a7ffd6ff3565e7 3ea2f30b9a7445ccb6151d27d65e9c8b - default default] VIF plugging completed in 48 seconds wait_for_instance_event /usr/lib/python3/dist-packages/nova/compute/manager.py:490

Once again, this OpenStack is clean and freshly installed, with no instances running, so why does VIF plugging take anywhere from 8 seconds to 280 seconds (and potentially more)?

It looks like there is some race condition, but I don't know where to look for it.

When I bump the VIF plugging timeout in nova-compute as shown below, everything works well, but it's only a workaround.

vif_plugging_timeout = 3600
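
For clarity, this goes into the [DEFAULT] section of nova.conf on the compute nodes (the usual location for this option; 3600 is just the value used here):

    [DEFAULT]
    # Wait up to one hour for the network-vif-plugged event instead of 300 seconds.
    vif_plugging_timeout = 3600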

The last thing I noticed is that it has something to do with the security group: if the noop firewall driver is used for security groups, I can't reproduce the issue, so it's "fixed".

The testing environment uses OVS (not OVN), with the openvswitch firewall driver; the snippet below shows where that driver is selected.
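
For reference, the firewall driver is set in the OVS agent configuration, typically the [securitygroup] section of openvswitch_agent.ini (file name and section shown here as the usual defaults):

    [securitygroup]
    # With firewall_driver = noop the problem does not reproduce;
    # with the openvswitch driver it does.
    firewall_driver = openvswitch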

Thank you very much,
Michal Arbet (kekvo)

Revision history for this message
Michal Arbet (michalarbet) wrote :

It looks like the same behaviour was confirmed here: https://bugs.launchpad.net/nova/+bug/1944779

Revision history for this message
Dan Smith (danms) wrote :

Is this a Nova bug? Seems like your analysis here points to Neutron delaying the event in some configurations. Shouldn't this be a bug against Neutron?

Revision history for this message
Michal Arbet (michalarbet) wrote :

I really don't know, I'm not as skilled as you guys :). But yeah, it looks like it's a neutron bug; I filed the LP bug under the nova project because that is the one and only ERROR visible in the logs, even with debug turned on in both the nova and neutron projects.

But I hope the nova team will discuss this with the neutron team, won't you?

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Yeah, I'll raise it with the Neutron team and ask them to look at this bug.
If they say it's a Nova bug, please set the nova status back to "New".

Thanks.

Changed in nova:
status: New → Opinion
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

In order to debug this problem we'll need the Neutron server, OVS agent and DHCP agent logs. We need to see that the port is provisioned by the DHCP agent (an IP is assigned) and that the OVS agent binds the port correctly. Once this process is done, the Neutron server considers the port provisioned and sends the vif-plugged event; the sketch below illustrates that flow.
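
Roughly, the decision to send the event looks like this (a simplified sketch, loosely modelled on neutron's provisioning_blocks mechanism; the names and entities here are illustrative, not the exact API):

    # Simplified model of how the neutron server decides when to send the
    # network-vif-plugged event for a port (illustrative only).

    class Port:
        def __init__(self, port_id):
            self.id = port_id
            # Entities that still have to finish provisioning this port.
            self.pending = {'DHCP', 'L2'}

    def provisioning_complete(port, entity, notify_nova):
        # Called when the DHCP agent or the OVS (L2) agent finishes its work.
        port.pending.discard(entity)
        if not port.pending:
            # Only when both agents are done is the port considered
            # provisioned, which is when nova gets its vif-plugged event.
            notify_nova(port.id, 'network-vif-plugged')

    port = Port('69b8dd7e-a9db-44fb-ab41-832942cb9e7e')
    provisioning_complete(port, 'L2', print)    # nothing sent yet, DHCP pending
    provisioning_complete(port, 'DHCP', print)  # -> sends network-vif-plugged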

Can you provide this info, or find it in your logs?

Regards.

Revision history for this message
Michal Arbet (michalarbet) wrote :

+ Adding logs

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Ideally, we'll need a synthetic reproducer, like [0], and regression unit tests in Nova that simulate the corner cases for receiving vif-plugged events and clearly define how Nova reacts to each of them:

 1) unexpected event received within the agreed time-frame (neutron did conform to the contract... twice)
 2) nova starts waiting for the event after it gets received (no receiving handler ready yet)
    2.5) the event arrives before nova starts waiting for it, then a duplicate arrives in the expected time-frame
 3) the event gets received after nova stopped waiting for it (neutron didn't conform to the contract)
 4) nova never gets the event within the agreed time-frame (message lost)

As you can see, there are quite a few corner cases in that scenario; a rough sketch of case 2 is shown after the reference below.

[0] https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/tests/functional/regressions/test_bug_1831771.py
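
For example, case 2 is why nova registers the event receiver before triggering the work that produces the event. A minimal eventlet-based sketch of that prepare-then-wait pattern (illustrative only, not nova's actual code or tests):

    import eventlet.event
    import eventlet.timeout

    def boot_with_vif_event(plug_vif, timeout=300):
        event = eventlet.event.Event()           # receiver registered first ...
        plug_vif(event)                          # ... then the work is triggered
        with eventlet.timeout.Timeout(timeout):  # raises Timeout if the event is late
            return event.wait()

    def fake_neutron(event):
        # The event may be sent even before wait() is called; eventlet's Event
        # keeps the result, so the later wait() still returns immediately.
        event.send('network-vif-plugged')

    print(boot_with_vif_event(fake_neutron))     # -> network-vif-plugged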
