nova may leak net interface in guest if port under attaching/deleting

Bug #1934742 reported by Alexandre arents
Affects                   Status       Importance  Assigned to       Milestone
OpenStack Compute (nova)  In Progress  Medium      Alexandre arents
neutron                   New          Undecided   Unassigned

Bug Description

Description
===========

It seems that nova may leak a network interface in the guest
if a port deletion runs in the middle of a port attachment.

In the compute manager, attach_interface runs
the following tasks atomically:
-update port in neutron (binding)
-...
-driver.attach_interface()
-update net_info_cache
-...

When a bound port is deleted, nova receives a
"network-vif-deleted" event and processes it by running:
def _process_instance_vif_deleted_event()
 ....
 driver.detach_interface()

If this event is processed just after the port binding
and before driver.attach_interface() of an
ongoing interface attachment of the same port,
nova will attach the deleted, orphaned interface to the guest.

The processing of this event probably must be synchronized
with the compute manager methods attach_interface/detach_interface.
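
The race and the suggested synchronization can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions: it does not use nova code, the function and variable names are hypothetical, and the per-instance lock is one possible way to serialize the two paths, not necessarily what the proposed fix does.

```python
import contextlib
import threading


def run_scenario(serialize):
    """Return the interfaces left on the (simulated) guest after the race."""
    guest_interfaces = set()          # what the hypervisor has plugged in
    port_bound = threading.Event()    # attach has finished the port binding
    delete_done = threading.Event()   # vif-deleted event has been processed
    # Hypothetical per-instance lock; a no-op context when not serializing.
    instance_lock = threading.Lock() if serialize else contextlib.nullcontext()

    def attach_interface(port_id):
        with instance_lock:
            port_bound.set()  # step 1: port is bound in neutron
            # Pause standing in for time.sleep(8) in the reproduction steps.
            # Without the lock we wait until the delete event has been handled;
            # with the lock the handler is blocked, so we only wait briefly.
            delete_done.wait(timeout=0.2 if serialize else None)
            guest_interfaces.add(port_id)  # step 2: driver.attach_interface()

    def process_vif_deleted_event(port_id):
        port_bound.wait()  # the port is deleted only after it was bound
        with instance_lock:
            guest_interfaces.discard(port_id)  # driver.detach_interface()
        delete_done.set()

    threads = [
        threading.Thread(target=attach_interface, args=("myport",)),
        threading.Thread(target=process_vif_deleted_event, args=("myport",)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return guest_interfaces


leaked = run_scenario(serialize=False)      # detach runs before attach
cleaned = run_scenario(serialize=True)      # detach is forced to run after attach
```

Without the lock, the detach is a no-op because nothing is attached yet, and the later attach leaves the interface behind. With the lock, the event handler waits for the attachment to finish and can then clean up.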

Steps to reproduce
==================

On master devstack:

$ openstack server create --flavor m1.small --image cirros-0.5.2-x86_64-disk \
--nic net-id=private myvm
$ openstack port create --network private myport

# For ease of reproduction add a pause just before driver.attach_interface():

nova/compute/manager.py:
def attach_interface(...):
    ...
    try:
        time.sleep(8)  # artificial pause to widen the race window
        self.driver.attach_interface(context, ...)

$ sudo service devstack@n-cpu restart

$ openstack server add port myvm myport &
$ sleep 4 ; openstack port delete myport
[1]+ Exit 1 openstack server add port myvm myport
Port id 3d47bceb-34ef-4002-8e33-30957127a87f could not be found. (HTTP 404) (Request-ID: req-6c056ad3-1e61-4102-9e5e-48cdd4dffc43)

$ nova interface-list alex
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+-----+
| Port State | Port ID | Net ID | IP addresses | MAC Addr | Tag |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+-----+
| ACTIVE | 0fe9365b-5747-4532-be50-e6362b10b645 | d8f03257-d1e2-4488-bc42-0e189481a6c7 | 10.0.0.49,fde5:2b4:b028:0:f816:3eff:feb8:f14c | fa:16:3e:b8:f1:4c | - |
+------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+-----+

$ virsh domiflist instance-00000001
 Interface Type Source Model MAC
--------------------
 tap0fe9365b-57 bridge br-int virtio fa:16:3e:b8:f1:4c
 tapdcbbae72-0b bridge br-int virtio fa:16:3e:95:91:25

Expected result
===============
The interface should not be attached to the guest.

Actual result
=============
A zombie interface is attached to the guest.

Changed in nova:
assignee: nobody → Alexandre arents (aarents)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/799606

Changed in nova:
status: Triaged → In Progress
Alexandre arents (aarents) wrote :

There are still more cases where we can leak
that the first proposed fix does not cover.

Let's enumerate all the issues, A), B) and C):

A) _process_instance_vif_deleted_event can run to completion
in the middle of an interface attachment. This is what
explains the bug description, and it is fixed by
https://review.opendev.org/c/openstack/nova/+/799606 patchset 1

B) _process_instance_vif_deleted_event is called with an
instance object that does not contain the vif in its info cache.
When a port is being attached, the compute manager first does the port
binding and only afterwards updates the network info_cache in the DB.
If a neutron PORT DELETE happens between the two operations,
neutron sends an API call to nova in order to
trigger network-vif-deleted.
The issue is that nova api creates the instance object from the DB,
which does not yet contain the attached interface
(info_cache not yet updated by the compute manager).
So while processing _process_instance_vif_deleted_event
we never enter the condition:
if vif['id'] == deleted_vif_id, so we leak.
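
To make case B) concrete, here is a minimal standalone sketch of the lookup. It is not the nova implementation: the function name, the detach callback and the sample port IDs are hypothetical, and only the vif['id'] == deleted_vif_id comparison is taken from the description above.

```python
def process_vif_deleted(network_info, deleted_vif_id, detach):
    """Detach the VIF only if the cached network info still lists it."""
    for vif in network_info:
        if vif['id'] == deleted_vif_id:
            detach(vif)
            return True
    # No matching cache entry: nothing is detached.
    return False


detached = []

# Instance loaded before the compute manager updated info_cache:
# the port being attached ('port-b') is not in the cache yet.
stale_cache = [{'id': 'port-a'}]
handled_stale = process_vif_deleted(stale_cache, 'port-b', detached.append)

# Instance loaded after the info_cache update: the VIF is found.
fresh_cache = [{'id': 'port-a'}, {'id': 'port-b'}]
handled_fresh = process_vif_deleted(fresh_cache, 'port-b', detached.append)
```

With the stale cache the handler returns without calling the detach callback, which corresponds to the leak described in B).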

C) 'network-changed' or _heal_instance_info_cache() processing
(anything that runs network_api.get_instance_nw_info())
can detect and drop the vif entry from the cache before
'network-vif-deleted' runs, and we fall into issue B) again.
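
Case C) can be sketched the same way. Again this is an illustration under assumptions, not nova code: both helper names are hypothetical, and the refresh simply models "drop cache entries whose port no longer exists in neutron" as described above.

```python
def refresh_info_cache(cached_vifs, neutron_port_ids):
    """Keep only the cached VIFs whose port still exists in neutron."""
    return [vif for vif in cached_vifs if vif['id'] in neutron_port_ids]


def vif_deleted_would_detach(cached_vifs, deleted_vif_id):
    """True if the vif-deleted handler would find the VIF in the cache."""
    return any(vif['id'] == deleted_vif_id for vif in cached_vifs)


cache = [{'id': 'port-a'}, {'id': 'port-b'}]
ports_in_neutron = {'port-a'}  # 'port-b' has just been deleted in neutron

# If the event is handled first, the VIF is still in the cache.
found_before_refresh = vif_deleted_would_detach(cache, 'port-b')

# If a cache refresh wins the race, the entry is gone when the event arrives.
cache_after_refresh = refresh_info_cache(cache, ports_in_neutron)
found_after_refresh = vif_deleted_would_detach(cache_after_refresh, 'port-b')
```

Once the refresh has removed the entry, the later event finds nothing to detach, which is the same outcome as in B).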

sean mooney (sean-k-mooney) wrote :

Just capturing some of the info from our IRC conversation.

In general we do not support this use case.

From a nova perspective, deleting an attached neutron port is an end-user logic error.
It is not a supported way to detach a neutron port from an instance.

The current code we have is a minimal attempt to do the right thing and do some basic cleanup, but
you should not detach ports this way.

We could harden that basic cleanup, but it does not really change the fact that this workflow is not supported. We also do not support detaching ports by clearing the device-id and device-owner fields on a neutron port. It can trigger a similar code path, but it is not intended to work.

Even for a normal OVS port, ignoring concurrent deletes or events, we won't actually clean up everything.

We could try to harden the nova workaround as a backportable solution in the short term, but really we should be blocking this in the neutron API, so we likely need a neutron change to address this.

There are other approaches we could take rather than a block, but effectively supporting this is an RFE that requires changes in both nova and neutron. There are some mitigations we can, and probably should, implement in a backportable bug fix first.
