VIFs not always detached from ironic nodes during termination

Bug #1733861 reported by Mark Goddard on 2017-11-22
This bug affects 5 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Mark Goddard

Bug Description

Description
===========

Sometimes when a baremetal instance is terminated, some VIFs are not detached from the node. This can leave the node unusable: subsequent attempts to provision it fail during VIF attachment because there are not enough free ironic ports to attach the VIF to.

Steps to reproduce
==================

No reproduction procedure has been identified yet, but it will be something like:

* boot one baremetal instance
* do something to trigger the bug
* delete the instance
* boot a second instance on the same ironic node

Expected results
================

The second instance should boot successfully.

Actual results
==============

The second instance fails to boot, and the following error message is emitted by nova-compute:

VirtualInterfacePlugException: Cannot attach VIF 409830a5-b4de-4d1d-be22-5e6fe4ccd65b to the node 3aaaf79e-99fb-42a3-b22e-b1a7fae44272 due to error: Unable to attach VIF 409830a5-b4de-4d1d-be22-5e6fe4ccd65b, not enough free physical ports. (HTTP 400)

The neutron port has been deleted:

$ openstack port show 7e567468-53a2-4fad-8bc9-a30a0e7218a0
ResourceNotFound: No Port found for 7e567468-53a2-4fad-8bc9-a30a0e7218a0

The ironic node's VIF is still attached:

$ openstack baremetal node vif list <node>
+--------------------------------------+
| ID                                   |
+--------------------------------------+
| 7e567468-53a2-4fad-8bc9-a30a0e7218a0 |
+--------------------------------------+

Workaround
==========

The VIF can be manually detached via ironic:

$ openstack baremetal node vif detach <node> 7e567468-53a2-4fad-8bc9-a30a0e7218a0

This allows instances to be deployed on the node.

Environment
===========

RDO Pike, deployed on CentOS 7 using kayobe & kolla-ansible.

openstack-nova-api-16.0.0-1.el7.noarch

Notes
=====

I've seen this happen on a number of occasions, and have spent some time investigating a few of them. Although they all have similarities, no two have been the same, so far as I can tell.

Some things I've worked out along the way:

* the VIF detach code in ironic is very simple, and just removes the tenant_vif_port_id field from the internal_info attribute of the ironic port to which the VIF is attached. This leads me to believe that nova is *not* calling this API during instance termination.

* the nova ironic virt driver's terminate method always ends up calling _unplug_vifs, so either terminate has not been called, it has not completed successfully, or the VIF was not present in the provided network_info object. So far my investigations point to the last of these - network_info does not contain the VIF.

* there seems to be some level of raciness when deleting instances and their ports (VIFs) at similar times. The neutron vif unplugged event may not always call detach_interface[1] on the virt driver, but will remove the port from the instance info cache. This would cause the VIF to be absent from network_info during terminate.

Given that there seem to be multiple causes for this issue, one way to avoid the node becoming unusable would be to query the attached VIFs from ironic, as well as those in network_info when terminating an instance. Any unexpected VIFs could then be detached.
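
The cleanup suggested above could look roughly like the following. This is a minimal sketch, not nova code: find_unexpected_vifs and detach_unexpected_vifs are hypothetical helper names, list_vifs and detach_vif are hypothetical callables standing in for the ironic "list VIFs" and "detach VIF" API calls, and network_info is assumed to be a list of dict-like entries with an "id" key. The example reuses the node and VIF IDs from the report, with an empty network_info to mimic the case where the VIF has already been dropped from the instance info cache.

```python
def find_unexpected_vifs(attached_vif_ids, network_info):
    """Return VIF IDs attached in ironic that network_info does not know about."""
    # Assumption: each network_info entry is dict-like and exposes an "id" key.
    expected = {vif["id"] for vif in network_info}
    return [vif_id for vif_id in attached_vif_ids if vif_id not in expected]


def detach_unexpected_vifs(node_uuid, network_info, list_vifs, detach_vif):
    """Detach any VIF still attached to the node but missing from network_info.

    list_vifs(node_uuid) and detach_vif(node_uuid, vif_id) are hypothetical
    callables wrapping the corresponding ironic API operations.
    """
    stale = find_unexpected_vifs(list_vifs(node_uuid), network_info)
    for vif_id in stale:
        detach_vif(node_uuid, vif_id)
    return stale


# In-memory stand-in for ironic, using the IDs from this report.
NODE = "3aaaf79e-99fb-42a3-b22e-b1a7fae44272"
STALE_VIF = "7e567468-53a2-4fad-8bc9-a30a0e7218a0"
attached = {NODE: [STALE_VIF]}


def fake_list_vifs(node_uuid):
    # Return a copy so detaching does not mutate the list being iterated.
    return list(attached[node_uuid])


def fake_detach_vif(node_uuid, vif_id):
    attached[node_uuid].remove(vif_id)


# network_info is empty: the VIF was already removed from the info cache.
detached = detach_unexpected_vifs(NODE, [], fake_list_vifs, fake_detach_vif)
print("detached:", detached)
print("still attached:", attached[NODE])
```

VIFs that do appear in network_info are left to the normal unplug path; only the leftovers are detached here, so the sketch is a safety net rather than a fix for the underlying race.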

References
==========

[1] https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L1481

Darryl Weaver (dweaver) on 2017-11-22
Changed in nova:
status: New → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/537626

Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: Confirmed → In Progress

Change abandoned by Mark Goddard on branch: master
Review: https://review.openstack.org/537626
Reason: I'm inclined to abandon this one. As various people have pointed out, it's a bit of a hack and should no longer be required. If we see cases where it would be useful, then there is probably a bug somewhere that needs addressing.
