Baremetal instance stuck in BUILD state following ironic node tear down or delete

Bug #1732506 reported by Mark Goddard
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========

A baremetal (ironic) instance can become stuck in the BUILD state if the ironic node to which the instance has been assigned is either deleted or torn down manually while the instance is being built.

Steps to reproduce
==================

* Create a nova instance that will be scheduled onto baremetal.
* Determine to which node the instance has been scheduled via 'openstack baremetal node show --instance <instance UUID>'
* Wait for the ironic node to enter the 'wait call-back' state.
* Tear down the node manually via 'openstack baremetal node undeploy <node>'
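
A scripted version of the same steps, driving the CLI commands above via subprocess (the flavor, image, network and server names are placeholders for whatever maps to baremetal in your cloud):

import json
import subprocess
import time


def osc_json(*args):
    # Run an openstack CLI command and parse its JSON output.
    out = subprocess.check_output(('openstack',) + args + ('-f', 'json'))
    return json.loads(out)


# Boot an instance that will be scheduled onto baremetal.
server = osc_json('server', 'create', '--flavor', 'baremetal',
                  '--image', 'centos7', '--network', 'provision', 'bm-test')

# Determine which ironic node the instance was scheduled to; retry until
# the scheduler has picked one.
while True:
    try:
        node = osc_json('baremetal', 'node', 'show',
                        '--instance', server['id'])
        break
    except subprocess.CalledProcessError:
        time.sleep(10)

# Wait for the node to reach the 'wait call-back' provision state.
while node['provision_state'] != 'wait call-back':
    time.sleep(10)
    node = osc_json('baremetal', 'node', 'show', '--instance', server['id'])

# Tear the node down out-of-band while nova is still building the instance.
subprocess.check_call(
    ['openstack', 'baremetal', 'node', 'undeploy', node['uuid']])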

Expected results
================

The ironic node becomes 'available'. The nova instance detects the change in ironic, cleans up, and moves to an ERROR state.

Actual results
==============

The ironic node becomes 'available'. The nova instance detects the change in ironic, cleans up the instance's networks, and stays in the BUILD state.

Environment
===========

Pike, deployed using kolla-ansible on CentOS host with RDO packages in CentOS containers.

openstack-nova-api-16.0.0-1.el7.noarch

Thoughts
========

I believe this is happening because the nova ironic virt driver raises InstanceNotFound [1][2] when the ironic node is deleted or torn down. The nova compute manager [3] interprets this as meaning the Nova instance was deleted, and therefore does not change the instance's state as there should be no instance to change.

[1] https://github.com/openstack/nova/blob/2aa5fb3385c5c15259e0c749c46371462789dc6d/nova/virt/ironic/driver.py#L188
[2] https://github.com/openstack/nova/blob/2aa5fb3385c5c15259e0c749c46371462789dc6d/nova/virt/ironic/driver.py#L490
[3] https://github.com/openstack/nova/blob/2aa5fb3385c5c15259e0c749c46371462789dc6d/nova/compute/manager.py#L1901
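
To make the failure mode concrete, here is a stand-in sketch of that control flow (not the actual nova code; the classes below only mimic the shape of [1]-[3]):

class InstanceNotFound(Exception):
    # Stand-in for nova.exception.InstanceNotFound.
    pass


class IronicDriverSketch:
    def __init__(self, nodes_by_instance):
        # Map of nova instance UUID -> ironic node; the entry disappears
        # when the node is deleted or torn down out-of-band.
        self.nodes_by_instance = nodes_by_instance

    def validate_instance_and_node(self, instance_uuid):
        # [1]/[2]: the backing node is gone, so the driver raises
        # InstanceNotFound even though the *nova* instance still exists.
        node = self.nodes_by_instance.get(instance_uuid)
        if node is None:
            raise InstanceNotFound(instance_uuid)
        return node


class ComputeManagerSketch:
    def build_instance(self, instance, driver):
        try:
            driver.validate_instance_and_node(instance['uuid'])
            instance['vm_state'] = 'active'
        except InstanceNotFound:
            # [3]: treated as "the nova instance was deleted", so nothing
            # updates the instance record and it stays in BUILD.
            pass


driver = IronicDriverSketch(nodes_by_instance={})   # node already torn down
instance = {'uuid': 'abc', 'vm_state': 'building'}
ComputeManagerSketch().build_instance(instance, driver)
assert instance['vm_state'] == 'building'           # stuck in BUILD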

Tags: ironic
John Garbutt (johngarbutt) wrote :

It does seem like [2] should really be raising a build failure, like the previous step.

The problem is distinguishing between deleting an instance via the API (where we need to make sure the instance correctly stays deleted) and failing because something else performed the delete during a build.

Either way, being stuck in the BUILD state is the worst possible outcome.

Mark Goddard (mgoddard) wrote :

Agreed, we definitely want to still allow an asynchronous instance delete during build. I think that is the scenario that InstanceNotFound was meant to cover, but it has been incorrectly overloaded by the ironic virt driver.

Ruby Loo (rloo)
tags: added: ironic
Mark Goddard (mgoddard) wrote :

Oddly, the main exception raised in the nova ironic virt driver when things go wrong is InstanceDeployFailure, but this is not referenced anywhere outside that driver except for the class definition in exception.py.

melanie witt (melwitt) wrote :

Reading through this, this bug looks valid to me. Setting to Medium based on the fact that a user has to delete an ironic node out-of-band from nova while the nova instance is building in order to encounter the bug.

The analysis in the bug report makes sense, that the ironic driver is raising InstanceNotFound when calling ironic after the baremetal node was deleted out-of-band of nova. Nova treats it as "nova instance not found" and thus thinks there's nothing to do with the instance state.

I do wonder if it would be correct for the ironic driver to instead raise InstanceDeployFailure (or another new exception such as IronicNodeNotFound) if an ironic node GET call returns 404. I can't think of a reason the ironic driver should raise InstanceNotFound unless it has deleted the nova instance itself.
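
Roughly something like this (stand-in names, not a patch against the driver):

class NodeNotFound(Exception):
    # Stand-in for the 404 the ironic client raises for a missing node.
    pass


class InstanceDeployFailure(Exception):
    # Stand-in for nova.exception.InstanceDeployFailure (or a new
    # IronicNodeNotFound exception).
    pass


def validate_instance_and_node(get_node_by_instance_uuid, instance_uuid):
    try:
        return get_node_by_instance_uuid(instance_uuid)
    except NodeNotFound:
        # Proposed behaviour: a vanished ironic node means the deploy
        # failed, not that the nova instance was deleted, so the compute
        # manager can reschedule or set the instance to ERROR.
        raise InstanceDeployFailure(
            'Ironic node for instance %s not found' % instance_uuid)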

This idea is based on looking at how we handle a delete via the nova API while a baremetal instance is building. While the instance is building in the driver spawn method, the _wait_for_active [1] method is looping. If a user requests a delete of the instance, the driver loop during build will see the task_state == DELETING and will raise InstanceDeployFailure as a result.

Then, the compute manager doesn't handle the InstanceDeployFailure exception [2] and will raise the RescheduledException. The RescheduledException will be caught [3] and when retries are exceeded, the networks/volumes will be cleaned up and the instance set to ERROR state.

[1] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/virt/ironic/driver.py#L466-L469
[2] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/compute/manager.py#L2199
[3] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/compute/manager.py#L1932
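
For comparison, a stand-in sketch of that delete-during-build flow (not the real nova code, just the shape of [1]-[3]):

class InstanceDeployFailure(Exception):
    pass


class RescheduledException(Exception):
    pass


def wait_for_active(instance):
    # [1]: the driver's build loop notices the user-requested delete and
    # aborts the deploy.
    if instance['task_state'] == 'deleting':
        raise InstanceDeployFailure(
            'Instance %s deleted during deploy' % instance['uuid'])


def build_and_run_instance(instance):
    try:
        wait_for_active(instance)
    except InstanceDeployFailure:
        # [2]: the unhandled driver exception becomes a reschedule request.
        raise RescheduledException()


def build_instance(instance, retries_exceeded=True):
    try:
        build_and_run_instance(instance)
    except RescheduledException:
        # [3]: once retries are exhausted, networks/volumes are cleaned up
        # and the instance goes to ERROR, which is the outcome we would
        # also want for the out-of-band node teardown in this bug.
        if retries_exceeded:
            instance['vm_state'] = 'error'


instance = {'uuid': 'abc', 'task_state': 'deleting', 'vm_state': 'building'}
build_instance(instance)
assert instance['vm_state'] == 'error'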

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed