Baremetal instance stuck in BUILD state following ironic node tear down or delete

Bug #1732506 reported by Mark Goddard
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========

A baremetal (ironic) instance can become stuck in the BUILD state if the ironic node to which the instance has been assigned is either deleted or torn down manually while the instance is being built.

Steps to reproduce
==================

* Create a nova instance that will be scheduled onto baremetal.
* Determine to which node the instance has been scheduled via 'openstack baremetal node show --instance <instance UUID>'
* Wait for the ironic node to enter the 'wait call-back' state.
* Tear down the node manually via 'openstack baremetal node undeploy <node>'
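
A scripted version of the same steps, driving the CLI commands above via subprocess (the flavor, image, network and server names are placeholders for whatever maps to baremetal in your cloud):

import json
import subprocess
import time


def osc_json(*args):
    # Run an openstack CLI command and parse its JSON output.
    out = subprocess.check_output(('openstack',) + args + ('-f', 'json'))
    return json.loads(out)


# Boot an instance that will be scheduled onto baremetal.
server = osc_json('server', 'create', '--flavor', 'baremetal',
                  '--image', 'centos7', '--network', 'provision', 'bm-test')

# Determine which ironic node the instance was scheduled to; retry until
# the scheduler has picked one.
while True:
    try:
        node = osc_json('baremetal', 'node', 'show',
                        '--instance', server['id'])
        break
    except subprocess.CalledProcessError:
        time.sleep(10)

# Wait for the node to reach the 'wait call-back' provision state.
while node['provision_state'] != 'wait call-back':
    time.sleep(10)
    node = osc_json('baremetal', 'node', 'show', '--instance', server['id'])

# Tear the node down out-of-band while nova is still building the instance.
subprocess.check_call(
    ['openstack', 'baremetal', 'node', 'undeploy', node['uuid']])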

Expected results
================

The ironic node becomes 'available'. The nova instance detects the change in ironic, cleans up, and moves to an ERROR state.

Actual results
==============

The ironic node becomes 'available'. The nova instance detects the change in ironic, cleans up the instance's networks, and stays in the BUILD state.

Environment
===========

Pike, deployed using kolla-ansible on CentOS host with RDO packages in CentOS containers.

openstack-nova-api-16.0.0-1.el7.noarch

Thoughts
========

I believe this is happening because the nova ironic virt driver raises InstanceNotFound [1][2] when the ironic node is deleted or torn down. The nova compute manager [3] interprets this as meaning the Nova instance was deleted, and therefore does not change the instance's state as there should be no instance to change.

[1] https://github.com/openstack/nova/blob/2aa5fb3385c5c15259e0c749c46371462789dc6d/nova/virt/ironic/driver.py#L188
[2] https://github.com/openstack/nova/blob/2aa5fb3385c5c15259e0c749c46371462789dc6d/nova/virt/ironic/driver.py#L490
[3] https://github.com/openstack/nova/blob/2aa5fb3385c5c15259e0c749c46371462789dc6d/nova/compute/manager.py#L1901
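
To make the failure mode concrete, here is a stand-in sketch of that control flow (not the actual nova code; the classes below only mimic the shape of [1]-[3]):

class InstanceNotFound(Exception):
    # Stand-in for nova.exception.InstanceNotFound.
    pass


class IronicDriverSketch:
    def __init__(self, nodes_by_instance):
        # Map of nova instance UUID -> ironic node; the entry disappears
        # when the node is deleted or torn down out-of-band.
        self.nodes_by_instance = nodes_by_instance

    def validate_instance_and_node(self, instance_uuid):
        # [1]/[2]: the backing node is gone, so the driver raises
        # InstanceNotFound even though the *nova* instance still exists.
        node = self.nodes_by_instance.get(instance_uuid)
        if node is None:
            raise InstanceNotFound(instance_uuid)
        return node


class ComputeManagerSketch:
    def build_instance(self, instance, driver):
        try:
            driver.validate_instance_and_node(instance['uuid'])
            instance['vm_state'] = 'active'
        except InstanceNotFound:
            # [3]: treated as "the nova instance was deleted", so nothing
            # updates the instance record and it stays in BUILD.
            pass


driver = IronicDriverSketch(nodes_by_instance={})   # node already torn down
instance = {'uuid': 'abc', 'vm_state': 'building'}
ComputeManagerSketch().build_instance(instance, driver)
assert instance['vm_state'] == 'building'           # stuck in BUILD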

Tags: ironic
John Garbutt (johngarbutt) wrote :

It does seem like [2] should really be raising a build failure, like the previous step.

The problem is distinguishing between deleting an instance via the API (where we need to make sure the instance correctly stays deleted) and failing because something else performed the delete during a build.

Either way, being stuck in the BUILD state is the worst possible outcome.

Mark Goddard (mgoddard) wrote :

Agreed, we definitely want to still allow an asynchronous instance delete during build. I think that is the scenario that InstanceNotFound was meant to cover, but it has been incorrectly overloaded by the ironic virt driver.

Ruby Loo (rloo)
tags: added: ironic
Mark Goddard (mgoddard) wrote :

Oddly, the main exception raised in the nova ironic virt driver when things go wrong is InstanceDeployFailure, but this is not referenced anywhere outside that driver except for the class definition in exception.py.

melanie witt (melwitt) wrote :

Reading through this, this bug looks valid to me. Setting to Medium based on the fact that a user has to delete an ironic node out-of-band from nova while the nova instance is building in order to encounter the bug.

The analysis in the bug report makes sense, that the ironic driver is raising InstanceNotFound when calling ironic after the baremetal node was deleted out-of-band of nova. Nova treats it as "nova instance not found" and thus thinks there's nothing to do with the instance state.

I do wonder if it would be correct for the ironic driver to instead raise InstanceDeployFailure (or another new exception such as IronicNodeNotFound) if an ironic node GET call returns 404. I can't think of a reason the ironic driver should raise InstanceNotFound unless it has deleted the nova instance itself.
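
Roughly something like this (stand-in names, not a patch against the driver):

class NodeNotFound(Exception):
    # Stand-in for the 404 the ironic client raises for a missing node.
    pass


class InstanceDeployFailure(Exception):
    # Stand-in for nova.exception.InstanceDeployFailure (or a new
    # IronicNodeNotFound exception).
    pass


def validate_instance_and_node(get_node_by_instance_uuid, instance_uuid):
    try:
        return get_node_by_instance_uuid(instance_uuid)
    except NodeNotFound:
        # Proposed behaviour: a vanished ironic node means the deploy
        # failed, not that the nova instance was deleted, so the compute
        # manager can reschedule or set the instance to ERROR.
        raise InstanceDeployFailure(
            'Ironic node for instance %s not found' % instance_uuid)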

This idea is based on looking at how we handle a delete via the nova API while a baremetal instance is building. While the instance is building in the driver spawn method, the _wait_for_active [1] method is looping. If a user requests a delete of the instance, the driver loop during build will see the task_state == DELETING and will raise InstanceDeployFailure as a result.

Then, the compute manager doesn't handle the InstanceDeployFailure exception [2] and will raise the RescheduledException. The RescheduledException will be caught [3] and when retries are exceeded, the networks/volumes will be cleaned up and the instance set to ERROR state.

[1] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/virt/ironic/driver.py#L466-L469
[2] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/compute/manager.py#L2199
[3] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/compute/manager.py#L1932
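
For comparison, a stand-in sketch of that delete-during-build flow (not the real nova code, just the shape of [1]-[3]):

class InstanceDeployFailure(Exception):
    pass


class RescheduledException(Exception):
    pass


def wait_for_active(instance):
    # [1]: the driver's build loop notices the user-requested delete and
    # aborts the deploy.
    if instance['task_state'] == 'deleting':
        raise InstanceDeployFailure(
            'Instance %s deleted during deploy' % instance['uuid'])


def build_and_run_instance(instance):
    try:
        wait_for_active(instance)
    except InstanceDeployFailure:
        # [2]: the unhandled driver exception becomes a reschedule request.
        raise RescheduledException()


def build_instance(instance, retries_exceeded=True):
    try:
        build_and_run_instance(instance)
    except RescheduledException:
        # [3]: once retries are exhausted, networks/volumes are cleaned up
        # and the instance goes to ERROR, which is the outcome we would
        # also want for the out-of-band node teardown in this bug.
        if retries_exceeded:
            instance['vm_state'] = 'error'


instance = {'uuid': 'abc', 'task_state': 'deleting', 'vm_state': 'building'}
build_instance(instance)
assert instance['vm_state'] == 'error'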

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed