If instance spawn fails and shutdown_instance also fails, a new exception is raised, masking original spawn failure

Bug #1506242 reported by Shraddha Pandhe
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Shraddha Pandhe

Bug Description

When nova-compute is building and running an instance, it calls spawn on the virt driver, and spawn can fail for several reasons.

For example, with Ironic the spawn call can fail if the deploy callback times out.

If this call fails, nova-compute catches the exception, saves it for re-raising, and calls shutdown_instance in a try-except block [1]. The problem is that if this shutdown_instance call also fails, a new exception, BuildAbortException, is raised, which masks the original spawn failure.

This can cause problems for Ironic: if the deployment failed due to a timeout, there is a good chance that shutdown_instance will also fail for the same reason, since it involves zapping, etc. As a result, the original deployment failure will not be propagated back as the instance fault.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2171-L2191
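As an aside, here is a minimal, self-contained sketch of the masking pattern described above. The class and function names are placeholders, not Nova code; in the real code path, excutils.save_and_reraise_exception with ctxt.reraise = False is what suppresses the original error.

class DeployTimeout(Exception):
    pass

class BuildAbortException(Exception):
    pass

def spawn():
    # Stands in for the virt driver spawn() that fails, e.g. on an Ironic
    # deploy callback timeout.
    raise DeployTimeout('Timeout reached while waiting for callback')

def shutdown_instance():
    # Stands in for the cleanup that fails for the same underlying reason.
    raise RuntimeError('cleanup (zapping) failed')

try:
    spawn()
except Exception as spawn_exc:
    # nova-compute saves the spawn error for re-raising, then tries to
    # clean up; if the cleanup also fails, a brand new exception is raised
    # and spawn_exc, the error that actually broke the build, is lost.
    try:
        shutdown_instance()
    except Exception:
        raise BuildAbortException('Could not clean up failed build,'
                                  ' not rescheduling')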

Tags: compute
tags: added: compute
Revision history for this message
Mark Doffman (mjdoffma) wrote :

Could I get some more info about what the issue is with the exception not being re-raised properly?

What Exception is being masked? What issues is this causing in Ironic?

My reading of the code is:

_build_resources is called by _build_and_run_instance.

_build_and_run_instance deals with exception handling.

Some of these exceptions are then translated to a RescheduledException, which will retry the instance creation; others to a BuildAbortException.

Are you saying that the creation of the BuildAbortException is masking one of the exceptions that should have been translated to a retry?
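For context, a simplified, self-contained sketch of the flow described in the comment above; the names stand in for _do_build_and_run_instance, _build_and_run_instance, and the Nova exception classes, and are not the actual implementation.

class RescheduledException(Exception):
    pass

class BuildAbortException(Exception):
    pass

def build_and_run_instance():
    # Stands in for _build_and_run_instance, which translates lower-level
    # build failures into one of the two exception types above.
    raise BuildAbortException('Could not clean up failed build')

def do_build_and_run_instance():
    # Stands in for _do_build_and_run_instance, which decides whether to
    # retry or give up based on which exception type it receives.
    try:
        build_and_run_instance()
    except RescheduledException as exc:
        print('rescheduling the build: %s' % exc)
    except BuildAbortException as exc:
        print('aborting the build and recording the fault: %s' % exc)

do_build_and_run_instance()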

Revision history for this message
melanie witt (melwitt) wrote :

The linked code block:

                try:
                    self._shutdown_instance(context, instance,
                            block_device_mapping, requested_networks,
                            try_deallocate_networks=deallocate_networks)
                except Exception:
                    ctxt.reraise = False
                    msg = _('Could not clean up failed build,'
                            ' not rescheduling')
                    raise exception.BuildAbortException(
                            instance_uuid=instance.uuid, reason=msg)

sets ctxt.reraise = False and writes a new exception message, so the original exception won't be reraised by the context manager. I think maybe what we could do here is include both exception messages inside the BuildAbortException message.
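An illustrative, self-contained sketch of that idea follows; this is not the merged fix, and the helper name combined_abort and the local exception class are placeholders.

class BuildAbortException(Exception):
    pass

def combined_abort(spawn_exc, cleanup_exc):
    # Keep both failures visible in one message instead of letting the
    # cleanup failure replace the spawn failure entirely.
    msg = ('Could not clean up failed build, not rescheduling. '
           'Spawn error: %s; cleanup error: %s' % (spawn_exc, cleanup_exc))
    return BuildAbortException(msg)

print(combined_abort(ValueError('Timeout reached while waiting for callback'),
                     RuntimeError('cleanup failed')))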

Changed in nova:
importance: Undecided → Low
status: New → Confirmed
Changed in nova:
assignee: nobody → Pushkar Umaranikar (pushkar-umaranikar)
Revision history for this message
Shraddha Pandhe (shraddha-pandhe) wrote :

Hi Pushkar,

Sorry for taking this bug over; I didn't realize that I had forgotten to assign it to myself earlier. I have already fixed this internally at Yahoo, so I am aware of the fix.

Changed in nova:
assignee: Pushkar Umaranikar (pushkar-umaranikar) → Shraddha Pandhe (shraddha-pandhe)
Revision history for this message
Shraddha Pandhe (shraddha-pandhe) wrote :

Mark, I'll give an example:

Case 1: spawn fails, shutdown_instance is successful

In this scenario, using Ironic, I simulated a boot failure by setting deploy_callback_timeout to a very low value. The deployment failed, which triggered shutdown_instance, but shutdown_instance did not fail. In this case, in nova show I see:

| fault | {"message": "Build of instance e2f95d21-b4fc-45e0-a122-7b44d295ad2c was re-scheduled: Failed to provision instance e2f95d21-b4fc-45e0-a122-7b44d295ad2c: Timeout reached while waiting for callback for node 96b3a7ff-df46-4211-94c3-a5570a0881aa", "code": 500, "details": " File \"/nova/compute/manager.py\", line 2062, in _do_build_and_run_instance |

Case 2: spawn fails, shutdown_instance also fails

To simulate the spawn failure, I used the same method as before. To simulate a shutdown_instance failure, I added a raise statement:

        try:
            yield resources
        except Exception as exc:
            with excutils.save_and_reraise_exception() as ctxt:
                if not isinstance(exc, (exception.InstanceNotFound,
                    exception.UnexpectedDeletingTaskStateError)):
                        LOG.exception(_LE('Instance failed to spawn'),
                                instance=instance)
                # Make sure the async call finishes
                if network_info is not None:
                    network_info.wait(do_raise=False)
                try:
                    raise ValueError('FAKE')  # <<<
                    self._shutdown_instance(context, instance,
                            block_device_mapping, requested_networks,
                            try_deallocate_networks=False)
                except Exception:
                    ctxt.reraise = False
                    msg = _('Could not clean up failed build,'
                            ' not rescheduling')
                    raise exception.BuildAbortException(
                            instance_uuid=instance.uuid, reason=msg)

In this case, nova show output shows

| fault | {"message": "Build of instance 73f15dd0-3383-4fcd-bad9-90acc188806c aborted: Could not clean up failed build, not rescheduling", "code": 500, "details": " File \"/nova/compute/manager.py\", line 2062, in _do_build_and_run_instance |
| | filter_properties)

Hence, the original timeout exception is masked.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/240347

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Shraddha Pandhe (shraddha-pandhe) wrote : Re: If instance spawn fails and shutdown_instance also fails, a new excpetion is raised, masking original spawn failure

@Melanie, @Mark,

The simplest fix would be to raise BuildAbortException with the original spawn failure, and simply log the 'Could not clean up failed build' line. Please see the code snippet [1].

Three reasons for doing this:

1. The only reason the call landed there is that spawn failed, so knowing that error is more important than anything else.
2. IMHO, "Could not clean up failed build" does not give any useful information about the failure. It's probably OK to log it, but not very useful to have it as the instance fault.
3. Combining all of the messages is not possible because of the character limit on the "message" field.

[1]
        try:
            yield resources
        except Exception as exc:
            with excutils.save_and_reraise_exception() as ctxt:
                if not isinstance(exc, (exception.InstanceNotFound,
                    exception.UnexpectedDeletingTaskStateError)):
                        LOG.exception(_LE('Instance failed to spawn'),
                                instance=instance)
                # Make sure the async call finishes
                if network_info is not None:
                    network_info.wait(do_raise=False)
                # if network_info is empty we're likely here because of
                # network allocation failure. Since nothing can be reused on
                # rescheduling it's better to deallocate network to eliminate
                # the chance of orphaned ports in neutron
                deallocate_networks = False if network_info else True
                try:
                    self._shutdown_instance(context, instance,
                            block_device_mapping, requested_networks,
                            try_deallocate_networks=deallocate_networks)
                except Exception:
                    ctxt.reraise = False
                    LOG.exception(_LE('Could not clean up failed build,'  # <<<<
                                      ' not rescheduling'),  # <<<
                                  instance=instance)  # <<<
                    raise exception.BuildAbortException(
                            instance_uuid=instance.uuid, reason=exc)  # <<<

Matt Riedemann (mriedem)
summary: If instance spawn fails and shutdown_instance also fails, a new
- excpetion is raised, masking original spawn failure
+ exception is raised, masking original spawn failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/240347
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03e513b36e8d48726a90d592917fd780f1b36a3d
Submitter: Jenkins
Branch: master

commit 03e513b36e8d48726a90d592917fd780f1b36a3d
Author: Shraddha Pandhe <email address hidden>
Date: Thu Oct 29 22:38:20 2015 +0000

    Do not mask original spawn failure if shutdown_instance fails

    Change-Id: Id5e2bd556f00555c203c51d84784fe8db9fdcca2
    Closes-Bug: #1506242

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b1

This issue was fixed in the openstack/nova 13.0.0.0b1 development milestone.

Changed in nova:
status: Fix Committed → Fix Released