If instance spawn fails and shutdown_instance also fails, a new exception is raised, masking original spawn failure

Bug #1506242 reported by Shraddha Pandhe
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Shraddha Pandhe

Bug Description

When nova-compute is building and running an instance, it calls spawn on the virt driver, and spawn can fail for several reasons.

For example, with Ironic the spawn call can fail if the deploy callback times out.

If this call fails, nova-compute catches the exception, saves it for re-raising, and calls shutdown_instance in a try-except block [1]. The problem is that if this shutdown_instance call also fails, a new exception, BuildAbortException, is raised, which masks the original spawn failure.

This can cause problems for Ironic: if the deployment failed due to a timeout, there is a good chance that shutdown_instance will also fail for the same reason, since it involves zapping, etc. As a result, the original deployment failure will not be propagated back as the instance fault.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2171-L2191
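As an aside, here is a minimal, self-contained sketch of the masking pattern described above. The class and function names are placeholders, not Nova code; in the real code path, excutils.save_and_reraise_exception with ctxt.reraise = False is what suppresses the original error.

class DeployTimeout(Exception):
    pass

class BuildAbortException(Exception):
    pass

def spawn():
    # Stands in for the virt driver spawn() that fails, e.g. on an Ironic
    # deploy callback timeout.
    raise DeployTimeout('Timeout reached while waiting for callback')

def shutdown_instance():
    # Stands in for the cleanup that fails for the same underlying reason.
    raise RuntimeError('cleanup (zapping) failed')

try:
    spawn()
except Exception as spawn_exc:
    # nova-compute saves the spawn error for re-raising, then tries to
    # clean up; if the cleanup also fails, a brand new exception is raised
    # and spawn_exc, the error that actually broke the build, is lost.
    try:
        shutdown_instance()
    except Exception:
        raise BuildAbortException('Could not clean up failed build,'
                                  ' not rescheduling')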

Tags: compute
tags: added: compute
Revision history for this message
Mark Doffman (mjdoffma) wrote :

Could I get some more info about what the issue is with the exception not being re-raised properly?

What Exception is being masked? What issues is this causing in Ironic?

My reading of the code is:

_build_resources is called by _build_and_run_instance.

_build_and_run_instance deals with exception handling.

Some of these exceptions are then translated to a RescheduledException, which will retry the instance creation; others to a BuildAbortException.

Are you saying that the creation of the BuildAbortException is masking one of the exceptions that should have been translated to a retry?
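For context, a simplified, self-contained sketch of the flow described in the comment above; the names stand in for _do_build_and_run_instance, _build_and_run_instance, and the Nova exception classes, and are not the actual implementation.

class RescheduledException(Exception):
    pass

class BuildAbortException(Exception):
    pass

def build_and_run_instance():
    # Stands in for _build_and_run_instance, which translates lower-level
    # build failures into one of the two exception types above.
    raise BuildAbortException('Could not clean up failed build')

def do_build_and_run_instance():
    # Stands in for _do_build_and_run_instance, which decides whether to
    # retry or give up based on which exception type it receives.
    try:
        build_and_run_instance()
    except RescheduledException as exc:
        print('rescheduling the build: %s' % exc)
    except BuildAbortException as exc:
        print('aborting the build and recording the fault: %s' % exc)

do_build_and_run_instance()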

Revision history for this message
melanie witt (melwitt) wrote :

The linked code block:

                try:
                    self._shutdown_instance(context, instance,
                            block_device_mapping, requested_networks,
                            try_deallocate_networks=deallocate_networks)
                except Exception:
                    ctxt.reraise = False
                    msg = _('Could not clean up failed build,'
                            ' not rescheduling')
                    raise exception.BuildAbortException(
                            instance_uuid=instance.uuid, reason=msg)

sets ctxt.reraise = False and writes a new exception message, so the original exception won't be reraised by the context manager. I think maybe what we could do here is include both exception messages inside the BuildAbortException message.
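An illustrative, self-contained sketch of that idea follows; this is not the merged fix, and the helper name combined_abort and the local exception class are placeholders.

class BuildAbortException(Exception):
    pass

def combined_abort(spawn_exc, cleanup_exc):
    # Keep both failures visible in one message instead of letting the
    # cleanup failure replace the spawn failure entirely.
    msg = ('Could not clean up failed build, not rescheduling. '
           'Spawn error: %s; cleanup error: %s' % (spawn_exc, cleanup_exc))
    return BuildAbortException(msg)

print(combined_abort(ValueError('Timeout reached while waiting for callback'),
                     RuntimeError('cleanup failed')))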

Changed in nova:
importance: Undecided → Low
status: New → Confirmed
Changed in nova:
assignee: nobody → Pushkar Umaranikar (pushkar-umaranikar)
Revision history for this message
Shraddha Pandhe (shraddha-pandhe) wrote :

Hi Pushkar,

Sorry for taking this bug over; I didn't realize that I had forgotten to assign it to myself earlier. I have already fixed this internally at Yahoo, so I am aware of the fix.

Changed in nova:
assignee: Pushkar Umaranikar (pushkar-umaranikar) → Shraddha Pandhe (shraddha-pandhe)
Revision history for this message
Shraddha Pandhe (shraddha-pandhe) wrote :

Mark, I'll give an example:

Case 1: spawn fails, shutdown_instance is successful

In this scenario, using Ironic, I simulated a boot failure by setting deploy_callback_timeout to a very low value. The deployment failed, which triggered shutdown_instance, but shutdown_instance did not fail. In this case, in nova show I see:

| fault | {"message": "Build of instance e2f95d21-b4fc-45e0-a122-7b44d295ad2c was re-scheduled: Failed to provision instance e2f95d21-b4fc-45e0-a122-7b44d295ad2c: Timeout reached while waiting for callback for node 96b3a7ff-df46-4211-94c3-a5570a0881aa", "code": 500, "details": " File \"/nova/compute/manager.py\", line 2062, in _do_build_and_run_instance |

Case 2: spawn fails, shutdown_instance also fails

To simulate the spawn failure, I used the same method as before. To simulate a shutdown_instance failure, I added a raise statement:

        try:
            yield resources
        except Exception as exc:
            with excutils.save_and_reraise_exception() as ctxt:
                if not isinstance(exc, (exception.InstanceNotFound,
                    exception.UnexpectedDeletingTaskStateError)):
                        LOG.exception(_LE('Instance failed to spawn'),
                                instance=instance)
                # Make sure the async call finishes
                if network_info is not None:
                    network_info.wait(do_raise=False)
                try:
                    raise ValueError('FAKE')  # <<<
                    self._shutdown_instance(context, instance,
                            block_device_mapping, requested_networks,
                            try_deallocate_networks=False)
                except Exception:
                    ctxt.reraise = False
                    msg = _('Could not clean up failed build,'
                            ' not rescheduling')
                    raise exception.BuildAbortException(
                            instance_uuid=instance.uuid, reason=msg)

In this case, nova show output shows

| fault | {"message": "Build of instance 73f15dd0-3383-4fcd-bad9-90acc188806c aborted: Could not clean up failed build, not rescheduling", "code": 500, "details": " File \"/nova/compute/manager.py\", line 2062, in _do_build_and_run_instance |
| | filter_properties)

Hence, the original timeout exception is masked.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/240347

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Shraddha Pandhe (shraddha-pandhe) wrote : Re: If instance spawn fails and shutdown_instance also fails, a new excpetion is raised, masking original spawn failure

@Melanie, @Mark,

The simplest fix would be to raise BuildAbortException with the original spawn failure, and simply log the 'Could not clean up failed build' line. Please see the code snippet [1].

Three reasons for doing this:

1. The only reason the call landed there is that spawn failed, so knowing that error is more important than anything else.
2. IMHO, "Could not clean up failed build" does not give any useful information about the failure. It's probably OK to log it, but not very useful to have it as the instance fault.
3. Combining all of the messages is not possible because of the character limit on the "message" field.

[1]
        try:
            yield resources
        except Exception as exc:
            with excutils.save_and_reraise_exception() as ctxt:
                if not isinstance(exc, (exception.InstanceNotFound,
                    exception.UnexpectedDeletingTaskStateError)):
                        LOG.exception(_LE('Instance failed to spawn'),
                                instance=instance)
                # Make sure the async call finishes
                if network_info is not None:
                    network_info.wait(do_raise=False)
                # if network_info is empty we're likely here because of
                # network allocation failure. Since nothing can be reused on
                # rescheduling it's better to deallocate network to eliminate
                # the chance of orphaned ports in neutron
                deallocate_networks = False if network_info else True
                try:
                    self._shutdown_instance(context, instance,
                            block_device_mapping, requested_networks,
                            try_deallocate_networks=deallocate_networks)
                except Exception:
                    ctxt.reraise = False
                    LOG.exception(_LE('Could not clean up failed build,'  # <<<<
                                      ' not rescheduling'),  # <<<
                                  instance=instance)  # <<<
                    raise exception.BuildAbortException(
                            instance_uuid=instance.uuid, reason=exc)  # <<<

Matt Riedemann (mriedem)
summary: If instance spawn fails and shutdown_instance also fails, a new
- excpetion is raised, masking original spawn failure
+ exception is raised, masking original spawn failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/240347
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03e513b36e8d48726a90d592917fd780f1b36a3d
Submitter: Jenkins
Branch: master

commit 03e513b36e8d48726a90d592917fd780f1b36a3d
Author: Shraddha Pandhe <email address hidden>
Date: Thu Oct 29 22:38:20 2015 +0000

    Do not mask original spawn failure if shutdown_instance fails

    Change-Id: Id5e2bd556f00555c203c51d84784fe8db9fdcca2
    Closes-Bug: #1506242

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b1

This issue was fixed in the openstack/nova 13.0.0.0b1 development milestone.

Changed in nova:
status: Fix Committed → Fix Released