Allocations not deleted on failed resize_instance

Bug #1749215 reported by Claudiu Belu
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Medium      Claudiu Belu
  Ocata                   Won't Fix      Medium      Unassigned
  Pike                    Fix Committed  Medium      Matt Riedemann
  Queens                  Fix Committed  Medium      Claudiu Belu

Bug Description

Description
===========

During a resize, an instance's allocations are removed and replaced by 2 sets of allocations. If the resize completes successfully, one set of allocations is correctly removed, but in case of a failure neither set is removed. If the instance is then deleted, only one set of allocations is removed.

This happens because the call to self.compute_rpcapi.resize_instance [1] is an RPC cast (asynchronous) rather than a call (synchronous). Because of this, the except branch [2], in which the allocation is cleared and the instance is rescheduled, is never reached (illustrated in the sketch after the references below).

Additionally, because not all of the allocations are cleared, the resources on the compute nodes become "locked" and unusable. At some point, instances can no longer be scheduled to those compute nodes, because all of their resources appear to be allocated.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4085
[2] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4123
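
To illustrate why the cast prevents the rollback, here is a minimal, standalone Python sketch; the class and method names are hypothetical stand-ins, not Nova code. When the work is dispatched asynchronously, the callee's failure never reaches the caller's except branch, which is exactly why the cleanup in prep_resize's exception handler never runs.

    # Standalone sketch with hypothetical names -- not Nova code.
    import threading


    class FakeComputeRPCAPI(object):
        def resize_instance_cast(self):
            # Like an oslo.messaging cast: the work happens elsewhere and the
            # caller returns immediately, so a later failure cannot propagate
            # back to the caller as an exception.
            def worker():
                print("destination compute: ResizeError: Unable to resize disk down.")
            threading.Thread(target=worker).start()

        def resize_instance_call(self):
            # Like an oslo.messaging call: synchronous, so the failure
            # propagates back to the caller.
            raise RuntimeError("ResizeError: Unable to resize disk down.")


    def prep_resize(rpcapi, use_cast):
        try:
            if use_cast:
                rpcapi.resize_instance_cast()
            else:
                rpcapi.resize_instance_call()
        except Exception:
            # With a cast this branch is never reached, so the extra set of
            # allocations is never rolled back and no reschedule happens.
            print("rolling back allocations and rescheduling")


    prep_resize(FakeComputeRPCAPI(), use_cast=True)   # rollback never printed
    prep_resize(FakeComputeRPCAPI(), use_cast=False)  # rollback printed

This is also why the fix that was eventually merged (see below) adds the cleanup inside the resize_instance and finish_resize methods themselves, on the receiving side of the cast.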

Steps to reproduce
==================

* Spawn an instance.
* Observe that the table nova_api.allocations has only 1 set of allocations for the instance (VCPU, MEMORY_MB, DISK_GB); see the query sketch after this list.
* Cold resize to an invalid flavor (e.g. one with a smaller disk).
* Observe that the table nova_api.allocations has 2 sets of allocations for the instance.
* Observe that the cold resize failed, and that the instance's task state has been reverted to its original state.
* Observe that the table nova_api.allocations continues to have 2 sets of allocations.
* Delete the instance.
* Observe that even after the instance has been destroyed, there is still 1 set of allocations for the instance.
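
As an alternative to inspecting the nova_api database, the same check can be made against the placement HTTP API. This is a hedged sketch: the endpoint, token, and UUID values are placeholders to adapt to your deployment.

    # Placeholders: adjust PLACEMENT, TOKEN and CONSUMER_UUID to your cloud.
    import requests

    PLACEMENT = "http://controller:8778/placement"
    TOKEN = "<admin token>"
    CONSUMER_UUID = "<instance uuid>"  # or the migration UUID, see the comments below

    resp = requests.get("%s/allocations/%s" % (PLACEMENT, CONSUMER_UUID),
                        headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()
    # A healthy, non-resizing instance reports a single resource provider
    # holding its VCPU, MEMORY_MB and DISK_GB allocations; after the failed
    # resize the extra set described above remains visible.
    for rp_uuid, alloc in resp.json()["allocations"].items():
        print(rp_uuid, alloc["resources"])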

Expected result
===============

After the cold resize failed, there should be only 1 set of allocations in the nova_api.allocations table, and after deleting the instance, there shouldn't be any.

Actual result
=============

After the cold resize failed, there are 2 sets of allocations in the nova_api.allocations table, and after deleting the instance, there is still 1 set of allocations.

Environment
===========

Branch: Queens
Hypervisor: Hyper-V Server 2012 R2 (unrelated)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/543971

Changed in nova:
assignee: nobody → Claudiu Belu (cbelu)
status: New → In Progress
Matt Riedemann (mriedem)
tags: added: placement resize
summary: - Allocations not deleted on failed resize
+ Allocations not deleted on failed resize_instance
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

> Observe even after the instance has been destroyed, there is still 1 set of allocations for the instance.

To be clear, in Queens when doing a cold migrate, we'll have a set of allocations in placement for both the source and destination compute node providers. The source node allocations are tracked against the migration record, and the destination node allocations are tracked against the instance.

If resize_instance fails and we don't roll back allocations, then when you delete the instance, it should remove its allocations from the destination node:

https://github.com/openstack/nova/blob/fba4161f71e47e4155e7f9a1f0d3a41b8107cef5/nova/compute/manager.py#L751

https://github.com/openstack/nova/blob/fba4161f71e47e4155e7f9a1f0d3a41b8107cef5/nova/scheduler/client/report.py#L1628

However, the source node resource provider would still have allocations against it in placement, tracked via the migration record, and those don't get removed if we failed to roll back when the resize failed. When we delete the instance, we also delete the migration records associated with it:

https://github.com/openstack/nova/blob/fba4161f71e47e4155e7f9a1f0d3a41b8107cef5/nova/db/sqlalchemy/api.py#L1850
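
For operators hitting this before the fix, a hedged cleanup sketch follows (the endpoint, token, and UUID below are placeholders): the leaked source-node allocations are held with the migration's UUID as the consumer, and the placement API allows reading and deleting a consumer's allocations directly. Only delete allocations you have confirmed are stale.

    # Placeholders: adjust PLACEMENT, the token and MIGRATION_UUID to your cloud.
    import requests

    PLACEMENT = "http://controller:8778/placement"
    HEADERS = {"X-Auth-Token": "<admin token>"}
    MIGRATION_UUID = "<uuid of the failed migration record>"

    url = "%s/allocations/%s" % (PLACEMENT, MIGRATION_UUID)

    # Inspect what the migration consumer still holds on the source node.
    print(requests.get(url, headers=HEADERS).json())

    # Only after confirming these allocations are stale, remove them.
    resp = requests.delete(url, headers=HEADERS)
    print(resp.status_code)  # 204 on success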

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

During reproduction I saw the following trace:

Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server [None req-d3fc1ed4-04fb-4072-8df2-9e014db3714c admin admin] Exception during message handling: ResizeError: Resize error: Unable to resize disk down.
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 76, in wrapped
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server self.force_reraise()
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 67, in wrapped
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 186, in decorated_function
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server "Error: %s", e, instance=instance)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server self.force_reraise()
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server s...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So, what I understand so far:
* resize fails, but as it was a cast the ResizeError exception is not propagated back (hence the trace above) => the allocation consumed by the migration remains after the migration fails
* the migration is set to error state [1] and is therefore not considered ongoing by the resource tracker [2] in its periodic _update_available_resource calculation [3] => the allocation consumed by the migration is not cleaned up by the periodic _update_available_resource task
* there is another periodic task specifically to clean up failed resizes, _cleanup_incomplete_migrations [4]; it sees the failed resize but only cares about deleted instances [5] and does not delete the allocation at all
* when the instance is deleted we delete the migrations but not the allocations possibly held by those migrations

Don't we need to roll back the allocation in [1] when we set the migration to error? If the migration is consuming allocations, then marking the migration as failed means we need to handle any allocations consumed by that migration. (A rough sketch of such a rollback follows after the references below.)

[1]
https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/manager.py#L112
[2] https://github.com/openstack/nova/blob/1d3064c50b5784b174d611cb9e6e213fb02dac6f/nova/db/sqlalchemy/api.py#L4361
[3] https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/resource_tracker.py#L724
[4] https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/manager.py#L7655
[5] https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/manager.py#L7674
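
A rough, standalone sketch of the rollback pattern discussed above (the callables are hypothetical stand-ins, not the patch that was merged): when the resize work fails on the compute side, release the allocation claimed for the resize before re-raising, so the compute node's inventory is not leaked. It uses oslo.utils' save_and_reraise_exception, which is available in a Nova environment. The merged fix below places this kind of cleanup in the resize_instance and finish_resize methods.

    # Illustrative only; resize_work and drop_resize_allocation stand in for
    # the real compute-manager and resource-tracker calls.
    from oslo_utils import excutils


    def do_resize(resize_work, drop_resize_allocation):
        try:
            resize_work()
        except Exception:
            with excutils.save_and_reraise_exception():
                # Roll back the allocation claimed for the resize so it does
                # not remain consumed after the failure.
                drop_resize_allocation()


    def failing_resize():
        raise RuntimeError("Unable to resize disk down.")


    try:
        do_resize(failing_resize, lambda: print("resize allocation removed"))
    except RuntimeError as exc:
        print("resize failed: %s" % exc)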

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/543971
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=caf167862dd82e98f0189c9598856de57dfa7d35
Submitter: Zuul
Branch: master

commit caf167862dd82e98f0189c9598856de57dfa7d35
Author: Claudiu Belu <email address hidden>
Date: Sun Feb 11 10:07:52 2018 -0800

    compute: Cleans up allocations after failed resize

    During cold resize, the ComputeManager's prep_resize calls the
    rpcapi.ComputeAPI's resize_instance method, which will then do an
    RPC cast (async).

    Because the RPC cast is asynchronous, the exception branch in prep_resize
    will not be executed if the cold resize failed, the allocations will not
    be cleaned up, and the instance will not be rescheduled.

    This patch adds allocation cleanup in the resize_instance and finish_resize
    methods.

    Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
    Closes-Bug: #1749215

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/548300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/548584

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/548300
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5039511840bd64151f3111d9c8d7d8a01344193b
Submitter: Zuul
Branch: stable/queens

commit 5039511840bd64151f3111d9c8d7d8a01344193b
Author: Claudiu Belu <email address hidden>
Date: Sun Feb 11 10:07:52 2018 -0800

    compute: Cleans up allocations after failed resize

    During cold resize, the ComputeManager's prep_resize calls the
    rpcapi.ComputeAPI's resize_instance method, which will then do an
    RPC cast (async).

    Because the RPC cast is asynchronous, the exception branch in prep_resize
    will not be executed if the cold resize failed, the allocations will not
    be cleaned up, and the instance will not be rescheduled.

    This patch adds allocation cleanup in the resize_instance and finish_resize
    methods.

    Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
    Closes-Bug: #1749215
    (cherry picked from commit caf167862dd82e98f0189c9598856de57dfa7d35)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.1

This issue was fixed in the openstack/nova 17.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/548584
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8f6371c868625351e3b017d6b2aaaa458cd73172
Submitter: Zuul
Branch: stable/pike

commit 8f6371c868625351e3b017d6b2aaaa458cd73172
Author: Claudiu Belu <email address hidden>
Date: Sun Feb 11 10:07:52 2018 -0800

    compute: Cleans up allocations after failed resize

    During cold resize, the ComputeManager's prep_resize calls the
    rpcapi.ComputeAPI's resize_instance method, which will then do an
    RPC cast (async).

    Because the RPC cast is asynchronous, the exception branch in prep_resize
    will not be executed if the cold resize failed, the allocations will not
    be cleaned up, and the instance will not be rescheduled.

    This patch adds allocation cleanup in the resize_instance and finish_resize
    methods.

    Conflicts:
          nova/compute/manager.py
          nova/tests/functional/test_servers.py
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The conflicts are due to not having these changes in Pike:

      Id2f352a68f99df430d20634695be51f74db95d87

      I4f0c99bb11d921d2875737c5cfbbb866be55c37c

      Icc3ffe4037a44f4f323bec2f80d99ca226742e22

      I9bcc1605945ddc35df2288be72e194c050b8ddd9

      I7c3c95d4e0836e1af054d076aad29172574eab2c

    Because we don't have I7c3c95d4e0836e1af054d076aad29172574eab2c which
    added the _revert_allocation method, this backport has to use the
    ResourceTracker.delete_allocation_for_failed_resize method to rollback
    the *new* flavor allocation for the instance (the new flavor is the one
    being used for the resize). When we fail, we remove the new flavor
    allocation so the instance is left with its current (old) flavor
    allocation on the source compute node. As a result, several unit tests
    must be updated to mock delete_allocation_for_failed_resize.

    Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
    Closes-Bug: #1749215
    (cherry picked from commit caf167862dd82e98f0189c9598856de57dfa7d35)
    (cherry picked from commit 5039511840bd64151f3111d9c8d7d8a01344193b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.1

This issue was fixed in the openstack/nova 16.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think this possibly introduced a regression (bug 1825537), so I've marked Ocata as Won't Fix for this bug.
