Allocations not deleted on failed resize_instance

Bug #1749215 reported by Claudiu Belu
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Medium      Claudiu Belu
  Ocata                   Won't Fix      Medium      Unassigned
  Pike                    Fix Committed  Medium      Matt Riedemann
  Queens                  Fix Committed  Medium      Claudiu Belu

Bug Description

Description
===========

During a resize, an instance's allocations are removed and replaced by 2 sets of allocations. If the resize completes successfully, one set of allocations is correctly removed, but in case of a failure neither set is removed. If the instance is then deleted, only one set of allocations is removed.

This happens because the call to self.compute_rpcapi.resize_instance [1] is an RPC cast (asynchronous) rather than a call (synchronous). Because of this, the except branch [2], in which the allocation is cleared and the instance is rescheduled, is never reached (illustrated in the sketch after the references below).

Additionally, because not all of the allocations are cleared, the resources on the compute nodes become "locked" and unusable. At some point, instances can no longer be scheduled to those compute nodes, because all of their resources appear to be allocated.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4085
[2] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4123
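
To illustrate why the cast prevents the rollback, here is a minimal, standalone Python sketch; the class and method names are hypothetical stand-ins, not Nova code. When the work is dispatched asynchronously, the callee's failure never reaches the caller's except branch, which is exactly why the cleanup in prep_resize's exception handler never runs.

    # Standalone sketch with hypothetical names -- not Nova code.
    import threading


    class FakeComputeRPCAPI(object):
        def resize_instance_cast(self):
            # Like an oslo.messaging cast: the work happens elsewhere and the
            # caller returns immediately, so a later failure cannot propagate
            # back to the caller as an exception.
            def worker():
                print("destination compute: ResizeError: Unable to resize disk down.")
            threading.Thread(target=worker).start()

        def resize_instance_call(self):
            # Like an oslo.messaging call: synchronous, so the failure
            # propagates back to the caller.
            raise RuntimeError("ResizeError: Unable to resize disk down.")


    def prep_resize(rpcapi, use_cast):
        try:
            if use_cast:
                rpcapi.resize_instance_cast()
            else:
                rpcapi.resize_instance_call()
        except Exception:
            # With a cast this branch is never reached, so the extra set of
            # allocations is never rolled back and no reschedule happens.
            print("rolling back allocations and rescheduling")


    prep_resize(FakeComputeRPCAPI(), use_cast=True)   # rollback never printed
    prep_resize(FakeComputeRPCAPI(), use_cast=False)  # rollback printed

This is also why the fix that was eventually merged (see below) adds the cleanup inside the resize_instance and finish_resize methods themselves, on the receiving side of the cast.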

Steps to reproduce
==================

* Spawn an instance.
* Observe that the table nova_api.allocations has only 1 set of allocations for the instance (VCPU, MEMORY_MB, DISK_GB); see the query sketch after this list.
* Cold resize to an invalid flavor (e.g. one with a smaller disk).
* Observe that the table nova_api.allocations has 2 sets of allocations for the instance.
* Observe that the cold resize failed, and that the instance's task state has been reverted to its original state.
* Observe that the table nova_api.allocations continues to have 2 sets of allocations.
* Delete the instance.
* Observe that even after the instance has been destroyed, there is still 1 set of allocations for the instance.
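
As an alternative to inspecting the nova_api database, the same check can be made against the placement HTTP API. This is a hedged sketch: the endpoint, token, and UUID values are placeholders to adapt to your deployment.

    # Placeholders: adjust PLACEMENT, TOKEN and CONSUMER_UUID to your cloud.
    import requests

    PLACEMENT = "http://controller:8778/placement"
    TOKEN = "<admin token>"
    CONSUMER_UUID = "<instance uuid>"  # or the migration UUID, see the comments below

    resp = requests.get("%s/allocations/%s" % (PLACEMENT, CONSUMER_UUID),
                        headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()
    # A healthy, non-resizing instance reports a single resource provider
    # holding its VCPU, MEMORY_MB and DISK_GB allocations; after the failed
    # resize the extra set described above remains visible.
    for rp_uuid, alloc in resp.json()["allocations"].items():
        print(rp_uuid, alloc["resources"])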

Expected result
===============

After the cold resize failed, there should be only 1 set of allocations in the nova_api.allocations table, and after deleting the instance, there shouldn't be any.

Actual result
=============

After the cold resize failed, there are 2 sets of allocations in the nova_api.allocations table, and after deleting the instance, there is still 1 set of allocations.

Environment
===========

Branch: Queens
Hypervisor: Hyper-V Server 2012 R2 (unrelated)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/543971

Changed in nova:
assignee: nobody → Claudiu Belu (cbelu)
status: New → In Progress
Matt Riedemann (mriedem)
tags: added: placement resize
summary: - Allocations not deleted on failed resize
+ Allocations not deleted on failed resize_instance
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

> Observe even after the instance has been destroyed, there is still 1 set of allocations for the instance.

To be clear, in Queens when doing a cold migrate, we'll have a set of allocations in placement for both the source and destination compute node providers. The source node allocations are tracked against the migration record, and the destination node allocations are tracked against the instance.

If resize_instance fails and we don't roll back allocations, then when you delete the instance, it should remove its allocations from the destination node:

https://github.com/openstack/nova/blob/fba4161f71e47e4155e7f9a1f0d3a41b8107cef5/nova/compute/manager.py#L751

https://github.com/openstack/nova/blob/fba4161f71e47e4155e7f9a1f0d3a41b8107cef5/nova/scheduler/client/report.py#L1628

However, the source node resource provider would still have allocations against it in placement, tracked via the migration record, and those don't get removed if we failed to roll back when the resize failed. When we delete the instance, we also delete the migration records associated with it:

https://github.com/openstack/nova/blob/fba4161f71e47e4155e7f9a1f0d3a41b8107cef5/nova/db/sqlalchemy/api.py#L1850
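
For operators hitting this before the fix, a hedged cleanup sketch follows (the endpoint, token, and UUID below are placeholders): the leaked source-node allocations are held with the migration's UUID as the consumer, and the placement API allows reading and deleting a consumer's allocations directly. Only delete allocations you have confirmed are stale.

    # Placeholders: adjust PLACEMENT, the token and MIGRATION_UUID to your cloud.
    import requests

    PLACEMENT = "http://controller:8778/placement"
    HEADERS = {"X-Auth-Token": "<admin token>"}
    MIGRATION_UUID = "<uuid of the failed migration record>"

    url = "%s/allocations/%s" % (PLACEMENT, MIGRATION_UUID)

    # Inspect what the migration consumer still holds on the source node.
    print(requests.get(url, headers=HEADERS).json())

    # Only after confirming these allocations are stale, remove them.
    resp = requests.delete(url, headers=HEADERS)
    print(resp.status_code)  # 204 on success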

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

During reproduction I saw the following trace:

Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server [None req-d3fc1ed4-04fb-4072-8df2-9e014db3714c admin admin] Exception during message handling: ResizeError: Resize error: Unable to resize disk down.
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 76, in wrapped
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server self.force_reraise()
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 67, in wrapped
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 186, in decorated_function
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server "Error: %s", e, instance=instance)
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server self.force_reraise()
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Feb 11 00:38:53 ubuntu nova-compute[28931]: ERROR oslo_messaging.rpc.server s...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So, what I understand so far:
* resize fails, but as it was a cast the ResizeError exception is not propagated back (hence the trace above) => the allocation consumed by the migration remains after the migration fails
* the migration is set to error state [1] and is therefore not considered ongoing by the resource tracker [2] in its periodic _update_available_resource calculation [3] => the allocation consumed by the migration is not cleaned up by the periodic _update_available_resource task
* there is another periodic task specifically to clean up failed resizes, _cleanup_incomplete_migrations [4]; it sees the failed resize but only cares about deleted instances [5] and does not delete the allocation at all
* when the instance is deleted we delete the migrations but not the allocations possibly held by those migrations

Don't we need to roll back the allocation in [1] when we set the migration to error? If the migration is consuming allocations, then marking the migration as failed means we need to handle any allocations consumed by that migration. (A rough sketch of such a rollback follows after the references below.)

[1]
https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/manager.py#L112
[2] https://github.com/openstack/nova/blob/1d3064c50b5784b174d611cb9e6e213fb02dac6f/nova/db/sqlalchemy/api.py#L4361
[3] https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/resource_tracker.py#L724
[4] https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/manager.py#L7655
[5] https://github.com/openstack/nova/blob/fa6c0f9cb14f1b4ce4d9b1dbacb1743173089986/nova/compute/manager.py#L7674
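
A rough, standalone sketch of the rollback pattern discussed above (the callables are hypothetical stand-ins, not the patch that was merged): when the resize work fails on the compute side, release the allocation claimed for the resize before re-raising, so the compute node's inventory is not leaked. It uses oslo.utils' save_and_reraise_exception, which is available in a Nova environment. The merged fix below places this kind of cleanup in the resize_instance and finish_resize methods.

    # Illustrative only; resize_work and drop_resize_allocation stand in for
    # the real compute-manager and resource-tracker calls.
    from oslo_utils import excutils


    def do_resize(resize_work, drop_resize_allocation):
        try:
            resize_work()
        except Exception:
            with excutils.save_and_reraise_exception():
                # Roll back the allocation claimed for the resize so it does
                # not remain consumed after the failure.
                drop_resize_allocation()


    def failing_resize():
        raise RuntimeError("Unable to resize disk down.")


    try:
        do_resize(failing_resize, lambda: print("resize allocation removed"))
    except RuntimeError as exc:
        print("resize failed: %s" % exc)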

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/543971
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=caf167862dd82e98f0189c9598856de57dfa7d35
Submitter: Zuul
Branch: master

commit caf167862dd82e98f0189c9598856de57dfa7d35
Author: Claudiu Belu <email address hidden>
Date: Sun Feb 11 10:07:52 2018 -0800

    compute: Cleans up allocations after failed resize

    During cold resize, the ComputeManager's prep_resize calls the
    rpcapi.ComputeAPI's resize_instance method, which will then do an
    RPC cast (async).

    Because the RPC cast is asynchronous, the exception branch in prep_resize
    will not be executed if the cold resize failed, the allocations will not
    be cleaned up, and the instance will not be rescheduled.

    This patch adds allocation cleanup in the resize_instance and finish_resize
    methods.

    Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
    Closes-Bug: #1749215

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/548300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/548584

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/548300
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5039511840bd64151f3111d9c8d7d8a01344193b
Submitter: Zuul
Branch: stable/queens

commit 5039511840bd64151f3111d9c8d7d8a01344193b
Author: Claudiu Belu <email address hidden>
Date: Sun Feb 11 10:07:52 2018 -0800

    compute: Cleans up allocations after failed resize

    During cold resize, the ComputeManager's prep_resize calls the
    rpcapi.ComputeAPI's resize_instance method, which will then do an
    RPC cast (async).

    Because the RPC cast is asynchronous, the exception branch in prep_resize
    will not be executed if the cold resize failed, the allocations will not
    be cleaned up, and the instance will not be rescheduled.

    This patch adds allocation cleanup in the resize_instance and finish_resize
    methods.

    Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
    Closes-Bug: #1749215
    (cherry picked from commit caf167862dd82e98f0189c9598856de57dfa7d35)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.1

This issue was fixed in the openstack/nova 17.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/548584
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8f6371c868625351e3b017d6b2aaaa458cd73172
Submitter: Zuul
Branch: stable/pike

commit 8f6371c868625351e3b017d6b2aaaa458cd73172
Author: Claudiu Belu <email address hidden>
Date: Sun Feb 11 10:07:52 2018 -0800

    compute: Cleans up allocations after failed resize

    During cold resize, the ComputeManager's prep_resize calls the
    rpcapi.ComputeAPI's resize_instance method, which will then do an
    RPC cast (async).

    Because the RPC cast is asynchronous, the exception branch in prep_resize
    will not be executed if the cold resize failed, the allocations will not
    be cleaned up, and the instance will not be rescheduled.

    This patch adds allocation cleanup in the resize_instance and finish_resize
    methods.

    Conflicts:
          nova/compute/manager.py
          nova/tests/functional/test_servers.py
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The conflicts are due to not having these changes in Pike:

      Id2f352a68f99df430d20634695be51f74db95d87

      I4f0c99bb11d921d2875737c5cfbbb866be55c37c

      Icc3ffe4037a44f4f323bec2f80d99ca226742e22

      I9bcc1605945ddc35df2288be72e194c050b8ddd9

      I7c3c95d4e0836e1af054d076aad29172574eab2c

    Because we don't have I7c3c95d4e0836e1af054d076aad29172574eab2c which
    added the _revert_allocation method, this backport has to use the
    ResourceTracker.delete_allocation_for_failed_resize method to rollback
    the *new* flavor allocation for the instance (the new flavor is the one
    being used for the resize). When we fail, we remove the new flavor
    allocation so the instance is left with its current (old) flavor
    allocation on the source compute node. As a result, several unit tests
    must be updated to mock delete_allocation_for_failed_resize.

    Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
    Closes-Bug: #1749215
    (cherry picked from commit caf167862dd82e98f0189c9598856de57dfa7d35)
    (cherry picked from commit 5039511840bd64151f3111d9c8d7d8a01344193b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.1

This issue was fixed in the openstack/nova 16.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think this possibly introduced a regression (bug 1825537), so I've marked Ocata as Won't Fix for this bug.
