resource group fail to rollback when `to be deleted` resources taken quota

Bug #1713900 reported by Rico Lin
This bug affects 1 person
Affects         Status       Importance  Assigned to  Milestone
OpenStack Heat  In Progress  High        Rico Lin

Bug Description

Normally, in a cluster of resources, we need enough quota to make sure an update will succeed, but we don't have any way to control (or reserve) quota for all the resources in a stack. So we hit the resource limit during the update (which is fine, because there is not enough quota to complete it anyway) and the update fails. The problem shows up when we ask that update to roll back (for a Magnum cluster this is always the case): the rollback fails almost every time for a complex resource group, because quota is still held by other resources. For example, say we update a cluster from 20 nodes to 50 nodes and get stuck while updating node number 40 because we run out of resources. We then have around 20 nodes that need to be rolled back with update-replace (for a Magnum cluster this is always true), and another 20 nodes (numbers 21-40) that need to be deleted.
But in most cases the rollback of the first 20 nodes will likely fail, since the other 20 nodes still hold resource quota.

What we can do is prioritise deletion and make sure we delete resources before starting to update/create other resources.
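
To make the ordering problem concrete, here is a minimal simulation of the example above. It is only a sketch and not Heat code: the `Quota` class, the `rollback` function, and the quota limit of 40 are assumptions made for illustration. It models a replace as "create the replacement, then delete the old node", so every replace briefly needs one free quota slot.

```python
class QuotaExceeded(Exception):
    """Raised when a create would push usage past the limit."""


class Quota:
    # Hypothetical stand-in for a per-project resource quota.
    def __init__(self, limit, used=0):
        self.limit = limit
        self.used = used

    def acquire(self):
        if self.used + 1 > self.limit:
            raise QuotaExceeded("limit of %d reached" % self.limit)
        self.used += 1

    def release(self):
        self.used -= 1


def rollback(quota, replace_count, delete_count, delete_first):
    """Return True if the rollback completes, False if it hits the quota."""
    steps = [("delete", delete_count), ("replace", replace_count)]
    if not delete_first:
        steps.reverse()
    try:
        for action, count in steps:
            for _ in range(count):
                if action == "delete":
                    quota.release()
                else:
                    quota.acquire()  # create the replacement node
                    quota.release()  # then delete the node it replaces
    except QuotaExceeded:
        return False
    return True


# 40 nodes exist and the (assumed) limit is 40: 20 to replace, 20 to delete.
print(rollback(Quota(40, used=40), 20, 20, delete_first=False))
print(rollback(Quota(40, used=40), 20, 20, delete_first=True))
```

With replaces attempted first, the very first replacement has no free slot and the rollback fails. With the 20 extra nodes deleted first, every replacement fits.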

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/499020

Changed in heat:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/505811

Changed in heat:
assignee: Rico Lin (rico-lin) → Zane Bitter (zaneb)
Zane Bitter (zaneb)
Changed in heat:
assignee: Zane Bitter (zaneb) → Rico Lin (rico-lin)
Changed in heat:
assignee: Rico Lin (rico-lin) → Zane Bitter (zaneb)
Zane Bitter (zaneb)
Changed in heat:
assignee: Zane Bitter (zaneb) → Rico Lin (rico-lin)
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/510290

OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Rico Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/510290

Rico Lin (rico-lin)
Changed in heat:
milestone: queens-1 → queens-2
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-2 → queens-3
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-3 → queens-rc1
Changed in heat:
assignee: Rico Lin (rico-lin) → Thomas Herve (therve)
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/505811
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=09d74ffa3cb55e62318e0cd9eac10a9cb0c1a70a
Submitter: Zuul
Branch: master

commit 09d74ffa3cb55e62318e0cd9eac10a9cb0c1a70a
Author: Zane Bitter <email address hidden>
Date: Wed Sep 20 14:24:46 2017 -0400

    Prioritise resource deletion over creation

    Because of quotas, there are times when creating a resource and then
    deleting another resource may fail where doing it in the reverse order
    would work, even though the resources are independent of one another.

    When enqueueing 'check_resource' messages, send those for cleanup nodes
    prior to those for update nodes. This means that all things being equal
    (i.e. no dependency relationship), deletions will be started first. It
    doesn't guarantee success when quotas allow, since only a dependency
    relationship will cause Heat to wait for the deletion to complete before
    starting creation, but it is a risk-free way to give us a better chance of
    succeeding.

    Change-Id: I9727d906cd0ad8c4bf9c5e632a47af6d7aad0c72
    Partial-Bug: #1713900
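
The ordering described in the commit message can be sketched roughly as below. This is an illustration under assumptions, not the merged change: the `Node` tuple with an `is_update` flag, and the `order_for_enqueue` and `enqueue_all` helpers, are hypothetical names. The idea is only that cleanup nodes (`is_update=False`) are sent before update nodes, relying on Python's stable sort to keep the original order within each group.

```python
import collections

# Hypothetical graph node: a resource id plus a flag saying whether the
# node is an update (create/update) node or a cleanup (delete) node.
Node = collections.namedtuple("Node", ["rsrc_id", "is_update"])


def order_for_enqueue(nodes):
    """Return nodes with cleanup nodes first, then update nodes.

    False sorts before True, and sorted() is stable, so the relative
    order inside each group is preserved.
    """
    return sorted(nodes, key=lambda node: node.is_update)


def enqueue_all(nodes, send):
    """Call send(node) for each node, deletions before creations."""
    for node in order_for_enqueue(nodes):
        send(node)


nodes = [Node(1, True), Node(2, False), Node(3, True), Node(4, False)]
sent = []
enqueue_all(nodes, sent.append)
print(sent)
```

As the commit message notes, this only changes the order in which work is started. It does not make creation wait for a deletion to finish, so it improves the odds rather than guaranteeing success.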

Zane Bitter (zaneb) wrote :

We merged a possible improvement in Queens. Whether to go all the way and alter the dependency graph is something there isn't consensus on. Moving this bug to Rocky.

Changed in heat:
milestone: queens-rc1 → next
assignee: Thomas Herve (therve) → Rico Lin (rico-lin)