Normally, when updating a cluster of resources, we need enough quota for the update to succeed, but we have no way to control (or reserve) quota for all resources in the stack. So an update can hit a resource limit and fail, which is fine because there is not enough quota to complete it anyway. The problem shows up when we ask that update to roll back (for a Magnum cluster this is always the case): for a complex resource group the rollback fails almost every time because quota is still held by other resources.
Example: we update a cluster from 20 nodes to 50 nodes and get stuck while updating node number 40 because we run out of resources. We then have around 20 nodes that need to be rolled back with update-replace (for a Magnum cluster this is always true), and another 20 nodes (numbers 21-40) that need to be deleted. But in most cases the rollback of the first 20 nodes will fail, since the other 20 nodes still hold resource quota.
What we can do is prioritize deletion and make sure we delete resources before starting to update or create other resources.
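A minimal sketch of why the ordering matters, using the 20-to-50 node example. This is not Heat code: `Op`, `run`, `deletes_first`, and the counter-based quota are hypothetical names for illustration. It assumes a quota limit of 40 nodes and that an update-replace creates the replacement before removing the old node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str  # "delete" or "replace" (hypothetical operation kinds)
    node: int


class QuotaExceeded(Exception):
    pass


def run(ops, used, limit):
    """Apply ops in order against a simple quota counter; return final usage."""
    for op in ops:
        if op.kind == "delete":
            used -= 1
        elif op.kind == "replace":
            # Assumption: the replacement is created before the old node is removed,
            # so one extra unit of quota is needed for a moment.
            if used + 1 > limit:
                raise QuotaExceeded(f"node {op.node}: quota {used}/{limit} in use")
            used += 1  # replacement created
            used -= 1  # old node removed
    return used


def deletes_first(ops):
    """Reorder so every delete runs before any replace (sorted() is stable)."""
    return sorted(ops, key=lambda op: op.kind != "delete")


# Rollback work after the failed update: replace nodes 1-20, delete nodes 21-40.
rollback = [Op("replace", n) for n in range(1, 21)] + \
           [Op("delete", n) for n in range(21, 41)]
```

With 40 of 40 units in use, `run(rollback, used=40, limit=40)` raises `QuotaExceeded` on the first replace, while `run(deletes_first(rollback), used=40, limit=40)` frees quota first and completes.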
Fix proposed to branch: master
Review: https://review.openstack.org/499020