quotas not updated when periodic tasks or startup finish deletes

Bug #1296414 reported by Chris Behrens
This bug affects 11 people
Affects                    Status        Importance  Assigned to        Milestone
OpenStack Compute (nova)   Fix Released  High        Dmitry Stepanenko
  Juno                     Fix Released  Undecided   Unassigned
  Kilo                     Fix Released  Undecided   Unassigned

Bug Description

There are a couple of cases in the compute manager where we don't pass reservations to _delete_instance(). For example, one of them is cleaning up when we see a delete that is stuck in DELETING.

The only place we ever update quotas as part of delete should be when the instance DB record is removed. If something is stuck in DELETING, it means that the quota was not updated. We should make sure we're always updating the quota when the instance DB record is removed.

Soft delete kinda throws a wrench in this, though, because I think you want soft-deleted instances not to count against quotas -- yet their DB records will still exist. In this case, it seems we may have a race condition in _delete_instance() -> _complete_deletion() where, if the instance somehow was SOFT_DELETED, quotas would have been updated twice (once in soft_delete and once in _complete_deletion).
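
A minimal, self-contained sketch of the invariant described above (toy code with illustrative names, not Nova's actual implementation): the reservation is committed exactly once, at the point the instance DB record is removed, so any delete path that fails to pass the reservation leaves quota usage out of sync.

    class Reservation(object):
        """Toy stand-in for a quota reservation."""
        def __init__(self):
            self.state = 'reserved'

        def commit(self):
            if self.state == 'reserved':
                self.state = 'committed'

        def rollback(self):
            if self.state == 'reserved':
                self.state = 'rolled_back'

    def delete_instance(db, uuid, quotas=None):
        # Removing the DB record is the only point at which a delete
        # path should update quotas.
        db.pop(uuid, None)
        if quotas is not None:
            quotas.commit()
        # A caller that passes quotas=None (e.g. cleanup of an instance
        # stuck in DELETING) removes the record without freeing quota.

    db = {'inst-1': {'task_state': 'deleting'}}
    res = Reservation()
    delete_instance(db, 'inst-1', quotas=res)
    assert res.state == 'committed' and 'inst-1' not in db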

Chris Behrens (cbehrens)
description: updated
Chris Behrens (cbehrens)
description: updated
Tracy Jones (tjones-i)
tags: added: compute
melanie witt (melwitt)
Changed in nova:
importance: Undecided → High
status: New → Confirmed
jichenjc (jichenjc)
Changed in nova:
assignee: nobody → jichenjc (jichenjc)
Revision history for this message
jichenjc (jichenjc) wrote :

I went through the code and I think the scenario pointed out is the only path we need to take care of;
the others seem to be handled already (including soft delete).

Revision history for this message
jichenjc (jichenjc) wrote :

Please take a look at scheduler/manager:
there is a periodic task that expires reservations and destroys them.
So since we only 'reserve' and never 'rollback' or 'commit'
for quota stuck this way because of 'DELETING', we might just let the system expire and destroy the reservations ...

Revision history for this message
jichenjc (jichenjc) wrote :

_reclaim_queued_deletes, which handles soft-deleted instances, will not commit or roll back the quota,
so that means only one quota action is performed during the soft-delete stage ...
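
A small self-contained sketch of the flow being described (simplified, assumed names rather than Nova's actual code): the quota is committed once at soft-delete time, and the periodic reclaim task only hard-deletes old SOFT_DELETED instances without any further quota action.

    import time

    class Reservation(object):
        def __init__(self):
            self.commits = 0

        def commit(self):
            self.commits += 1

    def soft_delete(instance, reservation):
        # The single quota action of the soft-delete stage.
        instance['vm_state'] = 'soft-deleted'
        instance['deleted_at'] = time.time()
        reservation.commit()

    def reclaim_queued_deletes(instances, reclaim_interval):
        now = time.time()
        for inst in list(instances):
            old_enough = now - inst.get('deleted_at', now) >= reclaim_interval
            if inst['vm_state'] == 'soft-deleted' and old_enough:
                # Hard delete only; the quota was already committed in
                # soft_delete(), so nothing is committed or rolled back here.
                instances.remove(inst)

    inst = {'vm_state': 'active'}
    res = Reservation()
    soft_delete(inst, res)
    reclaim_queued_deletes([inst], reclaim_interval=0)
    assert res.commits == 1    # one quota action for the whole flow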

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is also easily triggered by bug 1299139.

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/89318

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/89761

Revision history for this message
jichenjc (jichenjc) wrote :

some discussion topics:

comments in _complete_deletion

        # FIXME(comstud): See bug 1296414. Quotas here should only be
        # committed if the instance is not SOFT_DELETED. If there were
        # some sort of race and the instance was SOFT_DELETED, we should
        # not commit quotas, as they would have already been done in
        # soft_delete().

I think we can never enter this condition, because _complete_deletion is only called from _complete_partial_deletion and
_delete_instance.

You can only enter _complete_partial_deletion if the instance state is DELETED,
and if _complete_deletion is reached from _delete_instance, the instance state has already become DELETED
(instance.vm_state = vm_states.DELETED).

Also, if _delete_instance was called with the instance in SOFT_DELETED state, quotas.rollback() will be called,
which makes the later quotas.commit() take no effect since quotas.reservations is already None.

So I think it's not a problem.
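
A short sketch of that last point (toy code modelling the behaviour the comment describes, not Nova's actual quotas object): once rollback() has run, the reservations attribute is cleared, so a later commit() is a no-op.

    class Quotas(object):
        def __init__(self, reservations):
            self.reservations = reservations
            self.committed = False

        def rollback(self):
            # Frees the reserved usage and clears the reservation list.
            self.reservations = None

        def commit(self):
            if not self.reservations:
                return               # nothing left to commit after rollback
            self.committed = True
            self.reservations = None

    q = Quotas(['res-1'])
    q.rollback()    # e.g. _delete_instance() finding a SOFT_DELETED instance
    q.commit()      # the later commit in _complete_deletion() has no effect
    assert q.committed is False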

Revision history for this message
jichenjc (jichenjc) wrote :

in _reclaim_queued_deletes there is
        # FIXME(comstud): Dummy quota object for now. See bug 1296414.
        # We have potential for inconsistency if we decide here to not
        # update quotas. _delete_instance() should determine whether or
        # not to update quotas based on if instance is in a SOFT_DELETED
        # state or not.

I think it's reasonable:
soft_delete_instance updates the instance to SOFT_DELETED first and then commits the quota, and
_init_instance does not handle SOFT_DELETED instances, leaving them to _reclaim_queued_deletes.

In this case, we need to commit the quota; I will submit a patch for it.

let me know your comments, thanks

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/90280

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/90280
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=29f2953edf2dc95569d53433d7c28251f652f8d4
Submitter: Jenkins
Branch: master

commit 29f2953edf2dc95569d53433d7c28251f652f8d4
Author: jichenjc <email address hidden>
Date: Wed Apr 23 06:17:02 2014 +0800

    Remove comments and to-do for quota inconsistency

    There are comments about quota inconsistency and possible race
    conditions in the function _complete_deletion.

    _complete_deletion is called by _complete_partial_deletion
    and _delete_instance. _complete_partial_deletion is only
    entered when the instance state is DELETED, and if
    _complete_deletion is called from _delete_instance, the
    instance vm state has already become DELETED.

    Also, if _delete_instance was called with the instance in
    SOFT_DELETED state, quotas.rollback() will be called, which
    makes the later quotas.commit() take no effect since
    quotas.reservations is already None.

    So there are no race conditions and the comments can be removed.

    Change-Id: Iddb9e5197ca7ca9bd4692d63bb5c7f9a2ab44be5
    Partial-Bug: #1296414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93374

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/93374
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1b35a3b263f47558e2e7791e27810741881eea43
Submitter: Jenkins
Branch: master

commit 1b35a3b263f47558e2e7791e27810741881eea43
Author: jichenjc <email address hidden>
Date: Mon May 12 20:27:43 2014 +0800

    Change the comments of SOFT_DELETED race condition

    _reclaim_queued_deletes will only find instances in SOFT_DELETED
    state and delete them only if they are old enough. The quotas were
    already committed when the instance was soft-deleted, so that the
    resources can be reused right after the soft-delete operation.

    There is some concern about quota inconsistency for this operation.
    The only case where the quota might become inconsistent is if the
    compute node dies between setting the instance state to SOFT_DELETED
    and committing the quota to the DB; when the compute node starts
    again it has no idea whether the reservation was committed, rolled
    back or expired. Since this is a rare case, it is marked as a TODO.

    There are some alternatives if the problem needs to be fixed, e.g.
    being able to find the reservation when nova-compute restarts, so we
    can double-check whether the quota was committed or not.

    Partial-Bug: #1296414

    Change-Id: Idf9c179b2dd439462a646568ffd5098cd5d7851f
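
A tiny sketch of the window this commit message describes (simplified, hypothetical code, not the actual compute manager): if the service dies between persisting the SOFT_DELETED state and committing the reservation, a restarted service cannot tell whether the commit already happened.

    class Reservation(object):
        def __init__(self):
            self.state = 'reserved'

        def commit(self):
            self.state = 'committed'

    def soft_delete_instance(instance, reservation, crash_before_commit=False):
        instance['vm_state'] = 'soft-deleted'   # step 1: persist state change
        if crash_before_commit:
            # Simulated compute crash: the reservation is left pending and
            # will only expire later; on restart there is no record of
            # whether the commit below ever ran.
            raise RuntimeError('compute died before quota commit')
        reservation.commit()                    # step 2: commit the quota

    inst = {'vm_state': 'active'}
    res = Reservation()
    try:
        soft_delete_instance(inst, res, crash_before_commit=True)
    except RuntimeError:
        pass
    # The instance looks soft-deleted, but the quota was never committed.
    assert inst['vm_state'] == 'soft-deleted' and res.state == 'reserved'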

Revision history for this message
jichenjc (jichenjc) wrote :

Rajesh Tailor can reproduce the deletion error, so I unassigned myself and Rajesh will continue to work on it.

Changed in nova:
assignee: jichenjc (jichenjc) → nobody
Changed in nova:
assignee: nobody → Rajesh Tailor (rajesh-tailor)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/122347

Sean Dague (sdague)
tags: added: quotas
Matt Riedemann (mriedem)
tags: added: juno-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/89761
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Patrick Crews (patrick-crews) wrote :
Revision history for this message
Rajesh Tailor (rajesh-tailor) wrote :

Hi Patrick,
The bugs mentioned by you:
1) https://bugs.launchpad.net/nova/+bug/1284424
There is not enough information in the bug description/comments for me to identify whether it is related to the current bug (bug #1296414) or not.
2) https://bugs.launchpad.net/nova/+bug/1333145
I'm not sure, but IMO this might be the case mentioned in the current bug's (bug #1296414) description, where quotas were not updated properly (i.e. the quota was updated every time for subsequent soft-delete requests).

As mentioned in the description of the current bug (bug #1296414), there were a couple of cases where the quota was not updated properly
while deleting an instance.
As of now, there is only one case left where the quota is not updated properly: when an instance is stuck in the 'deleting' task_state.
On restart of the nova-compute service, the _init_instance method then tries to delete instances stuck in the 'deleting' task_state and
doesn't update the quota properly.

My patch addresses the issue of quotas not being updated properly when deleting instances stuck in the 'deleting' task_state on
restart of the nova-compute service.
please refer: https://review.openstack.org/122347
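
A minimal sketch of the shape of that fix (toy code with assumed names; see the review above for the real change): when startup cleanup finds an instance stuck in the 'deleting' task state, build quota reservations for that instance's resources and pass them into the delete path, so they are committed when the DB record is removed.

    class Quotas(object):
        def __init__(self, deltas):
            self.deltas = deltas
            self.committed = False

        def commit(self):
            self.committed = True

    def delete_instance(db, uuid, quotas):
        db.pop(uuid, None)
        quotas.commit()    # quota updated exactly when the record goes away

    def init_instance(db, uuid):
        inst = db[uuid]
        if inst['task_state'] == 'deleting':
            # Re-create the reservation the interrupted delete would have
            # held, instead of calling the delete path without quotas.
            quotas = Quotas({'instances': -1,
                             'cores': -inst['vcpus'],
                             'ram': -inst['memory_mb']})
            delete_instance(db, uuid, quotas)
            return quotas

    db = {'inst-1': {'task_state': 'deleting', 'vcpus': 2, 'memory_mb': 2048}}
    q = init_instance(db, 'inst-1')
    assert q.committed and 'inst-1' not in db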

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/89761
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Removing "In Progress" status and assignee as change is abandoned.

Changed in nova:
status: In Progress → Confirmed
assignee: Rajesh Tailor (rajesh-tailor) → nobody
Revision history for this message
Rajesh Tailor (rajesh-tailor) wrote :

Hi Davanum,

The following patch is abandoned, which is submitted earlier by another assignee.
https://review.openstack.org/89761

I had already submitted patch to address this issue and it is still under review.
https://review.openstack.org/122347

Changed in nova:
assignee: nobody → Rajesh Tailor (rajesh-tailor)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Joe Gordon (jogo) wrote :

Marking this as Critical since it affects all users and quota issues were reported as being one of our top pain points.

Changed in nova:
milestone: none → kilo-3
importance: High → Critical
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → kilo-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/170118

Revision history for this message
John Garbutt (johngarbutt) wrote :

given the current state of this conversation, I just can't see us fixing this over the next few days :'(

I really want to try to help fix these issues. Find me on IRC @johnthetubaguy.

tags: added: kilo-rc-potential
Changed in nova:
milestone: kilo-rc1 → none
importance: Critical → High
tags: added: kilo-backport-potential
removed: juno-backport-potential
Changed in nova:
milestone: none → liberty-1
Revision history for this message
John Garbutt (johngarbutt) wrote :

This feels too risky for RC2, given the size of the fix.

Thierry Carrez (ttx)
tags: removed: kilo-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/122347
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=71e99b1d56627e8a929f21b88e672f8ee34a53a1
Submitter: Jenkins
Branch: master

commit 71e99b1d56627e8a929f21b88e672f8ee34a53a1
Author: Rajesh Tailor <email address hidden>
Date: Tue Sep 16 05:50:47 2014 -0700

    Fix quota-update of instances stuck in deleting when nova-compute startup finish

    Quotas are not updated correctly for the instances whose task_state is
    in deleting status during the nova-compute init_host call.

    Added code to pass correct quota to the _delete_instance method.

    Change-Id: Ida84d2d49d46540e0581dc3a58844c30bc1d2cff
    Partial-Bug: 1296414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/188362

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/189563

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/189563
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2635ee08266858a564596ca7393e11be4e050606
Submitter: Jenkins
Branch: stable/kilo

commit 2635ee08266858a564596ca7393e11be4e050606
Author: Rajesh Tailor <email address hidden>
Date: Tue Sep 16 05:50:47 2014 -0700

    Fix quota-update stuck in deleting when nova-compute startup finish

    Quotas are not updated correctly for the instances whose task_state is
    in deleting status during the nova-compute init_host call.

    Added code to pass correct quota to the _delete_instance method.

    (cherry picked from commit 71e99b1d56627e8a929f21b88e672f8ee34a53a1)

    Change-Id: Ida84d2d49d46540e0581dc3a58844c30bc1d2cff
    Partial-Bug: 1296414

tags: added: in-stable-kilo
Revision history for this message
John Garbutt (johngarbutt) wrote :

Seems like the fix has merged in master, so I updated the status of the bug.

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/188362
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=90793b9d397e2987cfd0e658dd64eaa035c14ed7
Submitter: Jenkins
Branch: stable/juno

commit 90793b9d397e2987cfd0e658dd64eaa035c14ed7
Author: Rajesh Tailor <email address hidden>
Date: Tue Sep 16 05:50:47 2014 -0700

    Fix quota-update in deleting when nova-compute startup finish

    Quotas are not updated correctly for the instances whose task_state is
    in deleting status during the nova-compute init_host call.

    Added code to pass correct quota to the _delete_instance method.

    (cherry picked from commit 2635ee08266858a564596ca7393e11be4e050606)

    Change-Id: Ida84d2d49d46540e0581dc3a58844c30bc1d2cff
    Partial-Bug: 1296414

tags: added: in-stable-juno
Revision history for this message
John Garbutt (johngarbutt) wrote :

Apparently this is a two-part fix; the second part is here:
https://review.openstack.org/#/c/170118

Changed in nova:
status: Fix Released → In Progress
milestone: liberty-1 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/212374

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/212427

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Rajesh Tailor (<email address hidden>) on branch: master
Review: https://review.openstack.org/212427
Reason: It is duplicate patch of change Id: I424e31c39baa4d7c96bb29687cfa45b7a6092e32

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/212374
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=60411ddf1bada3629f0588de8726c63b5ab00357
Submitter: Jenkins
Branch: master

commit 60411ddf1bada3629f0588de8726c63b5ab00357
Author: Rajesh Tailor <email address hidden>
Date: Fri Aug 7 06:26:59 2015 -0700

    Move quota delta reserve methods from api to utils

    Moved the quota delta reserve methods from compute/api to compute/utils, so
    that they can be used by compute/manager in a dependent patch.

    Change-Id: Idb02104764d490951949a74a9c0d45b532ee7782
    Related-Bug: #1296414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170118
Reason: It seems this patch isn't being actively worked. I am therefore abandoning it. Feel free to restore when you're ready.

Changed in nova:
assignee: Rajesh Tailor (rajesh-tailor) → nobody
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → Dmitry Stepanenko (dstepanenko)
Matt Riedemann (mriedem)
tags: removed: kilo-backport-potential
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Solving an inconsistency: this bug report has an assignee and it looks
like this could result in a patch, so I'm switching the status to
"In Progress".
NOTE: This is tightly coupled to bug 1284424 which also got discussed on the ML with http://lists.openstack.org/pipermail/openstack-dev/2016-April/092249.html

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Sean Dague (sdague) wrote :

The master patch seems to have merged.

Changed in nova:
status: In Progress → Fix Released