[ocata only] quota usage not decremented during boot/delete race

Bug #1783613 reported by melanie witt
Affects                   Status         Importance  Assigned to   Milestone
OpenStack Compute (nova)  Invalid        Undecided   Unassigned
Ocata                     Fix Committed  Undecided   melanie witt

Bug Description

While sending rapid, parallel boot and delete requests to nova, a customer noticed that quota usage was not decremented after instances were deleted. As a result, the number of instances in use no longer matched the recorded quota usage (quota out of sync).

Looking at the code, I noticed that the _delete_while_booting method checks for the presence of the build request. If the build request is found, we delete it, look up the instance, and commit the quota usage decrement if we found the instance. However, we *don't* commit the quota usage decrement if we did not find the instance (after finding the build request).

I think that's wrong. If we found the build request, we're in the middle of booting. If we then don't find the instance, it means conductor either a) didn't create the instance record yet or b) deleted the instance record after getting BuildRequestNotFound, because we (compute/api) had deleted the build request. In either case, conductor isn't going to do anything to decrement the quota usage, so we need to do it in compute/api.

I think the fix is to decrement the quota usage whether we find the instance or not, after we successfully delete the build request.
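
A minimal sketch of the proposed change, for illustration only. It does not use nova's real code: FakeQuotas, delete_while_booting, and the commit_when_instance_missing flag are hypothetical stand-ins that only model the control flow described above.

```python
class FakeQuotas:
    """Hypothetical stand-in for a quota reservation (not nova's real class)."""

    def __init__(self, instances_in_use):
        self.instances_in_use = instances_in_use

    def commit(self):
        # Committing the pending delete reservation decrements usage by one.
        self.instances_in_use -= 1


def delete_while_booting(build_request_found, instance_found, quotas,
                         commit_when_instance_missing=True):
    """Simplified model of the boot/delete race handling.

    Returns True if the delete was handled on the "still booting" path.
    """
    if not build_request_found:
        # Not booting any more: the regular delete path takes over.
        return False

    # (The build request would be deleted here.)
    if instance_found:
        # (The instance record would be destroyed here.)
        quotas.commit()
    elif commit_when_instance_missing:
        # Proposed fix: conductor will not decrement usage, so do it here.
        quotas.commit()
    return True


# Behaviour described in this report: usage stays at 1 after the delete.
buggy = FakeQuotas(instances_in_use=1)
delete_while_booting(True, False, buggy, commit_when_instance_missing=False)

# Proposed behaviour: usage drops to 0 even though no instance was found.
fixed = FakeQuotas(instances_in_use=1)
delete_while_booting(True, False, fixed)
```

With the flag disabled the model reproduces the out-of-sync usage; with it enabled the decrement is committed once the build request has been found, regardless of the instance lookup result.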

melanie witt (melwitt) wrote :

This bug does not exist on the master branch.

Changed in nova:
status: Triaged → Invalid
assignee: melanie witt (melwitt) → nobody
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/588416

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/588416
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3120f40f636ab4cbdcb9f5cd4e9175e99e1e1c63
Submitter: Zuul
Branch: stable/ocata

commit 3120f40f636ab4cbdcb9f5cd4e9175e99e1e1c63
Author: melanie witt <email address hidden>
Date: Fri Aug 3 00:45:04 2018 +0000

    [stable only] Add functional regression test for bug 1783613

    Related-Bug: #1783613

    Change-Id: Ib65b0c7c5b83c53edf63f23fa522334c6a666ae4

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/582413
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=456b3d68bc491a6965316e6a4f3fc11dd509237b
Submitter: Zuul
Branch: stable/ocata

commit 456b3d68bc491a6965316e6a4f3fc11dd509237b
Author: melanie witt <email address hidden>
Date: Thu Jul 12 22:42:19 2018 +0000

    [stable only] Handle quota usage during create/delete races

    There is a race during an instance delete while booting where, if the
    build request was found, there is still a chance we can fail to lookup
    the instance record and if so, we need to go ahead and commit the quota
    decrement reservations. We're currently skipping the quotas.commit() if
    we *don't* find the instance record. I believe this is a regression as of
    commit 361b88383130be8da9d474d02a9f62136239d506.

    There are two reasons why we may fail to find the instance record after
    we successfully found the build request:

      a) Conductor didn't create the instance record yet.
      b) Conductor deleted the instance record after finding the build
         request was deleted by us (the API)

    In either case, conductor doesn't do anything to decrement quota usage
    so we need to do it in the compute/api.

    There is a second race where we find the build request and succeed in
    looking up the instance record but then fail to delete the instance
    record (instance.destroy() raises InstanceNotFound). This can happen if

      c) Conductor deleted the instance record after finding the build
         request was deleted by us but after we looked up the instance
      d) If we (the API) are racing with another delete request

    If c) happens, we need to quotas.commit(). If d) happens, we need to
    quotas.rollback(). Since we can't know which case it was when we get
    InstanceNotFound, we'll do a quotas.rollback() to get rid of the
    reservation record and force a refresh of quota usages to make sure
    we end with correct usage in case rollback was not the right choice.

    Closes-Bug: #1783613

    Change-Id: I4d492dcad193ff0b1202dcae97954cc65b75c109
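
The handling of the second race described in the commit message above (roll back, then force a usage refresh when the instance turns out to be gone) can be sketched roughly as follows. This is an assumption-laden model, not the merged patch: FakeQuotas, its refresh() method, the count_instances callable, and destroy_with_quota are hypothetical names, and InstanceNotFound here is a local stand-in for nova's exception.

```python
class InstanceNotFound(Exception):
    """Local stand-in for nova's InstanceNotFound exception."""


class FakeQuotas:
    """Hypothetical quota model holding one pending delete reservation."""

    def __init__(self, usage, count_instances):
        self.usage = usage                # recorded quota usage
        self.reserved = -1                # pending decrement reservation
        self._count = count_instances     # callable: real instance count

    def commit(self):
        # Apply the pending decrement to the recorded usage.
        self.usage += self.reserved
        self.reserved = 0

    def rollback(self):
        # Drop the reservation without touching the recorded usage.
        self.reserved = 0

    def refresh(self):
        # Resynchronise recorded usage with the real instance count.
        self.usage = self._count()


def destroy_with_quota(destroy, quotas):
    """Destroy an instance and settle the quota reservation."""
    try:
        destroy()
    except InstanceNotFound:
        # Cannot tell case c) from case d): roll back, then refresh so the
        # usage ends up correct even if rollback was the wrong choice.
        quotas.rollback()
        quotas.refresh()
    else:
        quotas.commit()


def already_gone():
    raise InstanceNotFound()


# Case c): conductor removed the record; usage was never decremented.
case_c = FakeQuotas(usage=1, count_instances=lambda: 0)
destroy_with_quota(already_gone, case_c)

# Case d): a competing delete already committed its own decrement.
case_d = FakeQuotas(usage=0, count_instances=lambda: 0)
destroy_with_quota(already_gone, case_d)
```

In this model both cases end with a usage of 0 and no leftover reservation, which is the outcome the rollback-plus-refresh approach is meant to guarantee.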

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.4

This issue was fixed in the openstack/nova 15.1.4 release.
