Should count instances in build requests when checking quotas

Bug #1716706 reported by Zhenyu Zheng
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

In https://blueprints.launchpad.net/nova/+spec/cells-count-resources-to-check-quota-in-api we
changed to a new quota check workflow that counts all consumed resources across cells and
compares them with the limits.

But the resources in build_requests are not counted. This means that if I send concurrent
requests to create 20 instances against a quota of 10, I'll probably get 10 instances in ERROR
state rather than 10 successful instances plus 10 requests rejected with 403s. That could be a
significant behavior change for users.

We should also count instance records from build requests as consumed resources, to avoid
letting creation requests pass the API-layer check only to be stopped by the conductor-layer
check, which leaves instances in ERROR state.

Changed in nova:
assignee: nobody → Zhenyu Zheng (zhengzhenyu)
Revision history for this message
Matt Riedemann (mriedem) wrote :

We should be able to easily count build_requests in the API DB and include those in the count for instances, but the API also checks cores/ram when counting instances and we don't have that information in the build_request (well, we do, but it's in the serialized build_requests.instance field, which can't be queried via SQL).

And we can't use Placement to count VCPU/MEMORY_MB during the API check since the allocations aren't created in Placement yet (those get created when the scheduler picks a host).

So we might need to add vcpus/memory_mb fields on the build_requests table to mimic the instances table for counting those resources before the API casts to conductor.
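
For illustration, here is a minimal sketch of that counting query, using stdlib sqlite3 as a stand-in for the API DB. The vcpus/memory_mb columns on build_requests are the *proposed* addition discussed above, not nova's existing schema, and the table layout is simplified:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE build_requests "
             "(instance_uuid TEXT, project_id TEXT, vcpus INT, memory_mb INT)")
conn.execute("INSERT INTO build_requests VALUES ('u1', 'p1', 2, 2048)")
conn.execute("INSERT INTO build_requests VALUES ('u2', 'p1', 4, 4096)")

# A single aggregate query yields the instances/cores/ram consumed by
# pending build requests, mirroring the count the API already does
# against the instances table.
row = conn.execute(
    "SELECT COUNT(*), COALESCE(SUM(vcpus), 0), COALESCE(SUM(memory_mb), 0) "
    "FROM build_requests WHERE project_id = ?", ("p1",)).fetchone()
print(row)  # (2, 6, 6144)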

Changed in nova:
status: New → Triaged
importance: Undecided → High
tags: added: quotas
Revision history for this message
Matt Riedemann (mriedem) wrote :

The other thing we have to handle here is if we count build_requests, we have to avoid double counting where the build_request and instance exist at the same time, which we know is the case when we recheck quota in conductor (after the instances are created in the cells but before the build requests are deleted). So only the API service can count build_requests *and* instances in cells, but conductor can only count instances in cells.

Revision history for this message
Matt Riedemann (mriedem) wrote :

melwitt made a good point that we'd still have to avoid double counting build_requests and instances regardless of whether the call comes from the API or conductor:

(10:21:33 AM) melwitt: mriedem: no, in the API where it is now. some other instances might be at the conductor stage and have both a build_request and an instance, right?
(10:21:46 AM) mriedem: oh yeah, good point
(10:21:51 AM) mriedem: separate concurrent request
(10:21:55 AM) mriedem: getting double counted
(10:21:55 AM) melwitt: yeah

What we could do is count the build_requests first and store a list of UUIDs, and when we count the instances, we exclude any that have a UUID in that list.
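
For illustration, a small self-contained sketch of that de-duplication idea (plain Python with illustrative names, not nova's actual objects):

def count_consumed_instances(build_request_uuids, cell_instance_uuids):
    # Every build request counts as a consumed instance.
    counted = set(build_request_uuids)
    # Instances across cells count too, but an instance whose build
    # request still exists must not be counted twice; the set union
    # de-duplicates by UUID.
    counted.update(cell_instance_uuids)
    return len(counted)

# 'uuid-b' exists both as a build request and as an instance (a request
# currently at the conductor stage): it is counted once, giving 3, not 4.
assert count_consumed_instances(
    ["uuid-a", "uuid-b"], ["uuid-b", "uuid-c"]) == 3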

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is a script that can be used to recreate the issue. Pass in the number of instances you want; I used 30 and had one instance go over the quota limit and fail the recheck in conductor.

#!/bin/bash
# Boot $NUM servers concurrently so several requests pass the API quota
# check before any of them is counted, reproducing the race.

set -x

NUM=$1
IMAGE=$2

for i in $(seq 1 "$NUM"); do
   nova boot --flavor 1 --image "$IMAGE" test-over-quota-$i &
done

As a result of this, I found that the instance didn't actually go into ERROR state because of another bug:

Sep 13 17:58:26 devstack-queens nova-conductor[3129]: WARNING nova.scheduler.utils [None req-90a115b2-5838-4be2-afe2-a3b755015e19 demo demo] [instance: 888925b0-164a-4d4a-bb6c-c0426f904e95] Setting instance to ERROR state.: TooManyInstances: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances
Sep 13 17:58:26 devstack-queens nova-conductor[3129]: ERROR root [None req-90a115b2-5838-4be2-afe2-a3b755015e19 demo demo] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/nova/nova/conductor/manager.py", line 1003, in schedule_and_build_instances\n orig_num_req=len(build_requests))\n', ' File "/opt/stack/nova/nova/compute/utils.py", line 764, in check_num_instances_quota\n allowed=total_alloweds)\n', 'TooManyInstances: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances\n']: InstanceNotFound: Instance 888925b0-164a-4d4a-bb6c-c0426f904e95 could not be found.
Sep 13 17:58:26 devstack-queens nova-conductor[3129]: ERROR oslo_messaging.rpc.server [None req-90a115b2-5838-4be2-afe2-a3b755015e19 demo demo] Exception during message handling: InstanceNotFound: Instance 888925b0-164a-4d4a-bb6c-c0426f904e95 could not be found.
Sep 13 17:58:26 devstack-queens nova-conductor[3129]: ERROR oslo_messaging.rpc.server InstanceNotFound: Instance 888925b0-164a-4d4a-bb6c-c0426f904e95 could not be found.

This happens because we don't target the cell when updating the instance:

https://github.com/openstack/nova/blob/cfdec41eeec5fab220702efefdaafc45559aeb14/nova/conductor/manager.py#L1168
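
For reference, a sketch of the missing cell targeting (this assumes nova's context.target_cell helper and Instance object; the function name here is hypothetical):

from nova import context as nova_context
from nova import objects
from nova.compute import vm_states

def mark_instance_error(ctxt, instance_uuid, cell_mapping):
    # The instance record lives in a cell database, so target that cell
    # before the lookup; an untargeted context raises InstanceNotFound,
    # as in the traceback above.
    with nova_context.target_cell(ctxt, cell_mapping) as cctxt:
        instance = objects.Instance.get_by_uuid(cctxt, instance_uuid)
        instance.vm_state = vm_states.ERROR
        instance.save()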

Changed in nova:
importance: High → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1717000 is the issue with putting the instance into ERROR state mentioned in comment #4.

Revision history for this message
John Garbutt (johngarbutt) wrote :

@Zhenyu Zheng, I am wondering if you are still working on a fix for this one? If not, it might be good to drop the "Assigned to" bit. Many thanks.

Changed in nova:
assignee: Zhenyu Zheng (zhengzhenyu) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/515631
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f49ec409fd8fe2dc38d2978255c68850c427f94a
Submitter: Zuul
Branch: master

commit f49ec409fd8fe2dc38d2978255c68850c427f94a
Author: Kevin_Zheng <email address hidden>
Date: Fri Oct 27 15:35:05 2017 +0800

    Mention API behavior change when over quota limit

    In
    https://blueprints.launchpad.net/nova/+spec/cells-count-resources-to-check-quota-in-api
    we introduced a new quota check workflow. It is possible for
    concurrent requests to pass the API-layer checks but be blocked by
    the conductor-layer checks.

    This can trigger a user-noticeable API behavior change: previously,
    if my request was blocked by quota checks, I would get an HTTP 403
    response and no instance records would be left.

    After the above-mentioned change, it is possible that my request
    fails the conductor-layer quota check and I get an instance in ERROR
    state. In a busy cloud, users may accumulate many ERROR instances
    this way, and the instance count may go beyond the limit.

    We should at least mention this behavior change in the release note.

    Change-Id: I05606fffab4e24fc55465067b66c6a035a787d1e
    Related-Bug: #1716706

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/525511

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/525511
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=43dbbf8d877124145b277129cbb8a12a7a0e7a0d
Submitter: Zuul
Branch: stable/pike

commit 43dbbf8d877124145b277129cbb8a12a7a0e7a0d
Author: Kevin_Zheng <email address hidden>
Date: Fri Oct 27 15:35:05 2017 +0800

    Mention API behavior change when over quota limit

    In
    https://blueprints.launchpad.net/nova/+spec/cells-count-resources-to-check-quota-in-api
    we introduced a new quota check workflow. It is possible for
    concurrent requests to pass the API-layer checks but be blocked by
    the conductor-layer checks.

    This can trigger a user-noticeable API behavior change: previously,
    if my request was blocked by quota checks, I would get an HTTP 403
    response and no instance records would be left.

    After the above-mentioned change, it is possible that my request
    fails the conductor-layer quota check and I get an instance in ERROR
    state. In a busy cloud, users may accumulate many ERROR instances
    this way, and the instance count may go beyond the limit.

    We should at least mention this behavior change in the release note.

    Change-Id: I05606fffab4e24fc55465067b66c6a035a787d1e
    Related-Bug: #1716706
    (cherry picked from commit f49ec409fd8fe2dc38d2978255c68850c427f94a)

tags: added: in-stable-pike
Revision history for this message
melanie witt (melwitt) wrote :

This bug came up again in an internal bug at my organization, so I took another look at it.

I noticed that we create build requests *after* the initial quota limit check in compute/api (which I think we should keep, to bail early if there is no room), so counting build requests in addition to instances (with UUID de-duping) isn't quite enough to handle the issue. I think we also need to relocate the CONF.quota.recheck_quota logic from conductor to compute/api, after we create the build request. This would also keep all the quota checking in compute/api, which makes for somewhat cleaner code. A self-contained toy of that ordering (all names hypothetical) follows.

class OverQuota(Exception):
    pass

LIMIT = 10
build_requests = []  # stand-in for the API DB's build_requests table

def check_quota(requested):
    if len(build_requests) + requested > LIMIT:
        raise OverQuota()

def create_instance():
    check_quota(1)                    # initial check: bail early if no room
    build_requests.append(object())   # create the build request
    try:
        # Recheck with the new build request now counted as consumed;
        # concurrent racers see each other's build requests here, in the
        # API, instead of failing later in conductor.
        check_quota(0)
    except OverQuota:
        build_requests.pop()          # roll back; no ERROR instance is left
        raise
    # ...cast to conductor; no conductor-side recheck needed...

Regarding the handling of VCPU and MEMORY_MB, I had another idea: correlate request specs with build requests and use the flavor info in the request spec to count VCPU and MEMORY_MB. But that would be inefficient, since we'd have to loop over each request spec's flavor info, versus an efficient SQL count if we added VCPU and MEMORY_MB columns to the build_requests table.
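
A sketch of that request-spec correlation approach (illustrative names; nova's actual objects differ), which shows why it is one object load per pending request versus a single SQL aggregate:

from collections import namedtuple

Flavor = namedtuple("Flavor", "vcpus memory_mb")

def count_from_request_specs(flavors_by_uuid, build_request_uuids):
    # Walk each build request's request-spec flavor and sum in Python:
    # correct, but O(n) object loads on every quota check, versus a
    # single SQL SUM() if the columns lived on the build_requests table.
    vcpus = memory_mb = 0
    for uuid in build_request_uuids:
        flavor = flavors_by_uuid[uuid]
        vcpus += flavor.vcpus
        memory_mb += flavor.memory_mb
    return vcpus, memory_mb

# Example: two pending build requests.
print(count_from_request_specs(
    {"u1": Flavor(2, 2048), "u2": Flavor(4, 4096)}, ["u1", "u2"]))
# -> (6, 6144)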
