openstack core count may be inaccurate

Bug #2067628 reported by Paride Legovini
This bug affects 1 person
Affects: Auto Package Testing
Status: In Progress
Importance: High
Assigned to: Skia

Bug Description

It has been reported (thanks ginggs) that failures can happen in autopkgtest-cloud with this error:

  Quota exceeded for cores: Requested 2, but already used 514 of 515 cores (HTTP 403)

Full log at [1]. I checked the number of running VMs shortly after, and counted fewer than 400 cores in use (taking into account that autopkgtest-big instances have 4 cores).

This may be a side effect of dropping the flock [2]. It may be that the instance deletion is asynchronous, and cores are freed only after the delete operation is complete.

We should do something like:

1. Figure out a way to query OpenStack for the current quota usage, and check how well it matches the number of running VMs (a rough sketch follows this list).
2. Check whether in the worker we can do something like instance.delete(wait=True), so that we wait for the VM to be deleted before proceeding. I made that option up, but given that the CLI tool has a wait parameter, delete() likely has something similar.
3. Check whether this improves the comparison from point (1.).
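
For point (1.), here is a rough sketch of how the current quota usage could be queried. It assumes openstacksdk and a clouds.yaml entry for the region; the attribute names are the openstacksdk spellings of Nova's totalCoresUsed / totalInstancesUsed absolute limits, and the cloud name is just an example:

  # Sketch only: read the usage figures that quota enforcement sees.
  import openstack

  conn = openstack.connect(cloud="lcy02")  # example cloud name

  # Nova's absolute limits, as exposed by openstacksdk.
  limits = conn.compute.get_limits().absolute
  print("quota:", {"core": limits.total_cores_used,
                   "instance": limits.instances_used})

These numbers can then be compared against a manual count of the running VMs.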

[1] https://autopkgtest.ubuntu.com/results/autopkgtest-xenial/xenial/i386/u/ubuntu-advantage-tools/20240528_152118_2d212@/log.gz
[2] https://salsa.debian.org/ubuntu-ci-team/autopkgtest/-/commit/49f5760dddcdf7b3f70c177f3000391d1db0dbdd

Tags: adt-564


Changed in auto-package-testing:
importance: Undecided → High
tags: added: adt-564
Skia (hyask)
Changed in auto-package-testing:
assignee: nobody → Skia (hyask)
Revision history for this message
Skia (hyask) wrote :

I've built a tiny script to compare what is reported by the quota and a manual count.

It prints the quota, then the count, then the quota again: since computing the manual count can take a bit of time, this quickly shows whether there are inconsistencies between the first and second quota displayed.

During the manual count, it also prints any instance whose status is neither `ACTIVE` nor `ERROR`, because those are the two most common statuses and it has been verified that they are correctly counted in the quota. `SHUTOFF` has also been verified as counting towards the quota, but it is rare enough not to make too much noise, and printing it also helps with cleaning up those usually old VMs.
So far, all the printed VMs have been in the `BUILD` state, and since I've always observed inconsistencies between the reported quota and the manual count, I guess they get added to the quota while in this state.
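
The script is roughly along these lines (a simplified sketch, not the actual script; it assumes openstacksdk, one clouds.yaml entry per region, and dict-style access to the flavor embedded in each server, which only carries `vcpus` on clouds speaking compute API microversion >= 2.47):

  #!/usr/bin/env python3
  import sys

  import openstack


  def quota(conn):
      # Nova's absolute limits are the usage that quota enforcement sees
      # (totalCoresUsed / totalInstancesUsed in the raw API).
      limits = conn.compute.get_limits().absolute
      return {"core": limits.total_cores_used,
              "instance": limits.instances_used}


  def count(conn):
      totals = {"core": 0, "instance": 0}
      flavor_cache = {}
      for server in conn.compute.servers():
          flavor = server.flavor or {}
          # With compute API microversion >= 2.47 the embedded flavor already
          # contains vcpus; otherwise look the flavor up by id, once.
          vcpus = flavor.get("vcpus")
          if not vcpus:
              fid = flavor.get("id")
              if fid not in flavor_cache:
                  flavor_cache[fid] = conn.compute.get_flavor(fid).vcpus
              vcpus = flavor_cache[fid]
          if server.status not in ("ACTIVE", "ERROR"):
              print(f"{server.name} - {server.status} - {vcpus}")
          totals["core"] += vcpus
          totals["instance"] += 1
      return totals


  conn = openstack.connect(cloud=sys.argv[1])  # e.g. "lcy02"
  print("quota:", quota(conn))
  print("count:", count(conn))
  print("quota:", quota(conn))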

Here are example outputs from the script, with some comments:

# lcy02:

quota: {'core': 92, 'instance': 43}
adt-noble-amd64-liblocale-us-perl-20231129-134245-juju-7f2275-prod-proposed-migration-environment-3 - BUILD - autopkgtest - 2
adt-focal-amd64-nvidia-graphics-drivers-525-20231129-140832-juju-7f2275-prod-proposed-migration-environment-2 - BUILD - autopkgtest - 2
adt-noble-amd64-systemd-upstream-20240625-110038-juju-7f2275-prod-proposed-migration-environment-3-2792f97d-63b5-4c80-a259-8ef43a4d062e - BUILD - autopkgtest - 2
adt-oracular-amd64-dgit-20240625-082704-juju-7f2275-prod-proposed-migration-environment-3-1cb642ce-cf37-4f8e-a233-f561d63f5557 - BUILD - autopkgtest - 2
adt-noble-amd64-systemd-upstream-20240625-101614-juju-7f2275-prod-proposed-migration-environment-3-93c90fad-c9eb-4c82-9f69-b8d4a267b8e2 - BUILD - autopkgtest - 2
count: {'core': 96, 'instance': 45}
quota: {'core': 92, 'instance': 43}

This is a very common example: a few VMs in `BUILD`, and the reported quota is a bit below the manual count.

# bos03-arm64:

quota: {'core': 186, 'instance': 42}
adt-oracular-arm64-r-cran-ps-20240617-073206-juju-7f2275-prod-proposed-migration-environment-3-82b1810f-7af3-447f-b772-c474b3675c87 - BUILD - autopkgtest - 2
count: {'core': 188, 'instance': 43}
quota: {'core': 186, 'instance': 42}

This one is interesting, because the delta between the counted and reported values corresponds exactly to the single VM that is displayed. In addition, trying to `openstack server show` this VM reports `No server with a name or ID of '[...]' exists.`, confirming that OpenStack is clearly inconsistent for this one.

# bos02-arm64:

quota: {'core': 145, 'instance': 84}
count: {'core': 55, 'instance': 25}
quota: {'core': 145, 'instance': 84}

This one is really weird: lots of instances counted in the quota, but only about a third of them shown by `openstack server list`. This is probably a case where we should ask IS to run some magic.

# bos03-s390x and bos03-ppc64el:

quota: {'core': 0, 'instance': 0}
count: {'core': 0, 'instance': 0}
quota: {'core': 0, 'instance': 0}

Not very interesting, but at least it's consistent: since we don't use those OpenStack regions, the zeros are everywhere.

Changed in auto-package-testing:
status: New → In Progress
Revision history for this message
Skia (hyask) wrote :

About point 2. and the `instance.delete(wait=True)` idea: I've dug a bit into the code of the OpenStack clients, and found that:
* The `--wait` option of the `openstack server delete` command is implemented here: https://opendev.org/openstack/python-openstackclient/src/commit/dd6ac285d5b35f87947f9c00f5558b55e9787747/openstackclient/compute/v2/server.py#L2135
* This calls the `wait_for_delete` function defined here: https://opendev.org/openstack/osc-lib/src/commit/f9bcdecf1abd0c38fd60cfd7d99a00b113bd6d57/osc_lib/utils/__init__.py#L661
* This is a simple polling loop, there's nothing more.
* I need to run some experiments, but it looks doable to make that same call to `wait_for_delete` in the worker code (rough sketch after this list).
* Later on, maybe this is something that could be implemented directly in `novaclient` here: https://opendev.org/openstack/python-novaclient/src/commit/2bd135c13793da48af5fbad5208b00aaba7e39dc/novaclient/v2/servers.py#L66, to add that convenient `wait=True` parameter, but that would first require opening a bug somewhere to discuss it with the maintainers.
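
A minimal sketch of what that could look like in the worker, assuming it keeps a novaclient client object around (the names `nova_client`, `server` and `delete_and_wait` are placeholders, not the worker's actual code):

  import logging

  from osc_lib import utils as osc_utils


  def delete_and_wait(nova_client, server, timeout=300):
      # Ask Nova to delete the server, then block until it is really gone.
      server.delete()
      # wait_for_delete() simply polls nova_client.servers.get(server.id)
      # until that raises NotFound; it returns False on timeout or if the
      # resource ends up in an error state.
      if not osc_utils.wait_for_delete(nova_client.servers, server.id,
                                       sleep_time=5, timeout=timeout):
          logging.warning("timed out waiting for %s to be deleted", server.id)
          return False
      return True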
