Quota exceeded for instances: Requested 1, but already used 10 of 10 instances

Bug #1257641 reported by Joe Gordon on 2013-12-04
This bug affects 6 people
Affects: tempest
Importance: Medium
Assigned to: Unassigned

Bug Description

Tempest (or nova) appears to be leaking servers. Since we run gate tests with concurrency=4 I don't think we should normally hit a case where we are consuming 10 instances.

http://logs.openstack.org/84/59684/4/check/check-tempest-dsvm-postgres-full/9e38ebb/testr_results.html.gz

http://logstash.openstack.org/#eyJzZWFyY2giOiJcIlF1b3RhIGV4Y2VlZGVkIGZvciBpbnN0YW5jZXM6IFJlcXVlc3RlZCAxLCBidXQgYWxyZWFkeSB1c2VkIDEwIG9mIDEwIGluc3RhbmNlc1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzg2MTQ1OTIzMDMwfQ==

query: "Quota exceeded for instances: Requested 1, but already used 10 of 10 instances" AND filename:"console.html"

First seen on 12/3

Joe Gordon (jogo) wrote :

Possible nova patches that may have triggered this:

I7b116d8154036c43b0feb23f5669f52358408d2b
Ibffe35da221730283b8ad3016a25dd42b5526f37
Id4bff71b2c46ec9dccea6d8ac36eba273e48ea5d

Matt Riedemann (mriedem) on 2013-12-05
Changed in nova:
status: New → Confirmed
tags: added: testing
Attila Fazekas (afazekas) wrote :

The root cause is that the instance remains in the 'image_uploading' state for more than 196 seconds, which is not expected.
grep -B 1 8bf31076-4c33-4b41.*image_uploading tempest.txt.gz # http://logs.openstack.org/34/59034/5/check/check-tempest-dsvm-full/548ebd0/logs/tempest.txt.gz

It is not a quota issue.

The quota issue is a secondary failure.

NOTE: This tempest change would help make the real issue more visible: https://review.openstack.org/#/c/54318/
I will rebase it once the 'image_uploading' state issue is addressed.

Changed in tempest:
assignee: nobody → Ken'ichi Ohmichi (oomichi)

Fix proposed to branch: master
Review: https://review.openstack.org/61313

Changed in tempest:
status: New → In Progress
Ken'ichi Ohmichi (oomichi) wrote :

According to http://logs.openstack.org/84/59684/4/check/check-tempest-dsvm-postgres-full/9e38ebb/logs/screen-n-api.txt.gz ,
many server deletions happen against the same server (id: f026ea0a-2416-47fa-86e4-2bc90e01276d) in the ServerActionsTestJSON tests:

$ grep ServerActionsTestJSON-tempest-1102878952-tenant screen-n-api.txt.gz | grep "DELETE /v2/1b34950e30e6452bbae32b9f06d7bf53/servers/" | awk '{print $1 " " $2 " " $10 " " $13}' | sed s@"/v2/1b34950e30e6452bbae32b9f06d7bf53/servers/"@""@g
2013-12-04 06:34:49.639 f026ea0a-2416-47fa-86e4-2bc90e01276d 204
2013-12-04 06:35:20.347 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:35:32.904 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:35:54.841 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:36:04.800 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:36:39.970 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:36:52.429 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:37:09.871 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:37:23.965 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:37:36.607 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:38:17.267 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:38:17.641 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:38:18.022 f026ea0a-2416-47fa-86e4-2bc90e01276d 404
2013-12-04 06:38:18.231 ad4beaca-836a-44f7-822e-e23217906f11 204
2013-12-04 06:38:18.530 288c5e83-f843-4666-a7c1-7f609c0fee79 204
2013-12-04 06:38:18.990 3fb965f6-53e5-4126-91b2-0fac6c35fdb0 204

The server was created at "2013-12-04 06:30:26.613".
It would have been created in setUpClass() as the first server:

    def setUpClass(cls):
        super(ServerActionsTestJSON, cls).setUpClass()
        cls.client = cls.servers_client
        cls.server_id = cls.rebuild_server(None)

The scope of the "server_id" variable seems wrong.
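The scoping pitfall can be illustrated with a minimal, self-contained Python sketch (the class and attribute names here are illustrative, not Tempest's actual code): setUpClass stores the id on the class, but each test object writes the rebuilt id to itself only, so the class-level id is never updated.

```python
# Minimal sketch of the scoping pitfall (illustrative names, not Tempest code).
class FakeTestCase:
    server_id = None                  # class attribute, shared by all tests

    @classmethod
    def setUpClass(cls):
        cls.server_id = "server-1"    # stands in for rebuild_server(None)

    def rebuild_server(self):
        # Assigning via self creates an *instance* attribute that shadows
        # the class attribute; the class-level id is never updated.
        self.server_id = "server-2"

FakeTestCase.setUpClass()
first_test = FakeTestCase()
first_test.rebuild_server()
second_test = FakeTestCase()
print(second_test.server_id)  # still "server-1": the next test rebuilds again
```

Because each test object reads the stale class-level id, every test that finds the old server gone rebuilds yet another server, which is how the quota fills up.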

Ken'ichi Ohmichi (oomichi) wrote :

We can reproduce this problem in a local environment by applying the attached patch to Tempest.
The patch triggers one server re-creation, after which "Quota exceeded" happens.

Reviewed: https://review.openstack.org/61313
Committed: http://github.com/openstack/tempest/commit/122cdf5cea8e479949f993513ec556e9b12adb59
Submitter: Jenkins
Branch: master

commit 122cdf5cea8e479949f993513ec556e9b12adb59
Author: Ken'ichi Ohmichi <email address hidden>
Date: Wed Dec 11 21:32:25 2013 +0900

    Fix the scope to share a server between tests

    In test_server_actions.py, one server is created in setUpClass()
    and the server would be shared between all tests.
    If the server status is not 'ACTIVE' before each test,
    rebuild_server() deletes it and rebuilds a server again.

    The server_id scope which is passed to rebuild_server() is class
    variable not instance variable. In addition, the server_id scope
    which is stored with the return value of rebuild_server() is
    instance variable like the following:

      self.server_id = self.rebuild_server(self.server_id)
      #  ^ [[instance]]                     ^ [[class]]

    If we need to rebuild a server, the server_id that is passed to
    rebuild_server() cannot be changed because of its scope. As a
    result, many server creations happen in each test, and that causes
    the "Quota exceeded" exception.

    This patch fixes the scope of server_id so that the same server is
    shared between the tests.

    Change-Id: I72b4aa73a6c77cce3a8e777d4f78913fb0986f66
    Closes-Bug: #1257641
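The fix amounts to writing the rebuilt id back through the class rather than the instance, so every subsequent test sees the new server. A minimal sketch of the corrected pattern (illustrative names, not Tempest's actual code):

```python
# Sketch of the fixed pattern (illustrative names, not Tempest code).
class FakeTestCase:
    server_id = None

    @classmethod
    def setUpClass(cls):
        cls.server_id = "server-1"

    def rebuild_server(self):
        # Write through the class, not the instance, so the new id is
        # shared by every remaining test in the class.
        self.__class__.server_id = "server-2"

FakeTestCase.setUpClass()
FakeTestCase().rebuild_server()
next_test = FakeTestCase()
print(next_test.server_id)  # "server-2": the rebuilt server is reused
```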

Changed in tempest:
status: In Progress → Fix Released
Ken'ichi Ohmichi (oomichi) wrote :

This problem is due to a Tempest bug, not a Nova one.

Changed in nova:
status: Confirmed → Invalid
Joe Gordon (jogo) wrote :

http://status.openstack.org/elastic-recheck/ shows this bug hasn't been seen in 12 hours, thanks!

Joe Gordon (jogo) wrote :

This has returned.

Changed in tempest:
status: Fix Released → New
Changed in tempest:
status: New → Fix Released
status: Fix Released → New
Attila Fazekas (afazekas) wrote :

The quota issue is always a secondary failure.

One change broke the server recreation logic in tempest; that bug is triggered by a REAL failure, which then leads to the quota issue.
That bug is fixed in tempest.

Check this log, for example: in this non-isolated case almost all server creations failed.
http://logs.openstack.org/79/47079/15/check/check-tempest-dsvm-neutron/0d75ad0/console.html
38 failures, mostly caused by servers remaining in the ERROR state in the neutron TEMPEST_ALLOW_TENANT_ISOLATION=False job.

But tempest should delete the ERROR-ed servers anyway.

Attila Fazekas (afazekas) wrote :

If server creation fails in a setUpClass, tearDownClass will not be called.

Each setUpClass should have a decorator which:
 - calls tearDownClass (and logs any exception raised during teardown)
 - re-raises the original exception
 - replaces the classmethod decorator, and is used only with setUpClass

The tearDown methods need to be prepared for a partially executed setUpClass.
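A hypothetical decorator along those lines could look like the sketch below (this is not Tempest's real API; the names `safe_setup`, `FailingSetup`, and `cleaned_up` are invented for illustration):

```python
import functools

def safe_setup(f):
    # Hypothetical decorator, used in place of @classmethod on setUpClass:
    # if setup fails, run tearDownClass so partially created resources are
    # cleaned up, log any teardown error, and re-raise the original exception.
    @functools.wraps(f)
    def wrapper(cls):
        try:
            f(cls)
        except Exception:
            try:
                cls.tearDownClass()
            except Exception as td_err:
                print(f"tearDownClass failed: {td_err}")
            raise  # keep the original setup failure visible
    return classmethod(wrapper)

class FailingSetup:
    cleaned_up = False

    @safe_setup
    def setUpClass(cls):
        raise RuntimeError("server went to ERROR in setUpClass")

    @classmethod
    def tearDownClass(cls):
        # Must tolerate a partially executed setUpClass.
        cls.cleaned_up = True

try:
    FailingSetup.setUpClass()
except RuntimeError:
    pass
print(FailingSetup.cleaned_up)  # True: teardown ran despite the setup failure
```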

Sean Dague (sdague) on 2013-12-19
Changed in tempest:
status: New → Confirmed
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Could bug 1186354 and its patch https://review.openstack.org/#/c/43822/ be related to this at all?

Ken'ichi Ohmichi (oomichi) wrote :

This problem seems to have two causes.

1. In check-tempest-dsvm-neutron, "server creation" fails and BuildError servers remain.
2. If BuildError servers appear in setUpClass of Tempest, Tempest cannot delete them.
When BuildError servers appear many times in setUpClass, that causes this "Quota exceeded".

The first cause seems due to Nova or Neutron.
The second one is due to Tempest.

https://review.openstack.org/#/c/63365/ tries to fix the second cause.

Reviewed: https://review.openstack.org/63365
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=51c8c2637e7751c67b020bf006611bb63e5b8884
Submitter: Jenkins
Branch: master

commit 51c8c2637e7751c67b020bf006611bb63e5b8884
Author: Ken'ichi Ohmichi <email address hidden>
Date: Sat Dec 21 03:30:37 2013 +0900

    Delete a BuildError server

    If a server creation fails with create_test_server() in setUpClass,
    the server would not be removed because tearDownClass is not called.
    As a result, many BuildError servers accumulate and "Quota exceeded"
    happens if server creations continue to fail.

    This patch makes create_test_server() remove a server when BuildError
    happens, to avoid the "Quota exceeded" exception and to make the
    original problem behind the BuildError easier to investigate.

    Change-Id: I55bb386bbea550de11cc7a81c5d056680057523b
    Related-Bug: #1257641
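The idea behind this change can be sketched as follows (the client class, exception name, and method signatures here are stand-ins, not Tempest's real API): if the server never reaches ACTIVE, delete it before re-raising so the failed build does not keep consuming quota.

```python
# Sketch of the BuildError cleanup (stand-in names, not Tempest's real API).
class BuildErrorException(Exception):
    pass

class FakeServersClient:
    def __init__(self):
        self.deleted = []

    def create_server(self):
        return {"id": "srv-1"}

    def wait_for_server_status(self, server_id, status):
        # Simulate a build that goes to ERROR instead of ACTIVE.
        raise BuildErrorException("server went to ERROR")

    def delete_server(self, server_id):
        self.deleted.append(server_id)

def create_test_server(client):
    server = client.create_server()
    try:
        client.wait_for_server_status(server["id"], "ACTIVE")
    except BuildErrorException:
        client.delete_server(server["id"])  # free the quota slot first
        raise                               # then surface the real failure
    return server

client = FakeServersClient()
try:
    create_test_server(client)
except BuildErrorException:
    pass
print(client.deleted)  # ["srv-1"]: the failed build was cleaned up
```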

Joe Gordon (jogo) on 2014-01-11
no longer affects: nova
Jay Pipes (jaypipes) wrote :

I don't see any occurrences of this in the last two weeks. Can this be closed out?

Ken'ichi Ohmichi (oomichi) wrote :

I confirmed that Tempest sends a "delete the server" request when a BuildError server appears, as implemented by the https://review.openstack.org/63365 patch.

I picked http://logs.openstack.org/76/67976/2/check/check-tempest-dsvm-neutron/3f207e7/ as a sample.
The first BuildError server (b3dea397-0281-45a1-ae17-b339f6b13134) was detected at 2014-01-21 04:32:12.857, and Tempest sent a "delete the server" request at the same time. But this problem still happens.

I guess that is because server creations are faster than server deletions.
So adding wait_for_resource_deletion() after sending a "delete the server" request would be a workaround.

Ken'ichi Ohmichi (oomichi) wrote :

My comment #22 is invalid.

http://logs.openstack.org/18/62118/7/check/check-tempest-dsvm-full/3bfd6ec/logs/

All requests to Nova failed.
Both the "create server" and "delete server" APIs failed due to "No space left on device".

Related fix proposed to branch: master
Review: https://review.openstack.org/75830

Jaroslav Henner (jhenner) wrote :

Above is a draft of what afazekas is proposing.

Changed in tempest:
assignee: Ken'ichi Ohmichi (oomichi) → nobody
Attila Fazekas (afazekas) wrote :

http://logs.openstack.org/45/94945/3/check/check-grenade-dsvm-neutron/77fc511/logs/old/screen-n-sch.txt.gz#_2014-05-30_01_43_54_403

Now this log message is seen only in neutron (non-isolated) havana->icehouse grenade jobs.

For instance, in this job n-cpu did not start on the old side of the grenade:
http://logs.openstack.org/45/94945/3/check/check-grenade-dsvm-neutron/77fc511/logs/old/screen-n-sch.txt.gz#_2014-05-30_01_43_54_403

In another case the n-sch log is not found, so it was probably not started:
http://logs.openstack.org/35/94435/4/check/check-grenade-dsvm-neutron-icehouse/b4a9319/logs/old/

The root cause is still not a tempest issue, and it is different from the reported issue.

Changed in tempest:
importance: High → Medium
Matt Riedemann (mriedem) wrote :

We aren't seeing this in the gate, so I'm marking it as incomplete since the investigation seems to have stalled.

Changed in tempest:
status: Confirmed → Incomplete
Matt Riedemann (mriedem) on 2014-10-16
Changed in tempest:
status: Incomplete → Invalid