nova quota statistics can be incorrect

Bug #1284424 reported by Robert Collins
This bug affects 24 people
Affects                    Status        Importance  Assigned to         Milestone
OpenStack Compute (nova)   Fix Released  High        Dmitry Stepanenko
tripleo                    Fix Released  High        Unassigned

Bug Description

On the ci-overcloud we had a couple of network interruptions to the control plane. Subsequent to this, nova is reporting:
 nova boot --image user --flavor m1.small --key-name default live-migration-test2 --nic net-id=f69ac547-db64-4e69-ae70-e5233634aff0
ERROR: Quota exceeded for instances: Requested 1, but already used 100 of 100 instances (HTTP 413) (Request-ID: req-0c96b3bc-dc37-4685-8227-02398b3bea6b)
(ci-overcloud-nodepool)robertc@lifelesshp:~/work$ nova list | wc -l
42

That is - nova thinks there are 100 instances, but there are only 42. We haven't done any DB surgery or anything that could cause this, so methinks we've uncovered a bug.
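
A quick way to check whether the recorded usage and the real instance count have diverged is to compare the quota_usages table against the instances table directly. A rough sketch (table and column names follow the nova DB schema of this era; the project id is a placeholder, not the actual ci-overcloud tenant):

  SELECT resource, in_use, reserved
    FROM quota_usages
   WHERE project_id = '<tenant-id>';

  SELECT COUNT(*)
    FROM instances
   WHERE project_id = '<tenant-id>' AND deleted = 0;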

Tags: compute quotas
Tracy Jones (tjones-i)
tags: added: compute network
tags: removed: network
Changed in nova:
importance: Undecided → High
status: New → Triaged
Changed in nova:
assignee: nobody → Shawn Hartsock (hartsock)
Changed in nova:
status: Triaged → In Progress
Revision history for this message
Michael Still (mikal) wrote :

I think we need more detail in this bug report in order to determine what went wrong. It would be interesting to see a dump of the quota tables when this happens.

Changed in nova:
status: In Progress → Incomplete
Revision history for this message
Michael Still (mikal) wrote :

(Also, Shawn, you should let gerrit set this to in progress when you upload a review to address the problem).

Revision history for this message
Robert Collins (lifeless) wrote :

So this has happened again, just in regular use as a CI cloud. Mikal - what information would you like - just a select * from quota_* ?

Changed in nova:
status: Incomplete → New
Revision history for this message
Robert Collins (lifeless) wrote :

Once we gather whatever data is needed:

11:14 < lifeless> SpamapS: hey, you had a query to fix quotas in the ci-overcloud right ?
11:14 < SpamapS> lifeless: yes... it was...
11:14 < lifeless> SpamapS: perhaps put a copy of it in https://bugs.launchpad.net/tripleo/+bug/1284424
11:14 < uvirtbot> Launchpad bug 1284424 in tripleo "nova quota statistics can be incorrect" [High,Triaged]
11:15 < SpamapS> update quota_usages set in_use=-1 where project_id='64d2d3bc07084ef1accd4e3502909c77';
11:15 < SpamapS> lifeless: that id == nodepool
11:15 < lifeless> that forces a native recalculate?
11:16 < SpamapS> lifeless: it did for me
11:16 < SpamapS> which begs the question..
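
For context, the reason setting in_use to -1 works is that nova's DB quota driver recounts a project's usage at the next reservation when the stored value looks invalid or stale. A heavily simplified sketch of that decision (my own stand-in function, not the actual code; the real logic lives in quota_reserve() in nova/db/sqlalchemy/api.py):

  import datetime

  def needs_refresh(usage, max_age, until_refresh):
      """Decide whether a cached quota_usages row should be recounted."""
      if usage is None:                 # no usage row for this resource yet
          return True
      if usage.in_use < 0:              # the "set in_use = -1" trick lands here
          return True
      if until_refresh and usage.until_refresh <= 0:
          return True                   # periodic recount every N reservations
      age = (datetime.datetime.utcnow() - usage.updated_at).total_seconds()
      if max_age and age >= max_age:
          return True                   # recount rows that have gone stale
      return False

As far as I can tell, the recount only happens when the next reservation is made (e.g. the next nova boot for that tenant), not immediately after the UPDATE.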

Revision history for this message
melanie witt (melwitt) wrote :

Setting to confirmed since the reporter saw the issue again.

Changed in nova:
status: New → Confirmed
Matt Riedemann (mriedem)
tags: added: quotas
Jay Pipes (jaypipes)
Changed in nova:
assignee: Shawn Hartsock (hartsock) → nobody
Revision history for this message
danieru (samuraidanieru) wrote :

I can add that I'm seeing this problem with a vanilla install, per the Icehouse install instructions for Ubuntu Trusty 14.04. It seems like the delete process isn't cleaning everything up properly. I tried deleting several instances at once from the dashboard. They got stuck in an error state during deletion and I had to manually use 'nova reset-state' and then 'nova delete' to make them disappear. The quota is still counting them.
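
For the record, the manual cleanup described above looks roughly like this (the UUID is a placeholder):

  nova reset-state <instance-uuid>   # puts the instance back into an error state
  nova delete <instance-uuid>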

Revision history for this message
Rajani Srivastava (srivastava-rajani) wrote :

Hi Robert,

Are you referring to the "nova usage-list" command?

When we delete an instance, that command still shows usage for the deleted instance in the CLI, while it is no longer displayed in the dashboard.

Revision history for this message
haruka tanizawa (h-tanizawa) wrote :

Hi Robert,
Sorry to ask, but I'd like to understand this bug in more detail.
Is this related to cleaning up quota when deleting an instance, or to nova usage-list?
My concern is the 'network interruptions' mentioned in your report.
The number 42 is very far from 100. Does this still happen?
If you have a specific reproduction scenario, a more detailed write-up would be very much appreciated.

Revision history for this message
haruka tanizawa (h-tanizawa) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1353962 is possibly related; it was seen in the gate.

Revision history for this message
Kevin Fox (kevpn) wrote :

I've just run into this too with a fresh RDO Juno on SL7. I had some config options wrong on a compute node and launched some test instances onto it. They launched with errors and deleted fine, but the quotas are wrong.

I can't recall exactly, but the reproducer in this thread:
http://www.gossamer-threads.com/lists/openstack/dev/38332

Sounds very similar to what happened. I think I may have had a bad neutron_admin_password in /etc/nova/nova.conf, but I may be misremembering exactly which password option/config file was involved.

Revision history for this message
Kevin Fox (kevpn) wrote :

I also tried SpamapS's recommendation for fixing the quotas by setting the tenant's quota usage values to -1. They have not been updated yet; they are still all -1. I also tried restarting nova-compute on all the hosts to see if that would update things. It didn't. Is there some amount of time it takes to sync, or does something else need to happen before things get updated?

Revision history for this message
Joe Gordon (jogo) wrote :

A few possible ways to reproduce this:

https://github.com/pcrews/rannsaka

add a mock on top of RPC calls/casts to force them to fail some percentage of the time.
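
A rough, hypothetical sketch of that fault-injection idea (the wrapper and the patched method are only illustrative, not something from this bug):

  import functools
  import random

  def flaky(real_call, failure_rate=0.1):
      """Wrap an RPC call/cast so it fails some percentage of the time."""
      @functools.wraps(real_call)
      def wrapper(*args, **kwargs):
          if random.random() < failure_rate:
              raise RuntimeError('injected RPC failure')
          return real_call(*args, **kwargs)
      return wrapper

  # e.g. monkey-patch a compute RPC method in a test run:
  # rpcapi.terminate_instance = flaky(rpcapi.terminate_instance)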

Revision history for this message
Patrick Crews (patrick-crews) wrote :

This command line should be a good starting point against a devstack install:
rannsaka.py --host=http://$KEYSTONE_IP --requests 20000 -w 15 --test-file=locust_files/server_quota.py

Also, I seem to recall quota statistics going awry after a libvirt restart, but that will need further testing to verify; I will update tomorrow after more tests.

Revision history for this message
Joe Gordon (jogo) wrote :

I was able to reproduce the failure with:

http://paste.ubuntu.com/10587937/ + restarting nova api services while the test ran

Results:

 http://paste.openstack.org/show/191961/

Revision history for this message
Joe Gordon (jogo) wrote :

Using "nova quota-delete --tenant" to correct the quotas

Revision history for this message
Joe Gordon (jogo) wrote :

Looks like restarting nova-compute during nova boots and nova deletes causes the quota sync issue.

Revision history for this message
Joe Gordon (jogo) wrote :

Note: setting max_age or until_refresh in nova.conf should help minimize quota out-of-sync issues.
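
For example, something along these lines in nova.conf (option names as they exist in this era's nova, under [DEFAULT]; the values are only illustrative):

  [DEFAULT]
  until_refresh = 25    # recount usage after every 25 reservations
  max_age = 3600        # recount usage records older than an hour (seconds)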

Revision history for this message
Joe Gordon (jogo) wrote :

After further testing, the quotas don't get out of sync from restarting n-cpu during instance boots; this happens when restarting n-cpu during instance deletes.

Revision history for this message
Joe Gordon (jogo) wrote :
Changed in nova:
assignee: nobody → BalaGopalaKrishna (bala-9)
Changed in nova:
assignee: BalaGopalaKrishna (bala-9) → nobody
Revision history for this message
Chris Friesen (cbf123) wrote :

Should this be marked as a dupe of https://bugs.launchpad.net/nova/+bug/1296414 ?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/293800

Changed in nova:
assignee: nobody → Dmitry Stepanenko (dstepanenko)
Revision history for this message
Dmitry Stepanenko (dstepanenko) wrote :

Reproduced the issue on devstack with reclaim_instance_interval set in nova.conf. There have been several changes that affect this issue, and the results now look a bit different, but the quota statistics can still be incorrect.

http://paste.openstack.org/show/491422/

The easiest way to reproduce the issue is to run Joe's script and terminate the nova-compute service right after the message 'Request to delete server ... has been accepted' appears. This leads to a state where the quota usage is higher than the number of instances.
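
A manual version of the same idea, roughly, on a devstack node (image/flavor names and the kill timing are the fragile parts; the kill has to land in the window described above):

  nova boot --image <image> --flavor m1.tiny quota-repro
  nova delete quota-repro            # wait for "has been accepted"
  sudo pkill -9 -f nova-compute      # kill n-cpu before the quota commit runs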

Revision history for this message
Dmitry Stepanenko (dstepanenko) wrote :

I found that my repro hit a scenario where the instance was already deleted but the quota commit hadn't happened yet (the nova-* process died right after removing the instance). This happens in the _delete_instance method of nova/compute/manager.py, before _complete_deletion runs: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2465
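
To make the window concrete, here is a heavily simplified, self-contained sketch of that ordering (stand-in stubs, not the real nova code; the real path is _delete_instance -> _complete_deletion in nova/compute/manager.py):

  class Quotas(object):
      def __init__(self):
          self.in_use, self.reserved = 1, -1    # one instance, one pending delete reservation

      def commit(self):
          self.in_use += self.reserved          # usage finally drops to 0
          self.reserved = 0

  def delete_instance(instance, quotas, crash_before_commit=False):
      instance['deleted'] = True                # instance row is marked deleted
      if crash_before_commit:
          raise SystemExit('nova-compute died') # the quota commit never runs
      quotas.commit()                           # normally done via _complete_deletion

  quotas = Quotas()
  try:
      delete_instance({'deleted': False}, quotas, crash_before_commit=True)
  except SystemExit:
      pass
  print(quotas.in_use)   # still 1: the instance is gone but usage was never decremented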

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Steven Hardy (shardy) wrote : potentially eol bug

This bug was reported against an old version of TripleO, and may no longer be valid.

Since it was reported before the start of the liberty cycle (and our oldest stable
branch is stable/liberty), I'm marking this incomplete.

Please reopen this (change the status from incomplete) if the bug is still valid
on a current supported (stable/liberty, stable/mitaka or trunk) version of TripleO,
thanks!

Changed in tripleo:
status: Triaged → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/293800

doscho (doscho)
Changed in nova:
status: In Progress → New
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/293800
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Ben Nemec (bnemec)
Changed in tripleo:
status: Incomplete → Fix Released