quota_usage data constantly out of sync (needs test)

Bug #1202896 reported by Sam Morrison
This bug affects 13 people
Affects                     Status       Importance   Assigned to
Cinder                      Incomplete   High         Seif Lotfy
OpenStack Compute (nova)    Confirmed    High         Unassigned

Bug Description

With Folsom we constantly had the quota_usage table out of sync; we set max_age to 10 minutes to help purge stale usage.

Since upgrading to Grizzly this seems to have gotten worse: I've had to change max_age to 30 seconds because we get a lot of users complaining.
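
For reference, that workaround is just the standard quota max_age option in nova.conf; a minimal sketch with the 30-second value mentioned above (in Grizzly these quota options live under [DEFAULT]):

[DEFAULT]
# Re-sync a project's quota_usages from the real records whenever the cached
# usage is older than this many seconds (0, the default, disables the age check).
max_age = 30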

I'm not really sure how to replicate it as it seems pretty complicated and there are probably many edge cases, but I thought I'd better report it anyway.

Happy to help debug this somehow too.

Revision history for this message
Joe Gordon (jogo) wrote :

Which quotas got out of date? Can you provide any further detail?

Revision history for this message
Sam Morrison (sorrison) wrote :

Cores, instances and ram get out of sync; for cinder it's volumes, gigabytes and snapshots.

It can happen when you delete an instance but the delete fails: the instance goes to vm_state error, task_state deleting, and the quota_usage values are decremented.

The instance still remains in the system for the user, and they are able to delete it again. This causes the quota_usage values to be decremented a second time.
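
A self-contained toy model of that sequence (all names invented for illustration; this is not the nova code path, just the arithmetic of the failure mode):

# Toy model of the double-decrement described above.
usage = {'instances': 1, 'cores': 2, 'ram': 8192}

def delete_instance(instance, fail=False):
    # Usage is decremented as part of the delete path...
    usage['instances'] -= 1
    usage['cores'] -= instance['vcpus']
    usage['ram'] -= instance['memory_mb']
    if fail:
        # ...but the hypervisor delete fails, leaving the instance in
        # vm_state=error / task_state=deleting while usage was already reduced.
        instance['vm_state'] = 'error'
        return
    instance['deleted'] = True

vm = {'vcpus': 2, 'memory_mb': 8192, 'vm_state': 'active', 'deleted': False}
delete_instance(vm, fail=True)   # first attempt fails after the decrement
delete_instance(vm)              # user deletes the errored instance again
print(usage)                     # {'instances': -1, 'cores': -2, 'ram': -8192}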

This is pretty critical for us as it means projects can use more than their quota.

This is an easy one to replicate, but the most common out-of-sync errors we get are when quota_usage values are higher than they should be.

I'll try and get some more concrete examples.

Revision history for this message
John Griffith (john-griffith) wrote :

I'll work on reproducing this, but in the meantime I also noticed we're adjusting quotas in both the api and the manager, which is obviously going to cause some issues.

Changed in cinder:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → John Griffith (john-griffith)
milestone: none → 2013.1.3
Revision history for this message
Joe Gordon (jogo) wrote :

John, this may be similar to a nova issue we had a while back: https://bugs.launchpad.net/nova/+bug/1098380

Changed in cinder:
assignee: John Griffith (john-griffith) → nobody
milestone: 2013.1.3 → none
Revision history for this message
Sam Morrison (sorrison) wrote :

Here's an example of a user who is experiencing this issue:

mysql> select * from reservations where project_id ='cda9642942d24b7cab4bf1d56f61b5e7';
+---------------------+------------+---------------------+--------+--------------------------------------+----------+----------------------------------+-----------------+-------+---------------------+---------+
| created_at | updated_at | deleted_at | id | uuid | usage_id | project_id | resource | delta | expire | deleted |
+---------------------+------------+---------------------+--------+--------------------------------------+----------+----------------------------------+-----------------+-------+---------------------+---------+
| 2013-07-31 05:36:38 | NULL | 2013-07-31 05:36:39 | 523281 | b1cbdb12-88a6-4601-8b86-5228f31d1ef2 | 4578 | cda9642942d24b7cab4bf1d56f61b5e7 | security_groups | 1 | 2013-08-01 05:36:38 | 523281 |
| 2013-07-31 05:36:41 | NULL | 2013-07-31 05:36:42 | 523284 | 5119268c-1ca4-486b-b173-41f18c166880 | 4578 | cda9642942d24b7cab4bf1d56f61b5e7 | security_groups | 1 | 2013-08-01 05:36:41 | 523284 |
| 2013-07-31 05:36:43 | NULL | 2013-07-31 05:36:44 | 523287 | cfb61f29-f609-4165-b929-773334809869 | 4578 | cda9642942d24b7cab4bf1d56f61b5e7 | security_groups | 1 | 2013-08-01 05:36:43 | 523287 |
| 2013-07-31 05:41:12 | NULL | NULL | 523461 | ae97502d-ec53-46e2-aa9b-3ab17ad5bd23 | 4581 | cda9642942d24b7cab4bf1d56f61b5e7 | instances | 1 | 2013-08-01 05:41:12 | 0 |
| 2013-07-31 05:41:12 | NULL | NULL | 523464 | 34845c5c-2281-4177-bb8b-a78f71499d5d | 4584 | cda9642942d24b7cab4bf1d56f61b5e7 | ram | 8192 | 2013-08-01 05:41:12 | 0 |
| 2013-07-31 05:41:12 | NULL | NULL | 523467 | 86da791d-0adb-4c1f-ae60-5368f813552a | 4587 | cda9642942d24b7cab4bf1d56f61b5e7 | cores | 2 | 2013-08-01 05:41:12 | 0 |
| 2013-07-31 05:48:45 | NULL | 2013-07-31 05:48:45 | 523812 | f21a32c3-dfdc-4d1e-9ca2-0129200975e3 | 4581 | cda9642942d24b7cab4bf1d56f61b5e7 | instances | -1 | 2013-08-01 05:48:44 | 523812 |
| 2013-07-31 05:48:45 | NULL | 2013-07-31 05:48:45 | 523815 | 277d3c09-8b47-4add-87f3-56230c94be3c | 4584 | cda9642942d24b7cab4bf1d56f61b5e7 | ram | -8192 | 2013-08-01 05:48:44 | 523815 |
| 2013-07-31 05:48:45 | NULL | 2013-07-31 05:48:45 | 523818 | efb33b21-de55-4c42-abbb-7ab423a94d45 | 4587 | cda9642942d24b7cab4bf1d56f61b5e7 | cores | -2 | 2013-08-01 05:48:44 | 523818 |
+---------------------+------------+---------------------+--------+--------------------------------------+----------+----------------------------------+-----------------+-------+---------------------+---------+
9 rows in set (0.01 sec)

mysql> select * from quota_usages where project_id = 'cda9642942d24b7cab4bf1d56f61b5e7';
+---------------------+---------------------+------------+------+----------------------------------+-----------------+--------+----------+---------------+---------+
| created_at ...


Thierry Carrez (ttx)
Changed in cinder:
milestone: none → havana-3
melanie witt (melwitt)
tags: added: api compute
Changed in cinder:
assignee: nobody → John Griffith (john-griffith)
melanie witt (melwitt)
Changed in nova:
importance: Undecided → Critical
status: New → Confirmed
Revision history for this message
Joshua Hesketh (joshua.hesketh) wrote :
Changed in nova:
status: Confirmed → In Progress
status: In Progress → Confirmed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-3
Changed in cinder:
milestone: havana-3 → havana-rc1
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-3 → havana-rc1
Changed in nova:
importance: Critical → High
Changed in cinder:
importance: Critical → High
assignee: John Griffith (john-griffith) → nobody
tags: added: havana-rc-proposed
Changed in nova:
milestone: havana-rc1 → none
tags: added: havana-rc-potential
removed: havana-rc-proposed
Changed in cinder:
milestone: havana-rc1 → next
Seif Lotfy (seif)
Changed in cinder:
assignee: nobody → Seif Lotfy (seif)
Revision history for this message
Seif Lotfy (seif) wrote :

My current solution is to get rid of the quota_usage table and make a view out of it that reflects the state of the volumes and snapshots tables.
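
For illustration, deriving volume usage directly from the volumes table (instead of trusting quota_usages) would look roughly like this; a sketch only, the helper itself is hypothetical and the column names are as they appear in the cinder schema:

# Sketch: compute a project's actual volume usage from the volumes table.
from sqlalchemy import func
from cinder.db.sqlalchemy.models import Volume

def actual_volume_usage(session, project_id):
    count, gigabytes = (
        session.query(func.count(Volume.id),
                      func.coalesce(func.sum(Volume.size), 0))
               .filter(Volume.project_id == project_id,
                       Volume.deleted == False)
               .one())
    return {'volumes': count, 'gigabytes': int(gigabytes)}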

Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
Revision history for this message
Joe Gordon (jogo) wrote :

Sam, are you still seeing this issue? If so, how can I reproduce it in nova?

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Joe Gordon (jogo) wrote :

Marking as incomplete because I'm not sure how to reproduce this.

Revision history for this message
Sam Morrison (sorrison) wrote :

Yeah, it's a tricky one: we have ~3000 users and approximately 30k instance boots per month, and we only see this a couple of times a month at most.

I'm sure there is a bug in there somewhere, but the rate at which we hit it and my inability to reproduce it make this pretty hard to fix. We're about to upgrade to Havana, so we might see it less and less as the code matures.

Revision history for this message
gabriel staicu (gabriel-staicu) wrote :

This always happens to me. I have an HA setup for the controller part of OpenStack based on MySQL Galera, and every time I terminate several instances at once (more than 7), the quota_usages table in the nova database still shows some resources in use.

I have a Havana setup on Ubuntu 12.04.

Revision history for this message
Sam Morrison (sorrison) wrote :

Yes, this is still a problem for us. We have written a little script to sync the quota_usages table, which we run every 6 hours or so.
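
A minimal sketch of what such a sync script can look like (not the actual script; table and column names assume the standard nova schema shown earlier in this bug):

# Hypothetical quota_usages sync: recompute per-project usage from the
# instances table and overwrite the cached values.
import MySQLdb

def sync_quota_usages(conn):
    cur = conn.cursor()
    cur.execute("""
        SELECT project_id, COUNT(*), SUM(vcpus), SUM(memory_mb)
        FROM instances
        WHERE deleted = 0
        GROUP BY project_id
    """)
    for project_id, instances, cores, ram in cur.fetchall():
        for resource, in_use in (('instances', instances),
                                 ('cores', cores),
                                 ('ram', ram)):
            cur.execute(
                "UPDATE quota_usages SET in_use = %s, updated_at = NOW() "
                "WHERE project_id = %s AND resource = %s AND deleted = 0",
                (in_use, project_id, resource))
    # Note: projects with no remaining instances don't appear in the SELECT,
    # so their rows would need a separate pass to zero out.
    conn.commit()

# e.g. sync_quota_usages(MySQLdb.connect(host='dbhost', user='nova',
#                                        passwd='...', db='nova'))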

Revision history for this message
Sam Morrison (sorrison) wrote :

Forgot to mention that upgrading to Havana made it worse.

Revision history for this message
Chris Behrens (cbehrens) wrote :

We need to audit the quota code and make sure we're only updating quotas when the DB records for the instance update successfully. Soft delete is a little interesting because it needs to update quotas before 'deleting' the DB record, but we then need to make sure a real delete later doesn't update them again.

Anyway, I filed this bug yesterday which is some of the problem: https://bugs.launchpad.net/bugs/1296414

Revision history for this message
Chris Behrens (cbehrens) wrote :
Revision history for this message
Jacob Cherkas (jcherkas) wrote :

This can be reproduced consistently by launching about 100 instances via the dashboard; after they have finished launching, select all the instances and terminate them.

The quota usage in the overview will always show some instances, RAM and vCPUs still in use.

Revision history for this message
Alexei Kornienko (alexei-kornienko) wrote :

I think all the mess we get with quotas is because the cinder quota implementation is not very consistent.
What I see is that we reserve quotas in one method and commit/rollback them in another method. A quota reservation can be passed over RPC, which means it can be reserved in one process and committed in another. IMHO such an approach is very fragile and error-prone.
I propose to implement quota reservation as a context manager:

with quotas.reserve(...) as reservation:
    ...

This will allow us to make sure that quotas always stay consistent and will remove the need to expire quotas. I can prepare a POC patch if you are interested. What do you think?
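
For reference, a minimal sketch of that idea on top of the existing QUOTAS.reserve/commit/rollback calls (illustrative only; the wrapper itself does not exist in cinder):

from contextlib import contextmanager

from cinder import quota

QUOTAS = quota.QUOTAS

@contextmanager
def reserve(context, **deltas):
    # Reserve up front, then commit on success or roll back on any error.
    reservations = QUOTAS.reserve(context, **deltas)
    try:
        yield reservations
    except Exception:
        QUOTAS.rollback(context, reservations)
        raise
    else:
        QUOTAS.commit(context, reservations)

# Usage, as proposed above:
#     with reserve(context, volumes=1, gigabytes=size) as reservation:
#         ...create the resource...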

Revision history for this message
John Griffith (john-griffith) wrote :

@Alexei,
I'd be interested in a POC. I'm curious, though, about the Nova side.

Anyway, I am also curious what we look like in Icehouse and Juno in this respect. Marking Incomplete until we get a better way to reproduce.

Changed in cinder:
status: Triaged → Incomplete
Revision history for this message
Sean Dague (sdague) wrote :

I'm marking this as confirmed because I think it's a real issue, there is actually a reproducer in here, and it realistically could be addressed with functional testing.

summary: - quota_usage data constantly out of sync
+ quota_usage data constantly out of sync (needs test)
tags: added: needs-functional-test
removed: havana-backport-potential
Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
Duncan Thomas (duncan-thomas) wrote :

> I propose to implement quota reservation as a context manager:
>
> with quotas.reserve(...) as reservation:
> ...
>
> This will allow us to make sure that quotas always stay consitent and
> will remove the need to expire quotas. I can prepare a POC patch if
> you are interested. What do you think?

I don't think we can do this in cinder; the fact that we reserve in the API and commit in the manager appears to be entirely necessary. We reserve in the API so that we can give sensible out-of-quota warnings back to the caller, but we can't commit until the manager, since we don't know whether the resource is actually consumed until then.
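
Roughly, the split being described looks like this (simplified sketch with placeholder helpers, not the actual cinder code; the real paths live in cinder.volume.api and cinder.volume.manager):

from cinder import quota

QUOTAS = quota.QUOTAS

# cinder-api process: reserve here, so over-quota errors reach the caller.
def create_volume_api(context, size):
    reservations = QUOTAS.reserve(context, volumes=1, gigabytes=size)
    rpc_cast_to_volume_manager(context, size, reservations)  # placeholder RPC call

# cinder-volume (manager) process: commit only once the resource really exists.
def create_volume_manager(context, size, reservations):
    try:
        provision_volume(size)  # placeholder for the actual create
    except Exception:
        QUOTAS.rollback(context, reservations)
        raise
    QUOTAS.commit(context, reservations)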

It might be useful to add more information to the reservation record - the request id that caused the reservation, for example - so we have some ability to match troublesome reservations to the rest of the logs.
