race between quota setting and quota being enforced

Bug #1310597 reported by Sean Dague
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Invalid
Undecided
Unassigned
tempest
Fix Released
High
Castulo J. Martinez

Bug Description

There is a race between the time that a quota is set and when it's enforced in swift. This seems to be exposed in the test_account_quotas_negative.py in tempest. The relevant code is:

    def setUp(self):
        super(AccountQuotasNegativeTest, self).setUp()
        # Set the reselleradmin auth in headers for next custom_account_client
        # request
        self.custom_account_client.auth_provider.set_alt_auth_data(
            request_part='headers',
            auth_data=self.reselleradmin_auth_data
        )
        # Set a quota of 20 bytes on the user's account before each test
        headers = {"X-Account-Meta-Quota-Bytes": "20"}

        self.os.custom_account_client.request("POST", url="", headers=headers,
                                              body="")

    def test_upload_large_object(self):
        object_name = data_utils.rand_name(name="TestObject")
        data = data_utils.arbitrary_string(30)
        self.assertRaises(exceptions.OverLimit,
                          self.object_client.create_object,
                          self.container_name, object_name, data)

Some amount of time this instead passes, as seen in this set of request logs:

2014-04-21 10:41:53,223 Request (AccountQuotasNegativeTest:setUp): 204 POST http://127.0.0.1:8080/v1/AUTH_6890d28fd6e543d1a5b3ba0932bff7d3
2014-04-21 10:41:53,248 Request (AccountQuotasNegativeTest:test_upload_large_object): 201 PUT http://127.0.0.1:8080/v1/AUTH_6890d28fd6e543d1a5b3ba0932bff7d3/TestContainer-1647149708/TestObject-1284036749 0.023s
2014-04-21 10:41:53,263 Request (AccountQuotasNegativeTest:tearDown): 204 POST http://127.0.0.1:8080/v1/AUTH_6890d28fd6e543d1a5b3ba0932bff7d3

The behavior of account quota updates being synchronous or async isn't specified here - http://docs.openstack.org/developer/swift/middleware.html#module-swift.common.middleware.account_quotas so this might be working as designed (and the test case needs to be adjusted to detect quota changes) or it might be a bug. Registering it on both Tempest and Swift for this reason.

Link to failed run demonstrating this behavior: http://logs.openstack.org/81/88481/1/check/check-tempest-dsvm-neutron/fac198e/console.html

Link to logstash showing these hits: http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkZBSUw6IHRlbXBlc3QuYXBpLm9iamVjdF9zdG9yYWdlLnRlc3RfYWNjb3VudF9xdW90YXNfbmVnYXRpdmUuQWNjb3VudFF1b3Rhc05lZ2F0aXZlVGVzdC50ZXN0X3VwbG9hZF9sYXJnZV9vYmplY3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5ODA4Mjk3MjAxNH0=

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/89340

Sean Dague (sdague)
Changed in tempest:
importance: Undecided → High
Revision history for this message
Samuel Merritt (torgomatic) wrote :

This is working as designed in Swift. Uploading an object does not synchronously update the account's X-Account-Bytes-Used value; rather, it is asynchronously updated by swift-container-updater at some future time. Until that happens, the account quota middleware will be using the old X-Account-Bytes-Used value, so some object PUTs that should fail will succeed instead.

There's not really a good upper bound on how long that delay is, either. The container updater walks the filesystem to find container DBs and pushes each one's stats to the relevant account (HTTP PUT request), so how long it takes is a function of the number of containers on the system and the average time per update and the failure rate of updates.

Revision history for this message
Daisuke Morita (morita-daisuke) wrote :

I guess that this failure is occurred by cache feature in Swift.

In this testcase, 30-byte object will be uploaded after quota is set as 20 bytes. Therefore, if quota is set correctly, the request should fail even though x-account-bytes-used is not updated according to L138-140 in account_quotas.py [1]. However, account information is sometimes picked up from memcache (L120 in account_quotas.py), therefore I guess the account info does not contain quota information.

Does above supposition seem right? I would love to hear from Samuel's comment.

This cache is refreshed in 60 seconds. So, sending HEAD request periodically to confirm that quota is set correctly is better solution than just skipping this testcase.

[1] https://github.com/openstack/swift/blob/master/swift/common/middleware/account_quotas.py

Revision history for this message
Samuel Merritt (torgomatic) wrote :

The main problem is the asynchronous update, not memcache. Even if you set your cache validity period to 0 (no caching at all), you'd still see the same behavior.

Revision history for this message
clayg (clay-gerrard) wrote :

The account_info in memcache doesn't really play into things like Sam is talking about. The account db's themselves are subject to the cycle time of the container-updater. In a test system you could ensure those daemons are running and test that at some timeout beyond their frequency account quotas will kick in.

But it doesn't sound like the async update is the problem exactly. They're making a 30 byte request with an account quota limit of 20 bytes. The POST to update the account metadata should clear the bits in memcache ahead of the object upload:

https://github.com/openstack/swift/blob/master/swift/proxy/controllers/account.py#L119

... but I can't find the subsequent GET_INFO call on the account logs ...

http://logs.openstack.org/81/88481/1/check/check-tempest-dsvm-neutron/fac198e/logs/

I tracked down the reseller admin POST that's probably setting the quota (tx8bc6e0eb7be040e580a3f), the container quota info check (tx0edf54df71a0433dad1ea), then the PUT for the object (tx0edf54df71a0433dad1ea) - which logs the 30 bytes. The account server sees the POST, but the next thing that shows up in the account log for that account is another POST (tx5564bed946244147b125f) - presumably some sort of clean up?

Couple of things I thought of but couldn't quite line up with the logs...

 * transfer encoding chunked bypassing the account quota (still should have called get_info)
 * reseller request for object upload (the POST is, but the PUT doesn't seem to be)
 * container quotas over-riding the check or get_info data (it's so hard to trace get_info, but every time I dig down through GETorHEAD_base I convince myself it's correct - CQ doesn't seem to be doing anything that would bypass the account verification in get_container_info)

... so I'm not sure what to make of it?

Maybe this code is wrong for over-limit requests on empty accounts?

https://github.com/openstack/swift/blob/master/swift/common/middleware/account_quotas.py#L121

... still don't see how that would explain account info apparently cached across a POST?

FWIW, I noticed the container-updater sweep come though about a minute after the test finished:

    account-server: 127.0.0.1 - - [21/Apr/2014:10:42:16 +0000] "PUT /sdb1/10/AUTH_6890d28fd6e543d1a5b3ba0932bff7d3/TestContainer-1647149708" 204 - "-" "-" "container-updater 12633"

Revision history for this message
Daisuke Morita (morita-daisuke) wrote :

Thanks, Samuel and Clay.

As Clay mentioned, this problem seems unrelevant to async update. I noticed that the interval between the finish time of POST (tx8bc6e0eb7be040e580a3f) and the start time of object PUT (tx0edf54df71a0433dad1ea) is just 0.006 sec. Therefore, asynchronous procedure can end up in different result.

As Clay also pointed, the POST request should clear the cache. This function call gets to 'memcache.delete(cache_key)' in base.py L391. This delete function sends delete request to memcached server. Honestly, I am unfamiliar with memcache, so I do not know the request is processed synchronously or asynchronously in memcached server. But, if it is processed asynchronously, this problem is a delicate timing problem.

https://github.com/openstack/swift/blob/master/swift/proxy/controllers/base.py#L391
https://github.com/openstack/swift/blob/master/swift/common/memcached.py#L423

Any comments?

If that helps, POST request (tx5564bed946244147b125f) is probably a removing quota request with X-Remove-Account-Meta-Quota-Bytes header in tearDown method. And, it may be OK for over-limit requests on empty accounts. If X-Account-Bytes-Used is 0, account_info['bytes'] has 'string' value of '0', therefore pass if-statement in L121.

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/89340
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=90952bc6432f2bf81e5a3418ee77ced6eb3320ea
Submitter: Jenkins
Branch: master

commit 90952bc6432f2bf81e5a3418ee77ced6eb3320ea
Author: Sean Dague <email address hidden>
Date: Mon Apr 21 08:38:17 2014 -0400

    skip the quota enforcement test for swift

    this test is racing because by the time the large object upload
    happens the quota update might not yet have propogated in swift.
    As it's unclear if this is a swift bug or tempest assuming too
    much, skipping is best for now.

    Change-Id: I0c0a5fc5a23125a9c51e220ae5a1218c6e9ca075
    Related-Bug: #1310597

Changed in swift:
status: New → Invalid
Changed in tempest:
status: New → Confirmed
Changed in tempest:
assignee: nobody → Castulo J. Martinez (castulo-martinez)
Revision history for this message
Castulo J. Martinez (castulo-martinez) wrote :

All the container and account updates are done by background daemons running in the cluster somewhere, not by anything in the HTTP-serving path, as a consumer of the HTTP API you have no idea when your object PUT will be reflected in account stats and hence quotas. It doesn't seem to be a way to run this test in a reliable manner, without depending on timing, at least not from Tempest, the test code would need to be able to actually control the running of the various daemons, so it could force an update when it wanted. I think the only thing to do is to remove this test completely from Tempest, besides it has been skipped now for more than 2 years.

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

What about polling the quotas and wait for the quotas to be updated ?

Revision history for this message
Castulo J. Martinez (castulo-martinez) wrote :

That could work Jordan, but when I tried that in my environment it would take some times more than 10 minutes to be enforced, that would be way too long for a test. I guess we could try and see how it behaves at the gate, but I'd need to add a timeout for it, what do you think that should be?

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The skipped test is not active in 2 years, and this swift behavior is designed as Swift developers said.
In addition, I don't think we need to spend much time for this negative case on the gate.
So it is nice to remove the test itself now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/354873

Changed in tempest:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/354873
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=09a16c44caaf4b201d299130d7700c76da2e7b12
Submitter: Jenkins
Branch: master

commit 09a16c44caaf4b201d299130d7700c76da2e7b12
Author: Castulo J. Martinez <email address hidden>
Date: Fri Aug 12 10:53:09 2016 -0700

    Remove the test_upload_large_objects negative test

    The test_upload_large_objects test that is part of the
    test_account_quotas_negative.py has been skipped for 2 years now,
    and there is no way to run this test in a reliable way since all the
    container and account updates are done by background daemons running
    in the cluster somewhere.
    This commit removes the test from Tempest since it is a test that
    won't be run anymore.

    Change-Id: Id3917003f0cf3c0ab652c9ea16d22c696f52933f
    Closes-Bug: 1310597

Changed in tempest:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.