Token List in Memcache can consume an entire memcache page

Bug #1171985 reported by Morgan Fainberg on 2013-04-23
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Identity (keystone)
High
Morgan Fainberg

Bug Description

The token list and revocation lists in the memcache token driver can hit the limit of a given memcache page. This can occur if a user continually issues tokens and does not let the list page(s) expire out of the memcache server. The revocation page, in theory, would never expire out of the memcache server if there was any consistent amount of revocations occurring.

Typically this will occur if the token count reaches somewhere in the 31,700 range using the 1MB default page size. While this is a fairly sizable number, the issue lies in that the memcache token driver does not have any logic to expire out tokens from either the revocation list or the active token list (with the exception of a delete, which moves the token from "active" to "revocation"). This means that there is the potential that any given user could exceed the size of these two lists over an extended period of time (or even a short period of time if an account issues/revokes a large quantity of tokens consistently).

The revocation list appears to be more sensitive to this effect since it looks to hold the entire token contents instead of just the ID.

When this event occurs, the result will be that no further tokens can be issued (for a given user/tenant combination) or no further tokens can be added to the revocation list.

Morgan Fainberg (mdrnstm) wrote :

This can be duplicated in grizzly, essex, and master by issuing a very large number of tokens for a given user/tenant combination without letting the the lists expire from memcache.

description: updated
Changed in keystone:
assignee: nobody → Morgan Fainberg (mdrnstm)
Adam Young (ayoung) wrote :

We had a discussion of this at the Summit. It sounds like the fix it to have a threshold check when assigning a token, and to clean up when we exceed the threshold. The threshold is probably percent of the page filled, but might also be time sensitive: for example we might decide to clean up all tokens older than 24 hours.

Morgan Fainberg (mdrnstm) wrote :

That was the approach I was going to aim for, however, since the current mechanism for assigning a token is to simply issue an append() instead of loading the entire list up and verifying the list, I have shifted over to handling the flushing of expired / deleted tokens from the list when/if the maximum page-size event happens. This means that only in the exceptional state of hitting an error do we take on the overhead of flushing out the now-dead tokens.

I am using the .cas() (compare and update) functionality in the memcache driver to ensure that if multiple keystones are being run (via wsgi or otherwise) race conditions are handled (with a configurable maximum number of retries).

The revocation list is being handled in much the same way, but also has a maximum-list-size configuration option that allows the revocation list to be flushed of defunct tokens on any given call to list_revoked_tokens(). The only sticking point I am running into is what to do if there are no tokens to clean up from the revocation list; do we remove the oldest tokens? closest to expiring? an arbitrary number? I am trying to avoid removal of tokens from revocation list prematurely for security reasons (however, I am sure this will be the minority of cases we have to handle this situation).

I am open to moving back to the original concept that would require a get/deserialize of the entire list (either user_record or revocation) each time to validate the length, but I think that generally speaking the overhead of issuing the already atomic .append() directly is a superior method simply to keep the overhead as low as possible unless there is a case to take action.

I am also looking to add a "created_at" field to handle the second case outlined that will allow for the flushing function to remove long-lived tokens (that have yet to expire) from the user_record if a flush is required. Based upon the fact that the revocation-list is a bit more sensitive to this type of problem, tokens flushed in this manner probably will not be moved to the revocation-list since it would be possible to then fill the revocation list.

Any comments/insight would be welcome. I should have the code (and additional tests) ready to be looked over in the next couple days (proposed against master; I'll handle the grizzly/stable version of this after the rest of the work is proven out).

Morgan Fainberg (mdrnstm) wrote :

The bug filing and approach is directly reflecting the conversation we had about this at the summit.

Fix proposed to branch: master
Review: https://review.openstack.org/27597

Changed in keystone:
status: New → In Progress
Dolph Mathews (dolph) on 2013-06-10
Changed in keystone:
importance: Undecided → High

Reviewed: https://review.openstack.org/31585
Committed: http://github.com/openstack/keystone/commit/c639c53176473aa9196b1e5a2b65991477b02cb3
Submitter: Jenkins
Branch: master

commit c639c53176473aa9196b1e5a2b65991477b02cb3
Author: Morgan Fainberg <email address hidden>
Date: Mon Jun 3 16:50:23 2013 -0700

    Fix token purging for memcache for user token index.

    When issuing a new token, purge all expired tokens from the user's
    token index list.

    New Options:
    * max_compare_and_set_retry:
            The number of retries that will be attempted when performing
            an update of the user_record or the revocation-list record.
            This is relevant due to the use of CAS (compare and set)
            function of the memcache client. This allows for multiple
            keystone processes/wsgi/etc to run without worry of race
            conditions clobbering the lists.

    DocImpact - New Options.

    Change-Id: I9441105b1e46982b0354bccbf8297daaaa1904b2
    Fixes: bug #1171985

Changed in keystone:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-07-17
Changed in keystone:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-10-17
Changed in keystone:
milestone: havana-2 → 2013.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers