Revocation event caching is broken across processes

Bug #1607553 reported by Lance Bragstad
This bug affects 2 people
Affects: OpenStack Identity (keystone)
Status: New
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

It seems the caching of revocation events is broken. I have a devstack stood up with fernet tokens enabled.

If I run tempest.api.identity.admin.v3.test_tokens.TokensV3TestJSON.test_tokens with revocation event caching and fernet enabled, the test fails consistently [0]. This is just one example of a failing test. Several of the tests in the tempest.api.identity suite fail with the same symptoms.

If I disable revocation event caching in the keystone.conf (CONF.revoke.caching=False), the test passes consistently.
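For reference, the equivalent keystone.conf stanza (using the standard oslo.config section/option mapping for CONF.revoke.caching) looks like this:

[revoke]
caching = False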

[0] http://cdn.pasteraw.com/gx49zwqv5stjmixqbcgoscexcn28w32

Tags: fernet revoke
Revision history for this message
Lance Bragstad (lbragstad) wrote :

I also have a feeling this has something to do with request-local caching in keystone, because I get different results when I run the tempest tests serially. In that case I usually get *fewer* failures than when I run them concurrently.

Revision history for this message
Lance Bragstad (lbragstad) wrote :

I attempted to make some progress on debugging this. To do so, I've added a number of logging statements in keystone to give me a little more information about the cache [0]. I have also set the following config values in my keystone.conf.

[DEFAULT]
default_log_levels = oslo_cache=DEBUG
[cache]
debug_cache_backend = true
[token]
provider = fernet

The default_log_levels setting is what tells us when things hit or miss the cache, as well as when specific keys are removed [1].

When I run the tempest test from the description above, the keystone.log gives me pretty verbose cache logging [2]. The test output also gives the token_id that failed the test [3]. From the test output you can get the token_id, and using the log output you can trace it to the appropriate cache key.

According to the logs, the token is cached twice: once by the _validate_token() method [4] and once by the validate_non_persistent_token() method [5].

The validate_non_persistent_token() key is: 4764d7ee10ba82a2933bb8b55609364fea7b9d1b
The _validate_token() key is: 21f8fb149ffc2c01b1d46d878f1a6117e195e14a
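As a hedged illustration of why a single token ends up under two different keys: keystone's caching goes through oslo.cache / dogpile.cache, and dogpile's default key generator folds the decorated function's name into the cache key before any key mangling is applied. A minimal, self-contained sketch (the in-memory region and return values are illustrative, not keystone's actual wiring):

from dogpile.cache import make_region
from dogpile.cache.util import sha1_mangle_key

# Illustrative in-memory region; keystone configures its regions through
# oslo.cache, typically backed by memcached.
region = make_region(key_mangler=sha1_mangle_key).configure(
    'dogpile.cache.memory')

@region.cache_on_arguments()
def _validate_token(token_id):
    # stand-in for the real provider method
    return {'token': token_id}

@region.cache_on_arguments()
def validate_non_persistent_token(token_id):
    return {'token': token_id}

# The default key generator includes the function name, so the same
# token_id is stored under two distinct sha1-mangled keys, one per
# decorated method, which lines up with the two keys listed above.
_validate_token('abc123')
validate_non_persistent_token('abc123')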

We can see that the cache invalidation is doing its job because both keys are deleted from the cache when the token is revoked:

CACHE_DELETE: "'4764d7ee10ba82a2933bb8b55609364fea7b9d1b'"
CACHE_DELETE: "'21f8fb149ffc2c01b1d46d878f1a6117e195e14a'"

Everything looks fine up to this point. The keys have been removed from the token cache and a revocation event should have been persisted. The next time the token from the test is validated, the validation will rebuild the entire authorization context from the database and then compare it to the list of revocation events. Revocation events are cached out of the box in keystone, and the revocation cache region should have been invalidated when a new revocation event was added to the list [6].

But this isn't the case. For some reason, the revocation API continues to return the old revocation list to the token provider API when validating the now-deleted token. I have a feeling this has something to do with keystone's approach to request-local caching. If I set [revoke]caching=False, the tests pass consistently every time. Devstack deploys keystone with 5 processes and 1 thread for both the public and admin endpoints. When I switch the process count to 1 and the threads to 1, the test passes consistently with revocation caching enabled.

I have a feeling that when multiple processes are running, the revocation cache isn't getting invalidated across all processes, which would be consistent with what Morgan stated [7].

[0] http://cdn.pasteraw.com/o35mn2ipac41f1hx68zz37alj3bwob
[1] https://github.com/openstack/oslo.cache/blob/d2a5e1d041c504ade2352e6449397645fb2449d3/oslo_cache/core.py#L65-L100
[2] http://cdn.pasteraw.com/ms750ntgl1nv00z6a16jfew5jft23ml
[3] http://cdn.pasteraw.com/67vnkdoo4n07f2xsqyruv6qlj9fi0xc
[4] https://github.com/openstack/keystone/blob/627eeeaef421867bf501f450cd207c3fb94d56ee/keystone/token/provider.py#L299-L300
[5] https://github.com/openstack/keystone/blob/627eeeaef421867bf501f450cd207c3fb94d56ee/keystone/token/provider.py#L296-L297
[6] https://github.com/openstack/keystone/blob/627eeeaef4...
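To make the suspected failure mode concrete, here is a minimal sketch, under the assumption that the revoke region behaves like a stock dogpile.cache region (names are illustrative, not keystone's actual code):

from dogpile.cache import make_region

# Each keystone worker process constructs its own CacheRegion object,
# even when all of them share the same memcached backend.
REGION = make_region().configure('dogpile.cache.memory')

REVOCATION_EVENTS = []

@REGION.cache_on_arguments()
def list_events():
    # expensive database query in real keystone; cached in the revoke region
    return list(REVOCATION_EVENTS)

def revoke(event):
    REVOCATION_EVENTS.append(event)
    # dogpile's default invalidation strategy only records a timestamp on
    # *this* process's region object; nothing is written to the shared
    # backend, so the other worker processes never see the invalidation
    # and keep serving their previously cached (stale) revocation list.
    REGION.invalidate()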


Changed in keystone:
importance: Undecided → High
milestone: none → newton-3
Revision history for this message
Lance Bragstad (lbragstad) wrote :

From the debugging that I have done and the comments Morgan has made in the code, it looks like we will have to be able to invalidate cache regions across processes before we can make fernet the default, or else disable revocation event caching until that happens.
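One possible shape for such cross-process invalidation (a sketch only, not necessarily what any proposed patch does) is a custom dogpile.cache RegionInvalidationStrategy that persists the invalidation timestamp in the shared backend instead of in process-local memory:

import time

from dogpile.cache import make_region
from dogpile.cache.region import RegionInvalidationStrategy

class SharedBackendInvalidation(RegionInvalidationStrategy):
    """Store the region's invalidation time where every process can see it.

    timestamp_region is a small helper region sharing the same backend
    (memcached in a typical deployment) but never itself invalidated,
    which avoids any circularity.
    """

    def __init__(self, timestamp_region, key='revoke_invalidated_at'):
        self._region = timestamp_region
        self._key = key

    def invalidate(self, hard=True):
        # visible to every worker that shares the backend
        self._region.set(self._key, time.time())

    def _last_invalidated(self):
        value = self._region.get(self._key)
        return value or None   # region.get() returns a falsy NO_VALUE sentinel

    def is_invalidated(self, timestamp):
        last = self._last_invalidated()
        return last is not None and timestamp < last

    # Only "hard" invalidation is modelled in this sketch.
    def is_hard_invalidated(self, timestamp):
        return self.is_invalidated(timestamp)

    def was_hard_invalidated(self):
        return self._last_invalidated() is not None

    def is_soft_invalidated(self, timestamp):
        return False

    def was_soft_invalidated(self):
        return False

# Wiring sketch: in practice both regions would point at memcached.
timestamps = make_region().configure('dogpile.cache.memory')
REVOKE_REGION = make_region().configure(
    'dogpile.cache.memory',
    region_invalidator=SharedBackendInvalidation(timestamps))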

summary: - Revocation event caching is broken
+ Revocation event caching is broken across processes
Revision history for this message
David Stanek (dstanek) wrote :

I think this is a duplicate of 1590779

Revision history for this message
Lance Bragstad (lbragstad) wrote :

Looks like Boris has a patch up for fixing the region invalidation across processes [0]. For comparison: if I run tempest.api.identity against a devstack'd keystone (at commit 8a669faba), I get an average of 24 failures out of 187 total tests run.

With Boris' patch (as of patch set 13), I get an average of 5 failures per test run. With the patch, it does seem that more tests pass as the revocation table fills up (which slows down the revocation API).

The failures are still race condition failures; Boris' patch just reduces how often they occur.

[0] https://review.openstack.org/#/c/327885/13

Changed in keystone:
milestone: newton-3 → none