Zero-downtime upgrades lead to spurious token validation failures when caching is enabled

Bug #1833085 reported by Sebastian Riese
This bug affects 1 person
Affects: OpenStack Identity (keystone)
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

When performing keystone's zero-downtime upgrade routine with caching enabled, we observe validation failures for valid tokens.

The problem is that when both running versions share a cache, both may cache validated tokens, but the type of the cached token changed between the two versions (and the cache pickles the objects-to-cache directly). In Queens it is a dict; in Rocky it is a dedicated type, `TokenModel`. This causes exceptions when tokens are loaded from the cache and do not have the expected attributes.
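
For illustration, here is a minimal, self-contained sketch of the failure mode (the cache, the key format, and the `TokenModel` stand-in below are hypothetical simplifications, not keystone's actual code):

```python
import pickle

# Stand-in for the shared memcached keyspace: both keystone versions
# read and write the same keys.
shared_cache = {}


class TokenModel(object):
    """Hypothetical stand-in for Rocky's dedicated token type."""
    def __init__(self, audit_id, user_id):
        self.audit_id = audit_id
        self.user_id = user_id


def queens_validate_token(token_id):
    """Queens-era behaviour: cache the validated token as a plain dict."""
    key = "validate_token|%s" % token_id
    if key not in shared_cache:
        shared_cache[key] = pickle.dumps({"audit_id": "abc", "user_id": "u1"})
    return pickle.loads(shared_cache[key])


def rocky_validate_token(token_id):
    """Rocky-era behaviour: expects a TokenModel to come back from the cache."""
    key = "validate_token|%s" % token_id
    if key not in shared_cache:
        shared_cache[key] = pickle.dumps(TokenModel("abc", "u1"))
    return pickle.loads(shared_cache[key])


# A Queens node validates (and caches) the token first ...
queens_validate_token("tok-1")

# ... then a Rocky node validates the same token and gets the cached dict back.
token = rocky_validate_token("tok-1")
token.audit_id  # AttributeError: 'dict' object has no attribute 'audit_id'
```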

The offending code is
<https://github.com/openstack/keystone/blob/stable/queens/keystone/token/provider.py#L165>
vs.
<https://github.com/openstack/keystone/blob/stable/rocky/keystone/token/provider.py#L150>
The `@MEMOIZE_TOKEN` decorator serializes the tokens into the cache; both versions use the same keyspace, but the type of the cached objects has changed.
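
Roughly speaking, the memoization behaves like a dogpile.cache `cache_on_arguments` decorator (keystone's caching layer is built on oslo.cache, which wraps dogpile.cache): the cache key is derived from the decorated function and its arguments, not from the release, so both versions hit the same entries. A rough sketch, assuming a shared backend and using dogpile.cache directly:

```python
from dogpile.cache import make_region

# Both "releases" point their cache region at the same backing store,
# just as two keystone versions do when they share one memcached cluster.
shared_store = {}

queens_region = make_region().configure(
    'dogpile.cache.memory_pickle', arguments={'cache_dict': shared_store})
rocky_region = make_region().configure(
    'dogpile.cache.memory_pickle', arguments={'cache_dict': shared_store})


class TokenModel(object):
    """Hypothetical stand-in for the Rocky token type."""


@queens_region.cache_on_arguments()
def validate_token(token_id):
    # Queens-era return type: a plain dict.
    return {'user_id': 'u1'}


# The "Queens" process validates the token, storing a pickled dict under a
# key derived from the function's module, name, and arguments.
validate_token('tok-1')


@rocky_region.cache_on_arguments()
def validate_token(token_id):  # noqa: F811 -- same name, hence same cache key
    # Rocky-era return type: a dedicated object.
    return TokenModel()


# The "Rocky" process computes the identical key, finds the Queens entry,
# and hands back a dict where its callers expect a TokenModel.
cached = validate_token('tok-1')
print(type(cached))  # <class 'dict'>, not TokenModel
```

Anything that keeps the two releases out of each other's keyspace (a separate cache cluster, or flushing the cache) avoids the collision.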

Disabling caching (by setting `[caching] enabled = false` in the config) or disabling all but one keystone instance fixes the problem (of course, disabling all but one keystone instance defeats the whole purpose of a zero-downtime upgrade; this was only done to confirm the cause of the issue).

This issue and the possible workaround (disabling the cache) should at least be documented. If it is safe to run the instances with separate caches (per instance or per version), that may be a workaround with less of a performance impact, but I am not sure whether it would be safe with respect to token invalidation.

My understanding is that on token revocation the keystone instance handling the API request invalidates the cache entry and adds the revocation event to the database. So if the token was already stored as validated in the other cache, the keystone services using that cache would continue to accept it as valid. With a load balancer in front of the keystone instances, the revoked token would therefore sometimes validate.
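
To make that concern concrete, a toy sketch of the split-cache scenario (all names hypothetical; TTLs and the revocation-event table are omitted for brevity):

```python
# Each keystone node keeps its own validated-token cache, and a revocation
# only invalidates the cache of the node that handled the revoke call.

cache_a = {}   # cache used by keystone node A
cache_b = {}   # cache used by keystone node B


def validate(cache, token_id):
    # Cache a successful validation (expiration ignored for brevity).
    cache.setdefault(token_id, 'valid')
    return cache[token_id]


def revoke(cache, token_id):
    # The node handling the revocation invalidates only its own cache
    # (and would write a revocation event to the shared database).
    cache.pop(token_id, None)


# Both nodes have validated (and cached) the token.
validate(cache_a, 'tok-1')
validate(cache_b, 'tok-1')

# The revoke request lands on node A.
revoke(cache_a, 'tok-1')

# A later validation load-balanced to node B still hits B's cache entry
# and reports the revoked token as valid until that entry expires.
print(validate(cache_b, 'tok-1'))  # 'valid'
```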

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

The easiest/best solution (after talking it through on IRC) is to document using a new cache cluster for the upgraded side. This has the downside of cache mismatch (everything should be pointed at the new / upgraded side).

I think this is best handled with a documentation change covering the dual cache setup.

Changed in keystone:
status: New → Confirmed
status: Confirmed → Triaged
importance: Undecided → High
tags: added: caching documentation