tokens in memcache have no/improper expiration

Bug #1578401 reported by Matt Fischer
This bug affects 3 people
Affects                        Status     Importance  Assigned to  Milestone
OpenStack Identity (keystone)  Opinion    Low         Unassigned
oslo.cache                     Confirmed  Wishlist    Unassigned

Bug Description

Tokens stored in memcache have no (or an improper) expiration set when they are written.

Found on stable/mitaka and stable/liberty, using both the cache-pool backend and the non-pooled backend.

When you store a value in memcache you can optionally pass in a TTL, a time after which the value is no longer valid. Keystone should be doing this for its local token cache, but it doesn't look like it is.
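
For illustration, here's a minimal sketch of what passing a TTL looks like with the python-memcached client (the key name and payload below are made up):

import memcache  # python-memcached

client = memcache.Client(['127.0.0.1:11211'])
token_data = '<serialized token payload>'  # placeholder value

# The 'time' argument is the expiration in seconds; 0 (the default) means
# "never expire", which is what the keystone token cache appears to be
# relying on today.
client.set('token-cache-key', token_data, time=7200)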

Here is a dump of some slabs that have tokens in them, last field is expiration time:

stats cachedump 11 10
ITEM 8302fc81f2ffb552d5ba8d3e5f0e182ee285786a [724 b; 1460583765 s]
ITEM 2ffe5d0821302a8501068a8411ce1749cea0645b [776 b; 1460583765 s]
ITEM eb6e6f7e9118133a9a98944da874ac1b59c5675b [724 b; 1460583765 s]
ITEM ee076b853dd5e5956366854abf6c49dbdf5ee4c2 [723 b; 1460583765 s]

Let's see if these are really tokens:

get 8302fc81f2ffb552d5ba8d3e5f0e182ee285786a
VALUE 8302fc81f2ffb552d5ba8d3e5f0e182ee285786a 1 724
cdogpile.cache.api
CachedValue
p0
((dp1
S'access'
p2
(dp3
S'token'
p4
(dp5
S'issued_at'
p6
S'2016-05-04T21:20:27.000000Z'
p7
sS'expires'
p8
S'2016-05-04T23:20:27.146911Z'
p9
sS'id'
p10
V<fernet token value here>

Yep, that's a Fernet token.

Dumping older and older stuff, I can find cached tokens that are 24 hours old in here, 22 hours past our valid token deadline.

So let's compare that to some tokens that the keystone_authtoken middleware is caching for control services:

stats cachedump 21 100
ITEM tokens/418b2c5a0e67d022b0578fbc4c96abf4a4509e94aca4ca1595167f8f01448957 [8463 b; 1462397763 s]
ITEM tokens/2b5a26e3bdf4dec0caae141353297f0316b55daf683b4bc0fcd1ab7bf4ba8f9b [8312 b; 1462397539 s]
ITEM tokens/778329eb53545cbd664fa67e6429f48692679f428077b48baa4991f13cc1817c [8312 b; 1462397538 s]
ITEM tokens/b80b06cf688c37f8688c368a983c2fd533c662b7b3063c6a2665c59def958cdd [8312 b; 1462397537 s]
ITEM tokens/61cd52b0654641a21c62831f6e5be9f0328898d05026d6bb91c787d79cb8b460 [8463 b; 1462397536 s]

All have valid and different expiration times so it's respecting my settings.

So what's that timestamp in the earlier list? Well, it's 4/13/2016, 3:42:45 PM GMT-6:00 DST. That happens to be the last time memcached restarted, so I assume it's just filler.
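
For reference, converting that epoch value in Python gives the same date:

>>> import datetime
>>> datetime.datetime.utcfromtimestamp(1460583765)
datetime.datetime(2016, 4, 13, 21, 42, 45)   # 21:42:45 UTC == 3:42:45 PM GMT-6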

What's the impact?

I haven't figured out for sure whether there is one yet. I have a token validity of 2 hours and had set the cache time to the same. I did try to dig out an old token, but it would not validate, so I don't think there's a security issue. I suspect the main issue is that my keystone memcache always runs 100% full. We cap memcache at 20% of RAM on a box, and that's a lot (20% of 256G). With no valid expiration time, I suspect memcache is lazily evicting old tokens only when it runs out of RAM, rather than replacing expired entries before allocating more RAM.

[PROD] mfisch@east-keystone-001:~$ cat /proc/3937/status
Name: memcached
State: S (sleeping)
Tgid: 3937
Ngid: 3937
Pid: 3937
PPid: 1
TracerPid: 0
Uid: 65534 65534 65534 65534
Gid: 65534 65534 65534 65534
FDSize: 1024
Groups: 0
VmPeak: 54500552 kB
VmSize: 54500552 kB <-- that's a lot of twinkies

I feel this merits deeper investigation.

Tags: caching
Dolph Mathews (dolph)
tags: added: fernet
Revision history for this message
David Stanek (dstanek) wrote :

I'll look into this a little deeper, but from the information provided it doesn't sound like there is an issue. The 'stats cachedump' command will often, and maybe even likely, show things that are expired. memcache won't actually delete an item from memory until you access it again and it decides that it's expired, or until it's trying to make room for new entries. In the new-entries case I'm not sure if you'd see it as an eviction, because the data wasn't forced out.

You can tell pretty easily if an item has no expiration: its expiration will be equal to 'time - uptime', where time and uptime can be found in the stats output.
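
To make that concrete, a small sketch of the check (the 'time' and 'uptime' values below are illustrative, not taken from this server's stats output, but they are consistent with the cachedump above):

# Example values as reported by the memcached 'stats' command
server_time = 1462397000   # 'time' stat: current unix time on the server
uptime = 1813235           # 'uptime' stat: seconds since the server started

# Items stored with no expiration show up in 'stats cachedump' with an
# expiration equal to the server start time:
no_expiration_marker = server_time - uptime
print(no_expiration_marker)   # 1460583765, matching the token entries above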

Revision history for this message
David Stanek (dstanek) wrote :

Also note that memcached is greedy and will use up all of the memory you give it (or at least appear that way to unix tools).

Revision history for this message
David Stanek (dstanek) wrote :

Matt and I talked offline about this and we'll work together next week on it.

Changed in keystone:
status: New → Incomplete
David Stanek (dstanek)
Changed in keystone:
assignee: nobody → David Stanek (dstanek)
Changed in keystone:
assignee: David Stanek (dstanek) → nobody
David Stanek (dstanek)
Changed in keystone:
assignee: nobody → David Stanek (dstanek)
Revision history for this message
David Stanek (dstanek) wrote :

This behavior is actually expected based on our current implementation. keystonemiddleware is using the memcached library directly and setting timeouts. keystone uses dogpile and sets a dogpile expiration, not a memcached expiration. This means that the data will expire as expected from keystone's point of view, but it won't be expired as far as memcached is concerned.

Other than appearing to be fuller than it actually is, I don't think there is too much impact. Even when a memcached timeout exists on an entry, it will not actually be deleted from memory by memcached until it is accessed again or memcached needs the memory.

I've created a POC change in oslo.cache that "fixes" this problem.
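
To illustrate the difference, here's a rough sketch of how a dogpile-based cache region is set up (the key and value are made up for the example). The expiration_time is only checked by dogpile when the value is read back; nothing is passed to memcached as a TTL:

from dogpile.cache import make_region

region = make_region().configure(
    'dogpile.cache.memcached',
    expiration_time=7200,   # enforced by dogpile on read, not by memcached
    arguments={'url': ['127.0.0.1:11211']},
)

# The CachedValue is written to memcached with no TTL, so memcached keeps it
# around until it needs the memory, even long after dogpile considers it stale.
region.set('token-cache-key', {'token': {'expires': '2016-05-04T23:20:27Z'}})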

Changed in keystone:
status: Incomplete → Confirmed
importance: Undecided → Low
tags: added: caching
removed: fernet
Revision history for this message
David Stanek (dstanek) wrote :

If this is worth having we could bring back this review: https://review.openstack.org/#/c/350942

I don't really think this is a bug, but rather a misunderstanding since we do caching differently between components.

Revision history for this message
Lance Bragstad (lbragstad) wrote :

Automatically unassigning due to inactivity.

Changed in keystone:
assignee: David Stanek (dstanek) → nobody
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

This is not a bug. Dogpile.cache can set an actual memcache expiration; however, we have simply been leaning on the LRU capabilities of memcache. For keystone this would need to be changed in oslo.cache, not in Keystone.

I am marking this as Opinion; it's not really a bug, it is how we have implemented the cache. We can revisit this as an RFE for oslo.cache if there is a desire.
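
For context, a sketch of the knob dogpile.cache already exposes for this: its memcached backends accept a memcached_expire_time argument that is passed to memcached as a real TTL, independently of dogpile's own expiration_time (values here are illustrative):

from dogpile.cache import make_region

region = make_region().configure(
    'dogpile.cache.memcached',
    expiration_time=7200,   # dogpile-level staleness check on read
    arguments={
        'url': ['127.0.0.1:11211'],
        'memcached_expire_time': 7500,   # actual TTL handed to memcached
    },
)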

Changed in keystone:
status: Confirmed → Opinion
Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

I'd like to re-start the discussion about this bug.

What this combination currently leads to is, effectively, false-positive alerts from memcached monitoring.

Monitoring memcached for forced evictions is a well-established practice for getting alerted when you need to increase the cache size, redistribute it between slabs, or otherwise check for abnormal cache usage.
But this of course relies on cache users not dumping data into the cache to keep it "forever". AFAIU the memcached server has no periodic garbage collection, so even with a TTL set, memcached will only delete the data if either a client tries to access the expired key, or it needs to make room for new data, in which case it won't count that as a forced eviction.
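
As a sketch of the kind of check that fires here (assuming the python-memcached client; the alert handling is made up):

import memcache

client = memcache.Client(['127.0.0.1:11211'])
# get_stats() returns a list of (server, stats_dict) pairs
stats = client.get_stats()[0][1]

# 'evictions' counts entries that memcached had to force out to make room
# for new data while it still considered them unexpired.
if int(stats['evictions']) > 0:
    print('memcached is force-evicting entries; the cache may be undersized')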

Here, however, keystone keeps storing token creation/validation data that is never deleted from memcached by dogpile itself (at some point nothing asks for that potentially expired info any more, and dogpile has no periodic GC either, which would be basically impossible in a wsgi app anyway), and it has no TTL in memcached as well. So when memcached tries to make room for new data, it deletes those entries and fires forced-eviction alerts.

I'd like oslo.cache maintainers to please restore the patch https://review.opendev.org/c/openstack/oslo.cache/+/350942 and I will take it over.

Revision history for this message
Daniel Bengtsson (damani42) wrote :

Hi,

Do you have any update? Is this still relevant?

Revision history for this message
Takashi Kajinami (kajinamit) wrote :

Re-reading the existing implementation, I think this is a valid improvement for the memcache driver and also the redis driver, neither of which currently uses the expiration mechanism of the backend service.

I'll look into this during this cycle, but one thing I noticed is that the solution proposed earlier breaks the case where a specific cache_time option is configured with a very large value. We probably have to add a new option to control expiration in the backends.
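
For illustration of the large-value problem (my reading of it, not stated in the comment above): memcached itself interprets any relative expiration longer than 30 days as an absolute Unix timestamp, so passing a very large cache_time straight through as the backend TTL would make entries expire immediately:

import memcache

client = memcache.Client(['127.0.0.1:11211'])

ONE_YEAR = 365 * 24 * 3600   # 31536000 s, well past memcached's 30-day limit

# memcached treats expirations above 60*60*24*30 seconds as an absolute Unix
# timestamp, so 31536000 is read as a date in 1971 and the entry is already
# expired the moment it is written.
client.set('token-cache-key', '<payload>', time=ONE_YEAR)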

Changed in oslo.cache:
status: New → Confirmed
importance: Undecided → Wishlist