token revocations not always respected when using fernet tokens

Bug #1484237 reported by Matt Fischer
Affects                        Status        Importance  Assigned to        Milestone
OpenStack Identity (keystone)  Fix Released  Medium      werner mendizabal
  Kilo                         Won't Fix     Medium      Dolph Mathews
OpenStack Security Advisory    Won't Fix     Undecided   Unassigned

Bug Description

A simple test shows that Fernet tokens are not always being invalidated.

Simple test steps:

1) get a token
2) delete the token
3) try to validate the deleted token

When I run this in production against 10 tokens, the token is detected as invalid only about 20% of the time; the other 80% of the time, keystone tells me the token is valid.

I have verified that the token appears in the revocation event table.

I've tried a 5-second delay between the calls, which did not change the behavior.

My current script (below) looks for a 204 and a 404 to show failure and will wait forever. I've let it wait over 5 minutes; it seems that keystone either knows immediately that the token is invalid, or never does.

I do not have memcache enabled on these nodes.

The same test has a 100% pass rate with UUID tokens.
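
For reference, a minimal sketch of the test described above (this is not the original script), using python-requests against the v3 API; the endpoint, credentials, and domain values are placeholders:

    # Minimal sketch of the get/delete/validate test (not the original script).
    # The endpoint, credentials, and domain below are placeholders.
    import time
    import requests

    AUTH_URL = "http://localhost:35357/v3"

    def get_token(user="admin", password="admin", project="admin"):
        """Issue a scoped token and return the X-Subject-Token header."""
        body = {
            "auth": {
                "identity": {
                    "methods": ["password"],
                    "password": {"user": {"name": user,
                                          "domain": {"id": "default"},
                                          "password": password}},
                },
                "scope": {"project": {"name": project,
                                      "domain": {"id": "default"}}},
            }
        }
        resp = requests.post(AUTH_URL + "/auth/tokens", json=body)
        resp.raise_for_status()
        return resp.headers["X-Subject-Token"]

    admin_token = get_token()
    doomed_token = get_token()

    # Revoke the second token.
    requests.delete(AUTH_URL + "/auth/tokens",
                    headers={"X-Auth-Token": admin_token,
                             "X-Subject-Token": doomed_token})

    # Poll validation until keystone returns 404 for the revoked token.
    start = time.time()
    while True:
        resp = requests.get(AUTH_URL + "/auth/tokens",
                            headers={"X-Auth-Token": admin_token,
                                     "X-Subject-Token": doomed_token})
        if resp.status_code == 404:
            print("invalid after %.1f seconds" % (time.time() - start))
            break
        print("still valid (%d), waiting..." % resp.status_code)
        time.sleep(1)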

Revision history for this message
Matt Fischer (mfisch) wrote :
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

Changed in ossa:
status: New → Incomplete
Revision history for this message
Matt Fischer (mfisch) wrote :

Lance Bragstad from the keystone team has been working on this with me; I'm adding him to the bug.

Revision history for this message
Lance Bragstad (lbragstad) wrote :

Here is the script (similar to Matt's) that I am using to test/recreate this; I changed it to work with v3 and openstackclient:

http://cdn.pasteraw.com/2yw4almzgsblfzaku85ia1nnzrhp3jw

Here is the output I get when I run the script:

http://cdn.pasteraw.com/nl296l0pyrz0la73fp4v0njgtmjrpjb

While the script is trying to validate a "deleted" token, I can see the validate calls in the Keystone logs, when they should be returning a 404:

http://cdn.pasteraw.com/fwvzodl38r6kkzc0o8ih554hg5j7rx6

Revision history for this message
Lance Bragstad (lbragstad) wrote :

After a little more investigation, I seem to be hitting something with revocation events.

When the script executes as expected, I get a token from Keystone, delete the token, the GET on that token returns a 404 Not Found, and a new revocation event is added to the revocation_event table.

When the script doesn't execute as expected and the deleted token is still validated by Keystone, no revocation events are added to the Keystone table. I think this probably has more to do with revocation events not working properly than with Fernet, but since Fernet tokens have to rely on the revocation API, it directly affects them.

This will require a deeper investigation of revocation events.
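
As a side note, one way to check whether a revocation event actually landed in the database after the DELETE call is a direct query; the table and column names below assume keystone's SQL revoke backend (a "revocation_event" table with a "revoked_at" column) and PyMySQL as the client, so adjust for your deployment:

    # Quick check of the revocation_event table (assumes keystone's SQL revoke
    # backend and PyMySQL; any MySQL client would do).
    import pymysql

    conn = pymysql.connect(host="localhost", user="keystone",
                           password="KEYSTONE_DBPASS", database="keystone")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*), MAX(revoked_at) FROM revocation_event")
            count, latest = cur.fetchone()
            print("revocation events: %s, latest revoked_at: %s" % (count, latest))
    finally:
        conn.close()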

Revision history for this message
Lance Bragstad (lbragstad) wrote :

FYI, these are the steps I've taken to recreate this on the latest master:

1.) start keystone locally
2.) curl http://cdn.pasteraw.com/2yw4almzgsblfzaku85ia1nnzrhp3jw > mfisch.sh
3.) source openrc # vars for openstack client
4.) bash mfisch.sh 10

Revision history for this message
Jeremy Stanley (fungi) wrote :

Are you able to (re)produce this behavior in stable/kilo?

Revision history for this message
Lance Bragstad (lbragstad) wrote :

Yes, I am able to recreate this on Kilo (commit: 220eed87b9e28294f4c1f2b482db420fbe20dc7e), which I believe is the latest commit on the stable/kilo branch.

Revision history for this message
Dolph Mathews (dolph) wrote :

Using Lance's script, I'm not able to reproduce this against either the current stable/kilo or master when backed by sqlite and eventlet. The only change I made to keystone's config is switching the token provider to Fernet. The output spins like this, succeeding every time:

  getting & revoking 1 tokens
  admin token: ADMIN
  token to revoke: gAAAAABVzj9EpQ6I5U8ZDnixVz4ao8_FPGMDRlO5zRiaNLv2OYjXDbvQVuu2b7OFaqSVS1KFOmJBv6W25DRB3vYHZVeyaRh1A7t1vVtHhvp2M1DS9OQjyiYrkBYxXgwMX2IBhCpvRLtUxFF-E5G7ThkmkoPpjI-8RsB0yOuNPcMZVXrsFwQKce8%3D
  trying for 0 seconds...
  admin token: ADMIN
  token to revoke: gAAAAABVzj9EpQ6I5U8ZDnixVz4ao8_FPGMDRlO5zRiaNLv2OYjXDbvQVuu2b7OFaqSVS1KFOmJBv6W25DRB3vYHZVeyaRh1A7t1vVtHhvp2M1DS9OQjyiYrkBYxXgwMX2IBhCpvRLtUxFF-E5G7ThkmkoPpjI-8RsB0yOuNPcMZVXrsFwQKce8%3D
  finally invalid after 0 seconds

I'm going to work with Lance to see if there's something different about my setup.

Revision history for this message
Lance Bragstad (lbragstad) wrote :

For reference, I made some more revisions to the script I was using (based on Matt's) to recreate this:

http://cdn.pasteraw.com/3cepd7tpr87rvlf4i38sgh840o875zx

Revision history for this message
Dolph Mathews (dolph) wrote :

I'm not able to reproduce against stable/kilo with the script in comment #10 either (unmodified). Also, for reference:

$ env | grep OS_
OS_PROJECT_NAME=admin
OS_IDENTITY_API_VERSION=3
OS_PASSWORD=admin
OS_AUTH_URL=http://localhost:35357/v3/
OS_USERNAME=admin

Revision history for this message
Matt Fischer (mfisch) wrote :

My setup is more complex:

* multi-node, cross-region galera cluster with 6 nodes and an arbitrator (although I can repro in a 3-node, one-region cluster too)
* 3 node keystone cluster
* wsgi with 32 procs and 128 threads

I'm also not using stable/kilo or a recent master, so I will go with what you and Lance can repro.

Revision history for this message
Matt Fischer (mfisch) wrote :

I should have added that I'm using a round-robin-balanced keystone cluster, fronted by a hardware LB for calls on port 5000 and by haproxy for port 35357.

Revision history for this message
Dolph Mathews (dolph) wrote :

I was able to successfully reproduce on both stable/kilo and master by switching from sqlite to mysql-server + mysql-python (which means this is going to be challenging to write tests against inside keystone).

Dolph Mathews (dolph)
Changed in keystone:
status: New → Confirmed
Revision history for this message
Jeremy Stanley (fungi) wrote :

It would help the VMT to have some sort of exploit scenario as a basis for the decision of whether we'll need a security advisory and whether this needs to be solved under embargo. Can someone explain at least one way this issue would be leveraged by a would-be attacker to gain some advantage?

Revision history for this message
Matt Fischer (mfisch) wrote :

I've been thinking about the implications here. I've been unable to make it fail after removing the project or user; if that is confirmed, then the risk here is that an operator revokes a single token but doesn't remove the user, who could then generate another token. This operation doesn't make much sense to me as an operator; I'd probably disable or delete a rogue user rather than just revoke their tokens.

One other thing to consider is that Horizon revokes a token on logout. If there is a way to get the token out of the browser cache then that would be an avenue of attack. Think of someone signing in to openstack on a public computer and logging out only to leave a usable token behind. I don't know enough about Horizon to tell if that's possible.

I also did not test what happens if the token is revoked and the user is simply disabled, not deleted. That should also be tested. I think most operators would disable a user first until the resources could be cleaned up.

Someone from the Keystone team should probably weigh in here too.

Revision history for this message
Dolph Mathews (dolph) wrote :

Jeremy: As Matt suggested, this doesn't seem to present an attack vector itself, but certainly might keep an attacker's window open once they can get ahold of a valid token.

Once we identify the source of the problem, it'll be easier to analyze the scope of impact. I'd prefer to keep this private until then.

Revision history for this message
Dolph Mathews (dolph) wrote :

Although the bug below is specific to PKI, I'm wondering if these two issues might end up being related?

  https://bugs.launchpad.net/keystone/+bug/1464377

Revision history for this message
Dolph Mathews (dolph) wrote :

This was a real heisenbug! Jumping into a debugger made the issue unreproducible, so that was fun.

Fortunately, this only appears to be reproducible under a fairly synthetic workload. That is, if you introduce even a half second delay (either client or server side) between the time the token is created and the time it's deleted, token revocations work as expected.

On SQLite, timestamps are stored in the revocation event table using microsecond-level precision. Whereas on MySQL, timestamps are stored in the revocation event table using second-level precision.

In Fernet, the token's creation date is persisted using second-level precision. Whereas all other token formats (UUID, PKI, and PKIz) persist the creation date using microsecond-level precision.

Because both Fernet and MySQL persist the creation date using second-level precision, it's *much* more likely that you'll get an exact match between the token's creation date and the revocation event's "issued_before" attribute. By introducing a short delay between the two being created, you guarantee that the clock will tick over to the next whole second so that they won't match.

This is entirely possible with UUID, PKI, and PKIz on SQLite, but I don't think anyone has a system fast enough to do all the necessary work within a single microsecond (... right?).
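
To make the collision concrete, here is a small illustration (not keystone code) of how truncating to whole seconds makes the token's issued_at exactly match the revocation event's issued_before:

    # Illustration of the precision collision described above (not keystone code).
    import datetime

    token_issued_at = datetime.datetime.utcnow()
    event_issued_before = datetime.datetime.utcnow()  # a few microseconds later

    # Microsecond precision (UUID/PKI/PKIz + SQLite): strictly ordered, so a
    # "revoke if issued strictly before the event" check catches the token.
    print(token_issued_at < event_issued_before)   # True (almost always)

    # Second-level precision (Fernet + MySQL): when both calls land within the
    # same second, the truncated values are equal and a strict comparison no
    # longer matches.
    token_s = token_issued_at.replace(microsecond=0)
    event_s = event_issued_before.replace(microsecond=0)
    print(token_s < event_s)    # False
    print(token_s == event_s)   # True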

Further, keystone doesn't use MySQL in its unit test suite, so I haven't included a test in the proposed patch because it would happily pass against SQLite even without the fix.

Changed in keystone:
importance: Undecided → Medium
Revision history for this message
Matt Fischer (mfisch) wrote :

Fix verified in my dev cluster.

Revision history for this message
Matt Fischer (mfisch) wrote :

I've also verified that even a delay of 1 second in my test shell script will cause this not to occur. That means the attack window here is literally less than a second, and is probably 100% artificial: an attacker would have to get a token, and an admin would have to notice and revoke it, all in less than a second, for the token revoke to fail in this manner.

Revision history for this message
Matt Fischer (mfisch) wrote :

Given the above, I think that this is artificial enough to be opened up and not private anymore.

Revision history for this message
Dolph Mathews (dolph) wrote :

+1 for comment #22. The possibility that this could serve as a useful attack vector seems incredibly small, although it is certainly a security issue by nature.

Revision history for this message
Jeremy Stanley (fungi) wrote :

I agree this seems like a very impractical/unlikely vulnerability in any real-world deployment, so class C1 in our report taxonomy: https://security.openstack.org/vmt-process.html#incident-report-taxonomy

information type: Private Security → Public
tags: added: security
Changed in ossa:
status: Incomplete → Won't Fix
Dolph Mathews (dolph)
Changed in keystone:
assignee: nobody → Dolph Mathews (dolph)
Dolph Mathews (dolph)
tags: added: kilo-backport-potential
Changed in keystone:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to keystone (master)

Fix proposed to branch: master
Review: https://review.openstack.org/220259

Changed in keystone:
assignee: Dolph Mathews (dolph) → werner mendizabal (nonameentername)
Revision history for this message
Dolph Mathews (dolph) wrote :

[Thu 12:37] <morgan> lbragstad: ah. then can we just move towards everything being not subsecond?
[Thu 12:37] <morgan> audit_ids mean subsecond isn't needed
[Thu 12:37] <morgan> the whole reason for subsecond was to make PKI tokens "unique"
[Thu 12:37] <morgan> so they didn't hash to the same value

So, as an alternative solution to this bug, I think we can opt to ensure that no one expects subsecond precision from revocation events.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to keystone (master)

Reviewed: https://review.openstack.org/216236
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=9450cd9699c002adcdb8f64c95ffa2c002717568
Submitter: Jenkins
Branch: master

commit 9450cd9699c002adcdb8f64c95ffa2c002717568
Author: Dolph Mathews <email address hidden>
Date: Fri Aug 21 18:38:26 2015 +0000

    Handle tokens created and quickly revoked with insufficient timestamp precision

    In the event that the revocation event is created at the exact same
    timestamp as the token's creation timestamp, the event's issued_before
    will equal the token's issued_at and will thus not be revoked (according
    to the current code).

    This is much more likely to occur when a token's issue_at timestamp is
    rounded to whole seconds (rather than carrying microsecond level
    precision), as they are with Fernet and MySQL.

    Change-Id: If1f5e546463f189a0b487140a620def545006c25
    Closes-Bug: 1484237
    Related-Bug: 1488208
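
For illustration only (keystone's actual change is in the commit above), the comparison described in the message boils down to this: with a strict less-than check, an event whose issued_before exactly equals the token's issued_at does not revoke the token, while treating the equal-timestamp case as revoked closes the gap:

    # Illustrative sketch only; see the commit above for keystone's real change.
    import datetime

    def revoked_strict(token_issued_at, event_issued_before):
        # Behavior described as "the current code" in the commit message:
        # revoke only tokens issued strictly before the event.
        return token_issued_at < event_issued_before

    def revoked_inclusive(token_issued_at, event_issued_before):
        # One way to handle the equal-timestamp case that second-level
        # precision produces (not necessarily the approach the patch took).
        return token_issued_at <= event_issued_before

    t = datetime.datetime(2015, 8, 21, 18, 38, 26)  # whole-second timestamps
    print(revoked_strict(t, t))      # False -- token survives its revocation
    print(revoked_inclusive(t, t))   # True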

Changed in keystone:
status: In Progress → Fix Committed
Dolph Mathews (dolph)
tags: removed: kilo-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to keystone (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/222728

Thierry Carrez (ttx)
Changed in keystone:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in keystone:
milestone: liberty-rc1 → 8.0.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on keystone (stable/kilo)

Change abandoned by Steve Martinelli (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/222728
Reason: abandoning since it hasn't moved since september

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on keystone (master)

Change abandoned by Steve Martinelli (<email address hidden>) on branch: master
Review: https://review.openstack.org/220259
Reason: no change in >3 months, and the original bug is now fixed in liberty, closing this change up

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

Kilo is EOL
