Race condition between token validation and revocation API causes intermittent gate failures.

Bug #1578866 reported by Steve Martinelli on 2016-05-05
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Identity (keystone)
High
Lance Bragstad

Bug Description

test_user_update_own_password is failing intermittently on a variety of jobs

stack trace:
Traceback (most recent call last):
  File "tempest/api/identity/v2/test_users.py", line 71, in test_user_update_own_password
    self.non_admin_users_client.token)
  File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 480, in assertRaises
    self.assertThat(our_callable, matcher)
  File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 493, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: <bound method TokenClient.auth_token of <tempest.lib.services.identity.v2.token_client.TokenClient object at 0x7fa26c484610>> returned {u'token': {u'expires': u'2016-05-06T00:13:53Z', u'issued_at': u'2016-05-05T23:13:54.000000Z', u'audit_ids': [u'mbdiQZcNT5GxEUebXZqKOA', u'BAlcCwKLS9Co8C3jg2vfAw'], u'id': u'gAAAAABXK9Oyhuw7yBJJehrIIGlzIB8VTbgnM_M5Cve9q0BEHeZ2xNohJ_lkVqp7kicVbNgZ93p2dcLHfUfXWCcPvO4BWkTIry1mAGSvhzeLI7SYxSS6CBpeGK0FH3Uf_5vhHTCWFvcDvKOSzajGImeN7GaYts91H1zsXV7B1HRs0xN-4LADokI'}, u'metadata': {u'roles': [], u'is_admin': 0}, u'serviceCatalog': [], u'user': {u'roles_links': [], u'username': u'tempest-IdentityUsersTest-972219078', u'name': u'tempest-IdentityUsersTest-972219078', u'roles': [], u'id': u'97a1836c5a2c40c99575e46aa37b8b50'}}

example failures: http://logs.openstack.org/17/311617/1/gate/gate-tempest-dsvm-neutron-linuxbridge/084f25d/logs/testr_results.html.gz

and http://logs.openstack.org/91/312791/2/check/gate-tempest-dsvm-full/88d9fff/logs/testr_results.html.gz

Steve Martinelli (stevemar) wrote :

this is likely due to a race condition between when the old tokens are revoked and the user authenticating again.
definitely related to fernet tokens, since that change landed and these failures started to appear.
this resulting in about 20 hits in the last 12 hours.

Brant Knudson (blk-u) wrote :

that test even has a sleep in it!

Lance Bragstad (lbragstad) wrote :

I have a tempest/devstack environment stood up with the following SHAs:

devstack 8d27280f3e845841e78acf659a7e8b605122517e
tempest 48241069be894c28bc4e679b23b776cd82cb9ff6
keystone 7a18200ff6fb80a2408dbbe3172fe73dfb13366c

I have run tempest.api.identity.v2.test_users.IdentityUsersTest.test_user_update_own_password in a loop 50+ consecutively without a failure. Is there something else about the setup that I'm missing in order to recreate this?

Lance Bragstad (lbragstad) wrote :

I don't think this is specific to v2. I've run tempest.api.identity.v2.test_users.IdentityUsersTest.test_user_update_own_password 200+ times consecutively without failure, so I think v2 it in good shape. There are v3 failures in the logs Matt linked above though. I'm going to do the same procedure but running tempest.api.identity.v3.test_users.IdentityV3UsersTest.test_user_update_own_password consecutively.

Lance Bragstad (lbragstad) wrote :

Brant and I were both able to run the tempest.api.identity.v3.test_users.IdentityV3UsersTest.test_user_update_own_password test consecutively without getting a failure. In isolation these tests seem work fine. It could be due to something happening concurrently?

Jordan Pittier (jordan-pittier) wrote :

I use the following logstash query:

message:"self.assertRaises(lib_exc.Unauthorized," AND tags:console

It's broader, still 100% failure. It only affects job that run "devstack master" (so Keystone master) and not stable branches.

Brian Haley (brian-haley) wrote :

I've been seeing this in the neutron functional check jobs, with both keystone v3:

http://logs.openstack.org/84/314384/2/check/gate-tempest-dsvm-neutron-dvr/cc154a3/console.html

and v2:

http://logs.openstack.org/84/306984/15/check/gate-tempest-dsvm-neutron-linuxbridge/1572ec2/console.html

I don't think either one of those changes should be tweaking anything in keystone.

Fix proposed to branch: master
Review: https://review.openstack.org/315735

Changed in keystone:
assignee: nobody → Lance Bragstad (lbragstad)
status: Confirmed → In Progress

Here comes a wild theory.

The revocation api in keystone is what's used in the change password workflow. When a user changes their password a revocation event is stored in keystone that details some information about the user and the time that the user made the password change (which is stored as revoked_at and issued_before attributes of the revocation event). During token validation, if a token's issued_at time is less than or equal to a revocation event's issued_before time, we consider the token invalid, erring on the side of security. This is where the one-second waits come into play. This works if sql *truncates* microsecond precision from the timestamp of the revocation event. For example, if a user gets a token at 10:52:25.000002Z and changes their password at 10:52:25.000004Z, both the token's issued_at time and the revocation event's issued_before times will be stored as 10:52:25.000000Z. The token's issued_at time is stored in the Fernet token and the revocation event's issued_before time is stored in mysql. When the uses goes to validate the token they got at 10:52:25.000002Z, it will be revoked because keystone is comparing 10:52:25.000000Z to 10:52:25.000000Z and notices they are equal [0]. This is all assuming mysql is truncating extra precision from the token.

That might not be the case in the event mysql attempts to round a datetime value based on the precision. Which is apparently possible with some version of mysql [1].

Is it possible to figure out what version of mysql was running in the failure [2]? If so - I'd be happy to try and recreate this with that exact version of mysql.

[0] https://github.com/openstack/keystone/blob/0f579aa362f021b17f7c2931ffca309e395bd51f/keystone/models/revoke_model.py#L223
[1] http://stackoverflow.com/questions/29711102/sqlalchemy-mysql-millisecond-or-microsecond-precision
[2] http://logs.openstack.org/30/314330/3/check/gate-tempest-dsvm-neutron-dvr/3d9272f/

Dr. Jens Harbott (j-harbott) wrote :

Can this be related to keystone only storing the token revocation in in-memory kvs? And keystone process running under Apache being respawned occasionally?

Lance Bragstad (lbragstad) wrote :

Jens,

Is that something that is done in certain devstack jobs? I've been trying to recreate this with tempest + keystone setup by devstack locally, where the revocation events are stored in sql [0].

[0] https://github.com/openstack/keystone/blob/64925bf698cb45fde3c0941cfcbec5b45d31a0d7/keystone/common/config.py#L348-L353

Lance Bragstad (lbragstad) wrote :

Samuel has proposed some patches to keystone that add logging to the revocation API [0]. We expected these patches to fail only to get more information. In one of the failures [1] you can see that the revocation event that should match the token isn't even stored [2]. It seems this may actually be a race condition between storing the revocation event and validating the token. I'm not entirely sure how this makes sense because the revocation event should be stored before keystone returns a 204 to the user. It could be something to do with using notifications to send revocation events? Still, we only use the notification framework to emit certain revocation events. There are places in keystone where revocation events are stored without using the notification API.

[0] https://review.openstack.org/#/q/topic:token-revocation-issue
[1] http://logs.openstack.org/56/316256/4/check/gate-tempest-dsvm-postgres-full/6956621/console.html#_2016-05-13_23_39_15_618
[2] http://pastebin.com/tRdTUA6n

Lance Bragstad (lbragstad) wrote :

This behavior is happening against more than just test_user_update_own_password. I think we can trace this behavior to any tests that are receiving a valid authentication response when they expect an invalid one.

Another common test that is failing with these conditions are the negative role tests [0].

[0] https://github.com/openstack/tempest/blob/c69d06dbb4c14eec305ed6f0db947af77427c04c/tempest/api/identity/admin/v2/test_roles_negative.py#L66-L73

summary: - test_user_update_own_password failing intermittently
+ Race condition between token validation and revocation API causes
+ intermittent gate failures.

From what I can tell Neutron is not the only project affected.

no longer affects: neutron
Adam Young (ayoung) wrote :

I think it is fixed by this change, which, among other things, removes caching.
https://review.openstack.org/#/c/311652/

Morgan Fainberg (mdrnstm) wrote :

I think has been solved with the linear search for rev. events.

Changed in keystone:
milestone: none → newton-3
Changed in keystone:
status: In Progress → Fix Released
milestone: newton-3 → newton-1

Change abandoned by Lance Bragstad (<email address hidden>) on branch: master
Review: https://review.openstack.org/315735
Reason: This is no longer needed since we landed https://review.openstack.org/#/c/368244/

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers