Keystone token can be missed due to the keystone+memcached design

Bug #1490927 reported by Leontii Istomin
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Won't Fix
Importance: Medium
Assigned to: MOS Keystone

Bug Description

During the boot_and_list_server rally scenario, the 618th iteration failed.
from rally.log: http://paste.openstack.org/show/437145/
500 code from haproxy at the time: http://paste.openstack.org/show/437210/
from nova-all.log on node-227: http://paste.openstack.org/show/437213/
from glance-all.log on node-224: http://paste.openstack.org/show/437303/
401 code from haproxy at the time: http://paste.openstack.org/show/437305/
from keystone_wsgi_admin_error.log on node-224: http://paste.openstack.org/show/437306/

The test was started at 29 07:46:07.175. Rally created the tokens around 29 07:50:54. It seems the keystone token had expired, therefore nova/nova_client couldn't be authorized to get the image from glance.

Cluster configuration:
Ubuntu,IBP,HA,Neutron-vlan,DVR,Ceph-all,Nova-debug,Nova-quotas,7.0-182
Controllers: 3, Computes: 180, Computes+Ceph: 20

api: '1.0'
astute_sha: e24ca066bf6160bc1e419aaa5d486cad1aaa937d
auth_required: true
build_id: 2015-08-17_03-04-59
build_number: '182'
feature_groups:
- mirantis
fuel-agent_sha: 57145b1d8804389304cd04322ba0fb3dc9d30327
fuel-library_sha: 9de2625d26c3b88d22082baecd789b6bd5ddf3fa
fuel-nailgun-agent_sha: e01693992d7a0304d926b922b43f3b747c35964c
fuel-ostf_sha: 17786b86b78e5b66d2b1c15500186648df10c63d
fuelmain_sha: d8c726645be087bc67e2eeca134f0f9747cfeacd
nailgun_sha: 4710801a2f4a6d61d652f8f1e64215d9dde37d2e
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 4c74a60aa60c06c136d9197c7d09fa4f8c8e2863
release: '7.0'

Diagnostic snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-02_05-59-17.tar.xz

Tags: scale
Revision history for this message
Leontii Istomin (listomin) wrote :

If a token is expired, we can see a line like this in the keystone log:
2015-09-01 12:13:55.175 33001 DEBUG keystone.token.persistence.backends.kvs [-] Token `7a06d0ebeb9142c6883d148e92d6393a` is expired, removing from `usertokens-4bfc4dec0774451c8eeba4761eb51f3e`. _update_user_token_list /usr/lib/python2.7/dist-packages/keystone/token/persistence/backends/kvs.py:183

We can't see a similar line for token 300c8d6c6dce45c4a6a377eb070fd1af.

Changed in mos:
assignee: MOS Nova (mos-nova) → MOS Keystone (mos-keystone)
Boris Bobrov (bbobrov)
Changed in mos:
assignee: MOS Keystone (mos-keystone) → Boris Bobrov (bbobrov)
Revision history for this message
Boris Bobrov (bbobrov) wrote :

> If token is expired we can see the line like this in keystone log:

Not necessarily. With our changes it can happen some time after the token has expired. These log entries are about cleanup, and not seeing this message does not mean that the token is not expired.

In the rally log I see this message: http://paste.openstack.org/show/437684/ . I think it was Rally that sent the expired token to OpenStack.

Keystone works as expected here.

Changed in mos:
assignee: Boris Bobrov (bbobrov) → MOS Nova (mos-nova)
affects: mos → rally
Changed in rally:
milestone: 7.0 → none
assignee: MOS Nova (mos-nova) → nobody
Revision history for this message
Leontii Istomin (listomin) wrote :

Rally team, could you please implement requesting new keystone tokens before they reach the expired state? Please review the following logic:
1. At the beginning of the scenario we request a token.
2. We look at the expiration time of the token and calculate the difference between the current time and the time when the token will expire. This is the expiration time configured in keystone.conf by the "expiration" parameter.
3. We start the benchmark.
4. When test time = expiration_time/2, we recreate the token with the force-new-token option.

In other words, we request a new token before it expires (a sketch of this logic follows).
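
A minimal sketch of the proposed logic, assuming a keystoneauth1-style session; the auth_url/credentials, the expiration value and boot_and_list_server are illustrative placeholders, not existing Rally options:

import time

from keystoneauth1 import session
from keystoneauth1.identity import v3


def boot_and_list_server(sess):
    """Placeholder for the actual scenario body."""


def run_benchmark(iterations, expiration=3600):
    auth = v3.Password(auth_url="http://keystone:5000/v3",
                       username="admin", password="secret", project_name="admin",
                       user_domain_id="default", project_domain_id="default")
    sess = session.Session(auth=auth)

    sess.get_token()                                  # 1. request a token up front
    issued_at = time.time()                           # 2. remember when it was issued

    for _ in range(iterations):                       # 3. run the benchmark
        if time.time() - issued_at > expiration / 2.0:
            auth.invalidate()                         # 4. drop the cached token...
            sess.get_token()                          #    ...and request a fresh one
            issued_at = time.time()
        boot_and_list_server(sess)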

Revision history for this message
Boris Pavlovic (boris-42) wrote :

Leontiy,

I think you are mistaken about how Rally works.

We don't get tokens at the beginning of test execution.
Rally creates a token for each iteration, not only for the first one.
It means that even after N hours of execution each iteration will get its own fresh token.

In reality the proposed algorithm won't work, for several reasons:
1) First of all, we don't request a single token up front.
2) Second, monkey-patching scenarios to recreate the token in the middle of a run seems like a bad idea.
3) Rally doesn't have access to keystone.conf

By the way, requesting a fresh token (on expiration) is actually part of the functionality of the python clients and should work out of the box (see the sketch below).
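
For illustration, a minimal sketch of that out-of-the-box behaviour with a keystoneauth1 session (credentials and URLs are placeholders): the session caches the token and, on a 401, invalidates it and re-authenticates before retrying the request.

from keystoneauth1 import session
from keystoneauth1.identity import v3
from novaclient import client as nova_client

auth = v3.Password(auth_url="http://keystone:5000/v3",
                   username="admin", password="secret", project_name="admin",
                   user_domain_id="default", project_domain_id="default")
sess = session.Session(auth=auth)
nova = nova_client.Client("2", session=sess)

# If the cached token has expired, the 401 from nova-api makes the session
# invalidate it, fetch a new token and retry the request transparently.
print(nova.servers.list())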

Revision history for this message
Leontii Istomin (listomin) wrote :

In nova-all.log on node-227 (http://paste.openstack.org/show/437213/) we can see that the user with id=ddd5f4ffa6f246eaab14415e45de7685 was used, but
in glance-all.log on node-224 (http://paste.openstack.org/show/437303/) we can see a different user, with id=4b95f7b246224d509cf38462818385c8.
The paste from the glance logs is not related to our issue (ImageNotAuthorized: Not authorized for image 96d56d18-7faa-4196-893a-320a8cf056f8); we started to look at it only because the timestamp is the same as in nova.log.

So we need to investigate again, starting from the nova trace.

Revision history for this message
Leontii Istomin (listomin) wrote :

The full chain of actions:
1. Rally made a request to Nova: http://paste.openstack.org/show/440800/
The SHA1 of the token is {SHA1}c352daec08d32dd10b2604e4c5129a522f93c759.
2. Nova checked the token: http://paste.openstack.org/show/440801/
The expiration time of the token is 2015-08-29T09:50:51.000000Z.
3. Nova started creating an instance: http://paste.openstack.org/show/440802/
4. Nova doesn't log messages from the Glance client, but the SHA1 (c352daec08d32dd10b2604e4c5129a522f93c759) of the token (300c8d6c6dce45c4a6a377eb070fd1af) from the error in the glance log (http://paste.openstack.org/show/437303/) is the same as in the rally request from step 1 (a quick check is sketched below).
From keystone we can see that keystone couldn't find the token: http://paste.openstack.org/show/437306/
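
The correlation in step 4 can be re-checked directly: the clients log tokens as "{SHA1}<sha1 of the token id>", so hashing the token id from the glance error should reproduce the value from the rally request (an illustrative check):

import hashlib

token_id = "300c8d6c6dce45c4a6a377eb070fd1af"
print("{SHA1}" + hashlib.sha1(token_id.encode("utf-8")).hexdigest())
# per the correlation above, this should print
# {SHA1}c352daec08d32dd10b2604e4c5129a522f93c759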

So it looks like we are facing behavior where a token can be missed. It is a known issue for keystone+memcached and should be fixed in 8.0 by the "Fernet token" feature implementation.

affects: rally → mos
Changed in mos:
assignee: nobody → MOS Keystone (mos-keystone)
milestone: none → 8.0
summary: - Nova doesn't request new keystone token if old token expires.
+ Keystone token can be missed due to the keystone+memcached design
Changed in mos:
importance: Undecided → Medium
Changed in mos:
status: New → Confirmed
Revision history for this message
Leontii Istomin (listomin) wrote :

From memcached.conf we can see that the memory parameter is set to 128931 MB. At the moment it isn't clear why we have missed tokens, therefore we need to investigate it for 7.0.

Changed in mos:
milestone: 8.0 → 7.0
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

My current understanding is that we are missing the patch:

https://review.fuel-infra.org/#/q/I59c572b1c572ef5b81670837ec11daa3d1137e14,n,z

(due to the fact that we were originally going to have Fernet tokens in 7.0)

Without that patch it's possible for one or more memcache servers to temporarily go down, which changes the key sharding logic:

https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L399-L417

and when all memcache servers are online again, keystone will look for the token key in the wrong memcache instance and mistakenly treat a valid token as a non-existing one (a toy illustration follows).
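
A toy illustration of the effect (not python-memcached's exact hashing code, which is linked above): when the list of usable servers shrinks, the same key can map to a different server, so a token stored while all servers were up may later be looked up on the wrong instance.

import zlib


def pick_server(key, live_servers):
    # Simplified stand-in for python-memcached's bucket selection: hash the key
    # and map it onto the servers that are currently considered usable.
    return live_servers[zlib.crc32(key.encode("utf-8")) % len(live_servers)]


servers = ["10.0.0.2:11211", "10.0.0.3:11211", "10.0.0.4:11211"]
token_key = "token-300c8d6c6dce45c4a6a377eb070fd1af"   # illustrative cache key

print(pick_server(token_key, servers))        # server chosen while all three are up
print(pick_server(token_key, servers[1:]))    # can differ while one server is down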

Also, it seems that keystonemiddleware or Nova itself caches the results of token validation, so within a few seconds you can see a token successfully validated in Nova and then a 401 in Glance for the very same token.

Revision history for this message
Boris Bobrov (bbobrov) wrote :

I don't think so. I think the issue is that one of the components/clients does not update the token when it's invalid. The request in comment #2 is made with an already expired token, and keystone, as it should, returns 404 (401). Memcache did not "blink" during that test.

Although the patch mentioned above is indeed missing, it's not the cause.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Boris, indeed, memcache didn't seem to fail (at least, it didn't restart).

Still, for some reason Keystone can't find a valid token:

1) rally authenticates and calls novaclient: http://paste.openstack.org/show/444060/

2) nova validates the token in keystone and shows it's 'fresh' and will expire only in about an hour: http://paste.openstack.org/show/444061/

3) moments later validation of the very same token fails in keystone (when nova talks to glance and the latter tries to validate the token): http://paste.openstack.org/show/444062/

This doesn't seem to be token expiry.

Igor Shishkin (teran)
Changed in mos:
milestone: 7.0 → 8.0
Revision history for this message
Alexander Makarov (amakarov) wrote :

Adding link: https://bugs.launchpad.net/fuel/+bug/1515350
Root cause is network failure

Changed in mos:
status: Confirmed → Triaged
Revision history for this message
Boris Bobrov (bbobrov) wrote :

I think this is not a problem since we moved to Fernet tokens in 8.0

Changed in mos:
status: Triaged → Won't Fix