Keystone causing high MySQL usage

Bug #1182481 reported by Adam
This bug affects 12 people
Affects: OpenStack Identity (keystone)
Status: Invalid
Importance: High
Assigned to: Unassigned

Bug Description

I used this install guide: https://github.com/mseknibilel/OpenStack-Grizzly-Install-Guide/blob/OVS_MultiNode/OpenStack_Grizzly_Install_Guide.rst

(Grizzly)
One controller/network node and four compute nodes, all running Ubuntu 13.04 Server 64-bit. Keystone runs on the controller with a local MySQL database (15K SAS drives).

As you can see below, MySQL is constantly running at 100% CPU utilization. Any API call that needs Keystone authentication triggers this.
10655 mysql 20 0 3970m 554m 7740 S 101.7 3.5 75:53.22 /usr/sbin/mysqld

API calls take 5-8 seconds to return a value; a simple nova list takes 5 seconds. I have checked the Keystone logs and nothing is being logged there. I enabled slow query logging in my.cnf and noticed the following:

================================================================================================
SET timestamp=1368712775;
SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
FROM token
WHERE token.expires > '2013-05-16 13:59:32' AND token.valid = 0;
# Time: 130516 9:59:39
# User@Host: keystone[keystone] @ [10.70.0.1]
# Query_time: 3.804186 Lock_time: 0.000048 Rows_sent: 3 Rows_examined: 117374
SET timestamp=1368712779;
SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
FROM token
WHERE token.expires > '2013-05-16 13:59:36' AND token.valid = 0;
# Time: 130516 9:59:43
# User@Host: keystone[keystone] @ [10.70.0.1]
# Query_time: 3.800959 Lock_time: 0.000047 Rows_sent: 3 Rows_examined: 117375
SET timestamp=1368712783;
SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
FROM token
WHERE token.expires > '2013-05-16 13:59:39' AND token.valid = 0;
# Time: 130516 9:59:47
# User@Host: keystone[keystone] @ [10.70.0.1]
# Query_time: 3.810399 Lock_time: 0.000048 Rows_sent: 3 Rows_examined: 117377
SET timestamp=1368712787;
SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
FROM token
WHERE token.expires > '2013-05-16 13:59:43' AND token.valid = 0;
# Time: 130516 9:59:51
# User@Host: keystone[keystone] @ [10.70.0.1]
# Query_time: 3.871686 Lock_time: 0.000042 Rows_sent: 3 Rows_examined: 117377
SET timestamp=1368712791;
SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
FROM token
WHERE token.expires > '2013-05-16 13:59:47' AND token.valid = 0;
# Time: 130516 9:59:55
# User@Host: keystone[keystone] @ [10.70.0.1]
# Query_time: 3.928294 Lock_time: 0.000050 Rows_sent: 3 Rows_examined: 117379
SET timestamp=1368712795;
SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
FROM token
WHERE token.expires > '2013-05-16 13:59:51' AND token.valid = 0;
================================================================================================

As you can see, the same query runs over and over, taking about 4 seconds each time. This is not a large OpenStack install at all: I have 3 projects with fewer than 20 instances total.

Please help me determine if this is a bug or a configuration issue.

================================================================================================
keystone.conf

[DEFAULT]

log_file = keystone.log

# The directory to keep log files in (will be prepended to --logfile)
log_dir = /var/log/keystone

[sql]
connection = mysql://keystone:pass@10.70.0.1/keystone
# the timeout before idle sql connections are reaped
idle_timeout = 200

[identity]
driver = keystone.identity.backends.sql.Identity

[trust]
driver = keystone.trust.backends.sql.Trust

[catalog]
driver = keystone.catalog.backends.sql.Catalog

[token]
driver = keystone.token.backends.sql.Token

[policy]
driver = keystone.policy.backends.sql.Policy

[ec2]
driver = keystone.contrib.ec2.backends.sql.Ec2

[ssl]

[signing]

[ldap]

[auth]
methods = password,token
password = keystone.auth.plugins.password.Password
token = keystone.auth.plugins.token.Token

[filter:debug]
paste.filter_factory = keystone.common.wsgi:Debug.factory

[filter:token_auth]
paste.filter_factory = keystone.middleware:TokenAuthMiddleware.factory

[filter:admin_token_auth]
paste.filter_factory = keystone.middleware:AdminTokenAuthMiddleware.factory

[filter:xml_body]
paste.filter_factory = keystone.middleware:XmlBodyMiddleware.factory

[filter:json_body]
paste.filter_factory = keystone.middleware:JsonBodyMiddleware.factory

[filter:user_crud_extension]
paste.filter_factory = keystone.contrib.user_crud:CrudExtension.factory

[filter:crud_extension]
paste.filter_factory = keystone.contrib.admin_crud:CrudExtension.factory

[filter:ec2_extension]
paste.filter_factory = keystone.contrib.ec2:Ec2Extension.factory

[filter:s3_extension]
paste.filter_factory = keystone.contrib.s3:S3Extension.factory

[filter:url_normalize]
paste.filter_factory = keystone.middleware:NormalizingFilter.factory

[filter:sizelimit]
paste.filter_factory = keystone.middleware:RequestBodySizeLimiter.factory

[filter:stats_monitoring]
paste.filter_factory = keystone.contrib.stats:StatsMiddleware.factory

[filter:stats_reporting]
paste.filter_factory = keystone.contrib.stats:StatsExtension.factory

[filter:access_log]
paste.filter_factory = keystone.contrib.access:AccessLogMiddleware.factory

[app:public_service]
paste.app_factory = keystone.service:public_app_factory

[app:service_v3]
paste.app_factory = keystone.service:v3_app_factory

[app:admin_service]
paste.app_factory = keystone.service:admin_app_factory

[pipeline:public_api]
pipeline = access_log sizelimit stats_monitoring url_normalize token_auth admin_token_auth xml_body json_body debug ec2_extension user_crud_extension public_service

[pipeline:admin_api]
pipeline = access_log sizelimit stats_monitoring url_normalize token_auth admin_token_auth xml_body json_body debug stats_reporting ec2_extension s3_extension crud_extension admin_service

[pipeline:api_v3]
pipeline = access_log sizelimit stats_monitoring url_normalize token_auth admin_token_auth xml_body json_body debug stats_reporting ec2_extension s3_extension service_v3

[app:public_version_service]
paste.app_factory = keystone.service:public_version_app_factory

[app:admin_version_service]
paste.app_factory = keystone.service:admin_version_app_factory

[pipeline:public_version_api]
pipeline = access_log sizelimit stats_monitoring url_normalize xml_body public_version_service

[pipeline:admin_version_api]
pipeline = access_log sizelimit stats_monitoring url_normalize xml_body admin_version_service

[composite:main]
use = egg:Paste#urlmap
/v2.0 = public_api
/v3 = api_v3
/ = public_version_api

[composite:admin]
use = egg:Paste#urlmap
/v2.0 = admin_api
/v3 = api_v3
/ = admin_version_api
================================================================================================

Also note that I have an Essex environment up and running on similar hardware with no MySQL performance issues.

Thanks in advance,
Adam

Revision history for this message
pippo (claudio-enjoy) wrote :

I have partially resolved this by editing the MySQL config in my.cnf and raising innodb_buffer_pool_size (the default is 10M; I set it to 1024M).

The problem is that Keystone issues a very large number of queries against the token table, and with the cache memory that low MySQL keeps reading from and writing to disk, which is far slower than RAM. The 'mytop' tool is very useful for monitoring MySQL. ;)
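
For reference, a minimal my.cnf sketch of that change (the 1024M figure is just the value used above; size the pool to the RAM you can spare):

  [mysqld]
  # InnoDB buffer pool; the default (~10M) is far too small for the token
  # table and forces constant disk I/O. Size this to hold the working set.
  innodb_buffer_pool_size = 1024M

Restart MySQL afterwards (e.g. service mysql restart) for the setting to take effect.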

Revision history for this message
pippo (claudio-enjoy) wrote :

This resolves the high-CPU problem much better.
Check again whether you have NTP installed; if not, install it, and then:

 /etc/init.d/ntp stop
 date -s "`date`"
 /etc/init.d/ntp start

Revision history for this message
Adam (adam500) wrote :

Claudio,

This has helped. First I cleared all tokens older than one day out of the database, which made things snappy again. Then I tuned my.cnf and added innodb_buffer_pool_size, and I have not been seeing any slow queries since.

Note that my token table only has about 70K rows right now; when I was having the issue it was over 250K. I will update this bug once I am back to 250K rows or more.

It just seems that with the default MySQL settings Keystone will hang now that the much larger PKI tokens are being stored in the database.

===============
I have tuned my /etc/mysql/my.cnf with the following:

key_buffer = 2048M
max_allowed_packet = 2048M
thread_stack = 512M
thread_cache_size = 1024
query_cache_limit = 2048M
query_cache_size = 128M
innodb_buffer_pool_size = 2048M

Restart MySQL after making the changes.

This has helped greatly.

I also found that flushing the keystone tokens and only keeping ones that are less than a day old helped out.

Here is the MySQL command:
USE keystone; DELETE FROM token WHERE NOT DATE_SUB(CURDATE(), INTERVAL 1 DAY) <= expires;
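
The same purge written against UTC (Keystone stores expires in UTC, so CURDATE() on a non-UTC server can be off by a few hours), in a form that could be dropped into a nightly cron job; the credentials here are illustrative:

  mysql -u keystone -p keystone -e \
    "DELETE FROM token WHERE expires < DATE_SUB(UTC_TIMESTAMP(), INTERVAL 1 DAY);"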

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

This issue is now being seen in our perf lab (all bare metal: 32 cores, 256 GB RAM, SSD+SAN mixed storage). It looks like Keystone is destabilizing the environment by hogging the CPU; it might be similar to this defect:
https://bugs.launchpad.net/keystone/+bug/1182481
Note: my DB is PostgreSQL with its buffer (256 MB) and max_connections (800) set appropriately. There is no paging out.

Keystone version:
Version: 1:2013.1-0ubuntu1.1~cloud0
Depends: python, upstart-job, python-keystone (= 1:2013.1-0ubuntu1.1~cloud0), adduser, ssl-cert (>= 1.0.12)

Typical response times (in seconds) from the nova-api log:

>> Switching from Admin tab to Project tab
2013-05-28 18:12:08.932 37206 INFO nova.osapi_compute.wsgi.server [-] (37206) accepted ('10.1.56.12', 52908)

2013-05-28 18:12:14.679 INFO nova.osapi_compute.wsgi.server [req-62e65521-d641-4c8f-a89d-fef317d1be8a 2efad4b253f64b4dae65a28f45438d93 4f342d62fff843fab63dc03316d34251] 10.1.56.12 "GET /v2/4f342d62fff843fab63dc03316d34251/servers/detail?project_id=4f342d62fff843fab63dc03316d34251 HTTP/1.1" status: 200 len: 22682 time: 5.7459810

>> Switching from Overview to Instances (with only 14 instances running)
2013-05-28 18:12:21.005 37206 INFO nova.osapi_compute.wsgi.server [-] (37206) accepted ('10.1.56.12', 52968)

2013-05-28 18:12:28.297 INFO nova.osapi_compute.wsgi.server [req-8087be8b-b44c-45ab-a4bc-9572c0984e65 2efad4b253f64b4dae65a28f45438d93 4f342d62fff843fab63dc03316d34251] 10.1.56.12 "GET /v2/4f342d62fff843fab63dc03316d34251/servers/detail?project_id=4f342d62fff843fab63dc03316d34251 HTTP/1.1" status: 200 len: 22682 time: 7.2911589

>> Switching from Instances to Volume (with only 1 volume)
2013-05-28 18:16:30.220 37206 INFO nova.osapi_compute.wsgi.server [-] (37206) accepted ('10.1.56.12', 53243)

2013-05-28 18:16:38.533 INFO nova.osapi_compute.wsgi.server [req-729f3817-0362-4de1-a362-ce3ee856faca 2efad4b253f64b4dae65a28f45438d93 4f342d62fff843fab63dc03316d34251] 10.1.56.12 "GET /v2/4f342d62fff843fab63dc03316d34251/servers/detail?project_id=4f342d62fff843fab63dc03316d34251 HTTP/1.1" status: 200 len: 22682 time: 8.3117819

I will investigate more and keep you posted. In the meantime, if anyone has any clues, let me know.

Revision history for this message
pippo (claudio-enjoy) wrote :

Yes, for some reason the older tokens are not being deleted:

https://blueprints.launchpad.net/keystone/+spec/keystone-manage-token-flush
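
For anyone on a release that includes that blueprint, a sketch of how the flush would typically be run (the cron schedule and paths are illustrative):

  # Remove expired tokens using Keystone's own tooling:
  keystone-manage token_flush

  # Illustrative /etc/cron.d entry running it hourly as the keystone user:
  0 * * * * keystone /usr/bin/keystone-manage token_flush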

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

In my scenario, flushing tokens does not make much of an improvement. The queries take only a few milliseconds to finish:

2013-05-30 16:39:34 PDT,12998,[unknown]LOG: statement: SELECT token.id AS token_id, token.expires AS token_expires, token.extra AS token_extra, token.valid AS token_valid, token.user_id AS token_user_id, token.trust_id AS token_trust_id
        FROM token
        WHERE token.expires > '2013-05-30T23:39:34.813557'::timestamp AND token.valid = false
2013-05-30 16:39:34 PDT,12998,[unknown]LOG: duration: 19.337 ms

However, what I am noticing is that if I de-provision all the launched instances, navigating through the various tabs finishes in milliseconds. If I keep launching VMs (I am at 100 right now), the response time climbs to 40-60 seconds. The same behavior is seen with the nova list command. So it looks like a factor of the number of instances or objects in OpenStack.

nova list
2013-05-30 10:07:46.063 37206 INFO nova.osapi_compute.wsgi.server [-] (37206) accepted ('10.1.56.12', 40240)
2013-05-30 10:08:26.249 INFO nova.osapi_compute.wsgi.server [req-fa0cf5a9-270a-4a03-92fe-e0d124919e16 2efad4b253f64b4dae65a28f45438d93 4f342d62fff843fab63dc03316d34251] 10.1.56.12 "GET /v2/4f342d62fff843fab63dc03316d34251/servers/detail HTTP/1.1" status: 200 len: 153050 time: 40.1845958

Observations on nova list:
1. The overall response time to enumerate 100 launched instances is 40 seconds.
2. Throughout those 40 seconds, keystone-all uses about 70% CPU (on a 32-core system).
3. I took a count of keystone.token rows before and after executing 'nova list'; the count bumped up from 46,671 to 46,884 (around 200 rows). Does this mean nova goes to Keystone for every instance it enumerates? (See the query sketch below.)
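
A minimal sketch of that before/after count (psql connected to the keystone database in this case; only standard SQL is used):

  -- run once before and once after 'nova list', then compare:
  SELECT COUNT(*) FROM token;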

Revision history for this message
Dolph Mathews (dolph) wrote :

Is token flush not deleting expired tokens?

If you have token flush, then I assume you're using trunk rather than grizzly, specifically? A related patch to try:

  https://review.openstack.org/#/c/30753/

Unless there's an actionable issue we can pursue here, this should be converted to a question or closed.

Changed in keystone:
status: New → Incomplete
importance: Undecided → Medium
Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

I don't have the recent code. Tokens do pile up in the token table, and I keep flushing them manually on a regular basis. I have certainly noticed that a transaction which typically takes 20 ms to finish goes up to 110 ms with 350K rows in the token table. System resources are fine, so this latency is nothing compared to what "nova list" shows (40 seconds to enumerate 100 instances and 190 seconds for 450 instances). I guess I need to track this with a separate defect or question.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Keystone because there has been no activity for 60 days.]

Changed in keystone:
status: Incomplete → Expired
Revision history for this message
Adam (adam500) wrote :

Update:

Now that our environment has grown, simply deleting the expired tokens is not helping. I have found a temporary workaround: reverting from PKI tokens to UUID-based tokens.

MySQL usage has dropped from 110% to 10-15% CPU and it is very quick now. If you are interested in making this change, it is very simple. Edit your keystone.conf file and set the following:
 token_format = UUID

Restart the keystone service and you are good to go.
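
For reference, a minimal sketch of that change (on Grizzly the option normally lives in the [signing] section of keystone.conf; adjust to wherever your config carries it):

  [signing]
  # Revert from PKI to UUID tokens (temporary workaround for the MySQL load)
  token_format = UUID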

This is just temporary, as I am sure we would all like the added benefits of PKI.

Here are some interesting ideas that may be worth checking out:

http://blog.gridcentric.com/bid/318277/Boosting-OpenStack-s-Parallel-Performance

I have not had time to test them yet but plan to soon and will update this bug.

Thanks,
Adam

Revision history for this message
Brian Aker (brianaker) wrote :

There is no index on the table being used to do token lookup.

You can automate the deletion of expired tokens via MySQL's "CREATE EVENT ...".
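
A sketch of what that could look like (the event name and schedule are illustrative; Keystone stores expires in UTC, hence UTC_TIMESTAMP):

  -- The event scheduler must be enabled for events to fire:
  SET GLOBAL event_scheduler = ON;

  -- Purge expired tokens once a day:
  CREATE EVENT keystone.purge_expired_tokens
    ON SCHEDULE EVERY 1 DAY
    DO DELETE FROM keystone.token WHERE expires < UTC_TIMESTAMP();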

Revision history for this message
Dolph Mathews (dolph) wrote :

> There is no index on the table being used to do token lookup.

That would certainly explain the issue described here! If you're referring to the token.expires column, this should be fixed in master by migration 24.

Adam Young (ayoung)
Changed in keystone:
assignee: nobody → Adam Young (ayoung)
status: Expired → Confirmed
Revision history for this message
Doug Schaapveld (djschaap) wrote :

+1 for adding an index. My Grizzly test environment was completely unstable (many services had begun to time out intermittently with error 504) due to slow Keystone responses. A simple "SELECT COUNT(*) FROM token" would take nearly a minute (approximately 80,000 rows; the oldest expiration date was only 9 days old). After I created an index ("ALTER TABLE token ADD INDEX IX_token_expires (expires);"), the same select returned instantly. I'm only on migration 22.

Revision history for this message
Dolph Mathews (dolph) wrote :

This was fixed in 2879d42d by migration 24, which adds an index on `token.expires`.
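
For anyone checking whether their deployment already has the fix, a quick verification sketch (SHOW INDEX is plain MySQL; keystone-manage db_sync applies any pending migrations, including this one):

  -- List the indexes on the token table and look for one covering `expires`:
  SHOW INDEX FROM keystone.token;

  -- If it is missing, upgrading the schema applies it:
  --   keystone-manage db_sync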

Changed in keystone:
importance: Medium → High
assignee: Adam Young (ayoung) → nobody
status: Confirmed → Invalid