Mitaka: dashboard performance

Bug #1587777 reported by eblock@nde.ag on 2016-06-01
This bug affects 1 person
Affects: OpenStack Identity (keystone)
Importance: Medium
Assigned to: Unassigned

Bug Description

Environment: OpenStack Mitaka on top of Leap 42.1; 1 control node, 2 compute nodes, 3-node Ceph cluster.

Issue: Since switching to Mitaka, we're experiencing severe delays when accessing the dashboard - e.g. switching between "Compute - Overview" and "Compute - Instances" takes 15+ seconds, even after multiple invocations.

Steps to reproduce:
1. Install OpenStack Mitaka, including the dashboard, and navigate through the dashboard.

Expected result:
Browsing through the dashboard with reasonable waiting times.

Actual result:
Refreshing the dashboard can take up to 30 secs; switching between views (e.g. volumes to instances) takes about 15 secs on average.

Additional information:
I've had a look at the requests, the Apache logs and our control node's stats and noticed that it's a single call that's taking all the time. I see no indication of any error; it seems that once WSGI is invoked, that call simply takes its time. Intermediate curl requests are logged, so I can see it's doing its work. Looking at "vmstat" I can see that it's user space taking all the load: Apache/mod_wsgi drives one CPU to 100% while the other CPUs are idle - no I/O wait, no system time etc.

---cut here---
control1:/var/log # top
top - 10:51:35 up 8 days, 18:16, 2 users, load average: 2,17, 1,65, 1,48
Tasks: 383 total, 2 running, 381 sleeping, 0 stopped, 0 zombie
%Cpu0 : 31,7 us, 2,9 sy, 0,0 ni, 65,0 id, 0,3 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 : 13,1 us, 0,7 sy, 0,0 ni, 86,2 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 17,2 us, 0,7 sy, 0,0 ni, 81,2 id, 1,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 69,4 us, 12,6 sy, 0,0 ni, 17,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 14,6 us, 1,0 sy, 0,0 ni, 84,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 16,9 us, 0,7 sy, 0,0 ni, 81,7 id, 0,7 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 17,3 us, 1,3 sy, 0,0 ni, 81,0 id, 0,3 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu7 : 21,2 us, 1,3 sy, 0,0 ni, 77,5 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 65943260 total, 62907676 used, 3035584 free, 1708 buffers
KiB Swap: 2103292 total, 0 used, 2103292 free. 53438560 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 6776 wwwrun 20 0 565212 184504 13352 S 100,3 0,280 0:07.83 httpd-prefork
 1130 root 20 0 399456 35760 22508 S 5,980 0,054 818:13.17 X
 1558 sddm 20 0 922744 130440 72148 S 5,316 0,198 966:03.82 sddm-greeter
20999 nova 20 0 285888 116292 5696 S 2,658 0,176 164:27.08 nova-conductor
21030 nova 20 0 758752 182644 16512 S 2,658 0,277 58:20.40 nova-api
18757 heat 20 0 273912 73740 4612 S 2,326 0,112 50:48.72 heat-engine
18759 heat 20 0 273912 73688 4612 S 2,326 0,112 4:27.54 heat-engine
20995 nova 20 0 286236 116644 5696 S 2,326 0,177 164:38.89 nova-conductor
21027 nova 20 0 756204 180752 16980 S 2,326 0,274 58:20.09 nova-api
21029 nova 20 0 756536 180644 16496 S 2,326 0,274 139:46.29 nova-api
21031 nova 20 0 756888 180920 16512 S 2,326 0,274 58:36.37 nova-api
24771 glance 20 0 2312152 139000 17360 S 2,326 0,211 24:47.83 glance-api
24772 glance 20 0 631672 111248 4848 S 2,326 0,169 22:59.77 glance-api
28424 cinder 20 0 720972 108536 4968 S 2,326 0,165 28:31.42 cinder-api
28758 neutron 20 0 317708 101812 4472 S 2,326 0,154 153:45.55 neutron-server

#####################################

control1:/var/log # vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 0 0 2253144 1708 53440472 0 0 4 60 4 4 11 1 88 0 0
 0 0 0 2255588 1708 53440476 0 0 0 568 3063 7627 15 1 83 0 0
 1 0 0 2247596 1708 53440476 0 0 0 144 3066 6803 14 2 83 0 0
 1 0 0 2156008 1708 53440476 0 0 0 72 3474 7193 25 3 72 0 0
 2 0 0 2131968 1708 53440484 0 0 0 652 3497 8565 28 2 70 0 0
 3 1 0 2134000 1708 53440512 0 0 0 14340 3629 10644 25 2 71 2 0
 2 0 0 2136956 1708 53440580 0 0 0 12 3483 10620 25 2 70 3 0
 9 1 0 2138164 1708 53440596 0 0 0 248 3442 9980 27 1 72 0 0
 4 0 0 2105160 1708 53440628 0 0 0 428 3617 22791 27 2 70 1 0
 3 0 0 2093416 1708 53440644 0 0 0 1216 3502 7917 25 2 72 1 0
 3 0 0 2096344 1708 53440636 0 0 0 72 3555 9216 25 2 73 0 0
 6 0 0 2073564 1708 53440636 0 0 0 72 3587 11160 28 2 70 0 0
 2 0 0 2070236 1708 53440636 0 0 0 432 3854 8160 26 4 70 0 0
 1 0 0 2103628 1708 53440640 0 0 0 76 3407 7492 25 3 73 0 0
 3 0 0 2100320 1708 53440636 0 0 0 1384 3383 7955 24 2 73 1 0
 3 0 0 2100648 1708 53440644 0 0 0 216 3571 12276 33 2 65 0 0
 1 0 0 2091808 1708 53440672 0 0 0 72 3734 12848 36 2 62 0 0
 1 0 0 2092400 1708 53440700 0 0 0 0 4029 16556 23 1 77 0 0
 1 0 0 2092400 1708 53440696 0 0 0 72 2938 6621 14 1 85 0 0
 0 0 0 2092080 1708 53440696 0 0 0 704 3067 6958 15 1 83 1 0
 0 0 0 2092048 1708 53440696 0 0 0 360 3222 7279 14 1 85 1 0

---cut here---

In the current test, switching to "Volumes" took 7 secs to load http://control1/project/volumes/. Switching back to "Instances" took 14.5 secs for http://control1/project/instances/. Here is the output from /var/log/apache2/openstack-dashboard-access_log; we added the request duration to the log format.

---cut here---
[01/Jun/2016:09:50:13 +0200] "GET /project/volumes/ HTTP/1.1" 200 42683 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 6905 ms
[01/Jun/2016:09:50:20 +0200] "GET /static/dashboard/css/b66201fdb62c.css HTTP/1.1" 200 856329 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 25 ms
[01/Jun/2016:09:50:20 +0200] "GET /static/dashboard/js/63c5c82dd695.js HTTP/1.1" 200 1194379 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 270 ms
[01/Jun/2016:09:50:20 +0200] "GET /i18n/js/horizon+openstack_dashboard/ HTTP/1.1" 200 60400 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 1622 ms
[01/Jun/2016:09:50:38 +0200] "GET /project/instances/ HTTP/1.1" 200 43547 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 14576 ms
[01/Jun/2016:09:50:53 +0200] "GET /static/dashboard/css/b66201fdb62c.css HTTP/1.1" 200 856329 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 22 ms
[01/Jun/2016:09:50:53 +0200] "GET /static/dashboard/js/63c5c82dd695.js HTTP/1.1" 200 1194379 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 32 ms
[01/Jun/2016:09:50:53 +0200] "GET /i18n/js/horizon+openstack_dashboard/ HTTP/1.1" 200 60400 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 46 ms
---cut here---
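
The thread does not show how the "Duration" field was added to the access log. One common way (an assumption on our part, not the reporter's confirmed setup) is Apache's %D format specifier from mod_log_config:

```apache
# %D = time taken to serve the request, in microseconds (mod_log_config).
# The "dashboard_timed" nickname is illustrative, not from the bug report.
LogFormat "%t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" Duration %D us" dashboard_timed
CustomLog /var/log/apache2/openstack-dashboard-access_log dashboard_timed
```

Note that %D reports microseconds, while the excerpt above shows milliseconds, so the reporter's actual format evidently differed in detail.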

We did not see this with previous releases, we used Liberty before and upgraded to Mitaka.
Just for reference, we're running the following version of Horizon:

--- cut here ---
control1:/etc/apache2/conf.d # rpm -qi openstack-dashboard-9.0.1~a0~dev12-2.1.noarch
Name : openstack-dashboard
Version : 9.0.1~a0~dev12
Release : 2.1
Architecture: noarch
Install Date: Tue May 10 12:39:04 2016
Group : Development/Languages/Python
Size : 51400386
License : Apache-2.0
Signature : RSA/SHA256, Mon May 9 11:07:15 2016, Key ID 893a90dad85f9316
Source RPM : openstack-dashboard-9.0.1~a0~dev12-2.1.src.rpm
Build Date : Mon May 9 11:06:31 2016
Build Host : build81
Relocations : (not relocatable)
Vendor : obs://build.opensuse.org/Cloud:OpenStack
URL : http://wiki.openstack.org/OpenStackDashboard
--- cut here ---

Rob Cresswell (robcresswell) wrote :

Do you have any logs indicating how long the service requests etc. are taking? It's going to be fairly difficult for us to narrow down the cause otherwise if we are just told that a page load takes some time, as there is obviously a lot that occurs within that page.

Changed in horizon:
status: New → Incomplete
eblock@nde.ag (eblock) wrote :

Of course - I just had to reproduce it, as I had already made changes to the debug settings etc. Please find attached the openstack-dashboard-error_log from the control node. The excerpt is from switching from the volumes to the instances view.

eblock@nde.ag (eblock) wrote :

Just another comment: I think the relevant pid is [pid 3887]; I simply attached the whole debug output for reference.

eblock@nde.ag (eblock) wrote :

This bug is marked for expiration, but it seems that nothing is happening here. I added some logs as requested - are they sufficient to change the status from Incomplete to Confirmed, or any other status?

I have more details now. First, the settings described in http://www.symantec.com/connect/blogs/domain-support-horizon-here (I was trying to use multi-domain support) helped to improve the performance significantly; specifically, it's the cache and session engine settings:

---cut here---
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    },
}
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
---cut here---
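
For illustration only - a toy in-process stand-in, not Django's or memcached's actual API - the reason these settings help is that repeated lookups hit the cache instead of re-running the expensive work each time:

```python
from functools import wraps

def cached(store):
    """Memoize results in `store` (a dict standing in for memcached)."""
    def deco(fn):
        @wraps(fn)
        def wrapper(key):
            if key not in store:          # cache miss: do the expensive work once
                store[key] = fn(key)
            return store[key]             # cache hit: skip the expensive path
        return wrapper
    return deco

calls = {"n": 0}
cache = {}

@cached(cache)
def validate_token(token_id):
    # Stand-in for an expensive validation round-trip to the identity service.
    calls["n"] += 1
    return {"token": token_id, "valid": True}

validate_token("abc")
validate_token("abc")
print(calls["n"])  # the expensive path ran only once
```

With no shared cache configured, every dashboard request pays the full cost again, which matches the symptom of one mod_wsgi worker pegging a CPU core on each page load.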

Second, this improvement only helps if I'm navigating within the project panel (http://control/project/instances/). If I'm admin and want to navigate within the admin panel (http://control/admin/instances/) to see all resources across the existing projects, I still have really long waiting times; the highest value was 43 seconds to load a list of 19 instances, of which only 9 are running. Loading all volumes took about 19 seconds.

Is there anything else I can provide to solve this issue?

Matthias Runge (mrunge) wrote :

How is your keystone set up? Are you using admin/overview? Make sure your DNS works; ideally, insert your controllers into /etc/hosts to reduce latency here.

eblock@nde.ag (eblock) wrote :

I added the controller into /etc/hosts right from the start, DNS works, too. My keystone.conf:

---cut here---
[DEFAULT]
log_dir = /var/log/keystone
debug = true
[assignment]
driver = sql
[catalog]
driver = sql
[database]
connection = mysql+pymysql://keystone:password@control/keystone
[fernet_tokens]
key_repository = /etc/keystone/fernet-keys/
max_active_keys = 3
[identity]
domain_specific_drivers_enabled = true
domain_configurations_from_database = true
driver = sql
[memcache]
servers = localhost:11211
[resource]
driver = sql
[revoke]
driver = sql
[role]
driver = sql
[token]
expiration = 14400
provider = fernet
---cut here---

Please note that I use multi-domain support: "real users" are included via LDAP, while the OpenStack service users are in the SQL database. But we had the performance issue right from the start when switching to Mitaka; at that point we didn't use LDAP, and we used UUID tokens, not Fernet. Basically, I upgraded the working Liberty environment to Mitaka and then started changing the configuration for keystone etc.

Please let me know if you need any other information.

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Dashboard (Horizon) because there has been no activity for 60 days.]

Changed in horizon:
status: Incomplete → Expired
eblock@nde.ag (eblock) wrote :

We were finally able to take a closer look into this issue, and we found a bug in the install guide for openSUSE/SLES [1]. But it's not only the missing memcached configuration (i.e. changing the CACHES LOCATION in the dashboard section of the install guide).
In our Liberty/Mitaka keystone.conf we used these memcache settings:

---cut here---
[memcache]

# Memcache servers in the format of "host:port". (list value)
servers = localhost:11211
---cut here---

and the default memcached settings:

---cut here---
control1:~ # cat /etc/sysconfig/memcached
## Path: Network/WWW/Memcached
## Description: start parameters for memcached.
## Type: string
## Default: "-l 127.0.0.1"
## Config: memcached
#
# start parameters for memcached.
#
# see man 1 memcached for more
#
MEMCACHED_PARAMS="-l 127.0.0.1"
---cut here---

So our setting was not wrong, as the option was not yet deprecated in Mitaka:

---cut here---
# This option is deprecated for removal since O.
# Its value may be silently ignored in the future.
# Reason: This option has been deprecated in the O release and will be removed
# in the P release. Use oslo.cache instead.
---cut here---

But as soon as we switched to oslo.cache and used

memcache_servers = localhost:11211 (or 127.0.0.1:11211)

the performance increased significantly! Instead of loading times of 30 seconds or more (or worse, gateway timeouts for /admin/instances), the pages now load within a few seconds - 8 to 10 seconds for /admin/instances. I still think this is a bug... Anyway, now you know.
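
For reference, the oslo.cache-style keystone configuration this describes would look roughly like the sketch below. The exact option set depends on your release, so verify the names against your keystone's sample config; the `[cache]` section with `memcache_servers` is what replaced the old `[memcache]` block:

```ini
# Sketch of a keystone.conf [cache] section using oslo.cache (verify
# option names against your release's sample configuration).
[cache]
enabled = true
backend = dogpile.cache.memcached
memcache_servers = 127.0.0.1:11211
```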

[1] https://bugs.launchpad.net/openstack-manuals/+bug/1682439

eblock@nde.ag (eblock) on 2017-04-13
Changed in horizon:
status: Expired → New
Akihiro Motoki (amotoki) wrote :

Is it a keystone issue now?

Akihiro Motoki (amotoki) wrote :

After reading through the comments above, this looks related to keystone, not horizon. Changing the project.

affects: horizon → keystone
eblock@nde.ag (eblock) wrote :

Thanks for looking into it, I believe you're right about changing it to keystone.

Lance Bragstad (lbragstad) wrote :

It looks like the configuration in comment #8 mitigated the performance issues. Is there anything else needed from a keystone perspective here? The [memcache] section was listed as deprecated in the Mitaka release notes [0].

[0] https://docs.openstack.org/releasenotes/keystone/mitaka.html#deprecation-notes

Lance Bragstad (lbragstad) wrote :

Marking this as invalid since it's been a while without an update. Please feel free to reopen this report if the issue resurfaces.

Changed in keystone:
status: New → Invalid
eblock@nde.ag (eblock) wrote :

I changed the status back to New because the docs for Ocata still don't contain the required information to configure memcached correctly; please see comment #8.

Changed in keystone:
status: Invalid → New
eblock@nde.ag (eblock) wrote :

I just noticed that there was already a fix for the obs guide, closing bug 1682439 [0]. I don't see the changes in the Ocata install guide, which is why I changed the status of this bug. Aren't these fixes applied to older releases?

[0] https://bugs.launchpad.net/openstack-manuals/+bug/1682439

Lance Bragstad (lbragstad) wrote :

I'm not sure how backports are applied in the case of the openstack-manuals project. That'd be a good question for Alex.

Incoming me!

Backports are applied manually. We use a backport line (see: https://docs.openstack.org/contributor-guide/additional-git-workflow/backport.html) to track that work.

I apologise, because it is my fault that this has not been backported. We do rely quite heavily on the "Milestone" function (up the top) to determine what release the bug is applicable to.

I have now backported to Newton and Ocata: https://review.openstack.org/#/c/486943/ and https://review.openstack.org/#/c/486942/

Morgan Fainberg (mdrnstm) wrote :

I am marking this bug closed as the two patches in #17 have merged (inc. the backport).

Changed in keystone:
status: New → Fix Released
importance: Undecided → Medium