Mitaka: dashboard performance

Bug #1587777 reported by eblock@nde.ag on 2016-06-01
This bug affects 1 person
Affects: OpenStack Identity (keystone)
Importance: Medium
Assigned to: Unassigned

Bug Description

Environment: OpenStack Mitaka on top of Leap 42.1; 1 control node, 2 compute nodes, 3-node Ceph cluster.

Issue: Since switching to Mitaka, we're experiencing severe delays when accessing the dashboard - e.g. switching between "Compute - Overview" and "Compute - Instances" takes 15+ seconds, even after multiple invocations.

Steps to reproduce:
1. Install OpenStack Mitaka, including the dashboard, and navigate through the dashboard.

Expected result:
Browsing through the dashboard with reasonable waiting times.

Actual result:
Refreshing the dashboard can take up to 30 secs; switching between views (e.g. volumes to instances) takes about 15 secs on average.

Additional information:
I've had a look at the requests, the Apache logs and our control node's stats and noticed that it's a single call that's taking all the time. I see no indication of any error; it seems that once WSGI is invoked, that call simply takes its time. Intermediate curl requests are logged, so I can see it's doing its work. Looking at "vmstat" I can see that it's user space taking all the load: Apache/mod_wsgi drives one CPU to 100% while the other CPUs are idle - no I/O wait, no system time etc.

---cut here---
control1:/var/log # top
top - 10:51:35 up 8 days, 18:16, 2 users, load average: 2,17, 1,65, 1,48
Tasks: 383 total, 2 running, 381 sleeping, 0 stopped, 0 zombie
%Cpu0 : 31,7 us, 2,9 sy, 0,0 ni, 65,0 id, 0,3 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu1 : 13,1 us, 0,7 sy, 0,0 ni, 86,2 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 17,2 us, 0,7 sy, 0,0 ni, 81,2 id, 1,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 69,4 us, 12,6 sy, 0,0 ni, 17,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 14,6 us, 1,0 sy, 0,0 ni, 84,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 16,9 us, 0,7 sy, 0,0 ni, 81,7 id, 0,7 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 17,3 us, 1,3 sy, 0,0 ni, 81,0 id, 0,3 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu7 : 21,2 us, 1,3 sy, 0,0 ni, 77,5 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 65943260 total, 62907676 used, 3035584 free, 1708 buffers
KiB Swap: 2103292 total, 0 used, 2103292 free. 53438560 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 6776 wwwrun 20 0 565212 184504 13352 S 100,3 0,280 0:07.83 httpd-prefork
 1130 root 20 0 399456 35760 22508 S 5,980 0,054 818:13.17 X
 1558 sddm 20 0 922744 130440 72148 S 5,316 0,198 966:03.82 sddm-greeter
20999 nova 20 0 285888 116292 5696 S 2,658 0,176 164:27.08 nova-conductor
21030 nova 20 0 758752 182644 16512 S 2,658 0,277 58:20.40 nova-api
18757 heat 20 0 273912 73740 4612 S 2,326 0,112 50:48.72 heat-engine
18759 heat 20 0 273912 73688 4612 S 2,326 0,112 4:27.54 heat-engine
20995 nova 20 0 286236 116644 5696 S 2,326 0,177 164:38.89 nova-conductor
21027 nova 20 0 756204 180752 16980 S 2,326 0,274 58:20.09 nova-api
21029 nova 20 0 756536 180644 16496 S 2,326 0,274 139:46.29 nova-api
21031 nova 20 0 756888 180920 16512 S 2,326 0,274 58:36.37 nova-api
24771 glance 20 0 2312152 139000 17360 S 2,326 0,211 24:47.83 glance-api
24772 glance 20 0 631672 111248 4848 S 2,326 0,169 22:59.77 glance-api
28424 cinder 20 0 720972 108536 4968 S 2,326 0,165 28:31.42 cinder-api
28758 neutron 20 0 317708 101812 4472 S 2,326 0,154 153:45.55 neutron-server

#####################################

control1:/var/log # vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 0 0 2253144 1708 53440472 0 0 4 60 4 4 11 1 88 0 0
 0 0 0 2255588 1708 53440476 0 0 0 568 3063 7627 15 1 83 0 0
 1 0 0 2247596 1708 53440476 0 0 0 144 3066 6803 14 2 83 0 0
 1 0 0 2156008 1708 53440476 0 0 0 72 3474 7193 25 3 72 0 0
 2 0 0 2131968 1708 53440484 0 0 0 652 3497 8565 28 2 70 0 0
 3 1 0 2134000 1708 53440512 0 0 0 14340 3629 10644 25 2 71 2 0
 2 0 0 2136956 1708 53440580 0 0 0 12 3483 10620 25 2 70 3 0
 9 1 0 2138164 1708 53440596 0 0 0 248 3442 9980 27 1 72 0 0
 4 0 0 2105160 1708 53440628 0 0 0 428 3617 22791 27 2 70 1 0
 3 0 0 2093416 1708 53440644 0 0 0 1216 3502 7917 25 2 72 1 0
 3 0 0 2096344 1708 53440636 0 0 0 72 3555 9216 25 2 73 0 0
 6 0 0 2073564 1708 53440636 0 0 0 72 3587 11160 28 2 70 0 0
 2 0 0 2070236 1708 53440636 0 0 0 432 3854 8160 26 4 70 0 0
 1 0 0 2103628 1708 53440640 0 0 0 76 3407 7492 25 3 73 0 0
 3 0 0 2100320 1708 53440636 0 0 0 1384 3383 7955 24 2 73 1 0
 3 0 0 2100648 1708 53440644 0 0 0 216 3571 12276 33 2 65 0 0
 1 0 0 2091808 1708 53440672 0 0 0 72 3734 12848 36 2 62 0 0
 1 0 0 2092400 1708 53440700 0 0 0 0 4029 16556 23 1 77 0 0
 1 0 0 2092400 1708 53440696 0 0 0 72 2938 6621 14 1 85 0 0
 0 0 0 2092080 1708 53440696 0 0 0 704 3067 6958 15 1 83 1 0
 0 0 0 2092048 1708 53440696 0 0 0 360 3222 7279 14 1 85 1 0

---cut here---

In the current test, switching to "Volumes" took 7 secs to load http://control1/project/volumes/. Switching back to "Instances" took 14.5 secs for http://control1/project/instances/. Here is the output from /var/log/apache2/openstack-dashboard-access_log; we added the request duration to the log format.

---cut here---
[01/Jun/2016:09:50:13 +0200] "GET /project/volumes/ HTTP/1.1" 200 42683 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 6905 ms
[01/Jun/2016:09:50:20 +0200] "GET /static/dashboard/css/b66201fdb62c.css HTTP/1.1" 200 856329 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 25 ms
[01/Jun/2016:09:50:20 +0200] "GET /static/dashboard/js/63c5c82dd695.js HTTP/1.1" 200 1194379 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 270 ms
[01/Jun/2016:09:50:20 +0200] "GET /i18n/js/horizon+openstack_dashboard/ HTTP/1.1" 200 60400 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 1622 ms
[01/Jun/2016:09:50:38 +0200] "GET /project/instances/ HTTP/1.1" 200 43547 "http://control1/project/volumes/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 14576 ms
[01/Jun/2016:09:50:53 +0200] "GET /static/dashboard/css/b66201fdb62c.css HTTP/1.1" 200 856329 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 22 ms
[01/Jun/2016:09:50:53 +0200] "GET /static/dashboard/js/63c5c82dd695.js HTTP/1.1" 200 1194379 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 32 ms
[01/Jun/2016:09:50:53 +0200] "GET /i18n/js/horizon+openstack_dashboard/ HTTP/1.1" 200 60400 "http://control1/project/instances/" "Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0" Duration 46 ms
---cut here---
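
The thread does not show how the "Duration" field was added to the access log. One common way (an assumption on our part, not the reporter's confirmed setup) is Apache's %D format specifier from mod_log_config:

```apache
# %D = time taken to serve the request, in microseconds (mod_log_config).
# The "dashboard_timed" nickname is illustrative, not from the bug report.
LogFormat "%t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" Duration %D us" dashboard_timed
CustomLog /var/log/apache2/openstack-dashboard-access_log dashboard_timed
```

Note that %D reports microseconds, while the excerpt above shows milliseconds, so the reporter's actual format evidently differed in detail.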

We did not see this with previous releases, we used Liberty before and upgraded to Mitaka.
Just for reference, we're running the following version of Horizon:

--- cut here ---
control1:/etc/apache2/conf.d # rpm -qi openstack-dashboard-9.0.1~a0~dev12-2.1.noarch
Name : openstack-dashboard
Version : 9.0.1~a0~dev12
Release : 2.1
Architecture: noarch
Install Date: Tue May 10 12:39:04 2016
Group : Development/Languages/Python
Size : 51400386
License : Apache-2.0
Signature : RSA/SHA256, Mon May 9 11:07:15 2016, Key ID 893a90dad85f9316
Source RPM : openstack-dashboard-9.0.1~a0~dev12-2.1.src.rpm
Build Date : Mon May 9 11:06:31 2016
Build Host : build81
Relocations : (not relocatable)
Vendor : obs://build.opensuse.org/Cloud:OpenStack
URL : http://wiki.openstack.org/OpenStackDashboard
--- cut here ---

Rob Cresswell (robcresswell) wrote :

Do you have any logs indicating how long the service requests etc. are taking? It's going to be fairly difficult for us to narrow down the cause otherwise if we are just told that a page load takes some time, as there is obviously a lot that occurs within that page.

Changed in horizon:
status: New → Incomplete
eblock@nde.ag (eblock) wrote :

Of course - I just had to reproduce it, as I had already made changes to the debug settings etc. Please find attached the openstack-dashboard-error_log from the control node. The excerpt is from switching from the volumes to the instances view.

eblock@nde.ag (eblock) wrote :

Just another comment: I think the relevant pid is [pid 3887]; I simply attached the whole debug output for reference.

eblock@nde.ag (eblock) wrote :

This bug is marked for expiration, but it seems that nothing is happening here. I added some logs as requested - are they sufficient to change the status from Incomplete to Confirmed, or any other status?

I have more details now. First, the settings described in http://www.symantec.com/connect/blogs/domain-support-horizon-here (I was trying to use multi-domain support) helped to improve the performance significantly; specifically, it's the cache and session engine settings:

---cut here---
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    },
}
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
---cut here---
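
For illustration only - a toy in-process stand-in, not Django's or memcached's actual API - the reason these settings help is that repeated lookups hit the cache instead of re-running the expensive work each time:

```python
from functools import wraps

def cached(store):
    """Memoize results in `store` (a dict standing in for memcached)."""
    def deco(fn):
        @wraps(fn)
        def wrapper(key):
            if key not in store:          # cache miss: do the expensive work once
                store[key] = fn(key)
            return store[key]             # cache hit: skip the expensive path
        return wrapper
    return deco

calls = {"n": 0}
cache = {}

@cached(cache)
def validate_token(token_id):
    # Stand-in for an expensive validation round-trip to the identity service.
    calls["n"] += 1
    return {"token": token_id, "valid": True}

validate_token("abc")
validate_token("abc")
print(calls["n"])  # the expensive path ran only once
```

With no shared cache configured, every dashboard request pays the full cost again, which matches the symptom of one mod_wsgi worker pegging a CPU core on each page load.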

Second, this improvement only helps if I'm navigating within the project panel (http://control/project/instances/). If I'm admin and want to navigate within the admin panel (http://control/admin/instances/) to see all resources across the existing projects, I still have really long waiting times; the highest value was 43 seconds to load a list of 19 instances, of which only 9 are running. Loading all volumes took about 19 seconds.

Is there anything else I can provide to solve this issue?

Matthias Runge (mrunge) wrote :

How is your keystone set up? Are you using admin/overview? Make sure your DNS works; ideally, insert your controllers into /etc/hosts to reduce latency here.

eblock@nde.ag (eblock) wrote :

I added the controller into /etc/hosts right from the start, DNS works, too. My keystone.conf:

---cut here---
[DEFAULT]
log_dir = /var/log/keystone
debug = true
[assignment]
driver = sql
[catalog]
driver = sql
[database]
connection = mysql+pymysql://keystone:password@control/keystone
[fernet_tokens]
key_repository = /etc/keystone/fernet-keys/
max_active_keys = 3
[identity]
domain_specific_drivers_enabled = true
domain_configurations_from_database = true
driver = sql
[memcache]
servers = localhost:11211
[resource]
driver = sql
[revoke]
driver = sql
[role]
driver = sql
[token]
expiration = 14400
provider = fernet
---cut here---

Please note that I use multi-domain support: "real users" are included via LDAP, while the OpenStack service users are in the SQL database. But we had the performance issue right from the start when switching to Mitaka; at that point we didn't use LDAP, and we used UUID tokens, not Fernet. Basically, I upgraded the working Liberty environment to Mitaka and then started changing the configuration for keystone etc.

Please let me know if you need any other information.

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Dashboard (Horizon) because there has been no activity for 60 days.]

Changed in horizon:
status: Incomplete → Expired
eblock@nde.ag (eblock) wrote :

We were finally able to take a closer look into this issue, and we found a bug in the install guide for openSUSE/SLES [1]. But it's not only the missing memcached configuration (i.e. changing the CACHES LOCATION in the dashboard section of the install guide).
In our Liberty/Mitaka keystone.conf we used these memcache settings:

---cut here---
[memcache]

# Memcache servers in the format of "host:port". (list value)
servers = localhost:11211
---cut here---

and the default memcached settings:

---cut here---
control1:~ # cat /etc/sysconfig/memcached
## Path: Network/WWW/Memcached
## Description: start parameters for memcached.
## Type: string
## Default: "-l 127.0.0.1"
## Config: memcached
#
# start parameters for memcached.
#
# see man 1 memcached for more
#
MEMCACHED_PARAMS="-l 127.0.0.1"
---cut here---

So our setting was not wrong, as the option was not yet deprecated in Mitaka:

---cut here---
# This option is deprecated for removal since O.
# Its value may be silently ignored in the future.
# Reason: This option has been deprecated in the O release and will be removed
# in the P release. Use oslo.cache instead.
---cut here---

But as soon as we switched to oslo.cache and used

memcache_servers = localhost:11211 (or 127.0.0.1:11211)

the performance increased significantly! Instead of loading times of 30 seconds or more (or worse, gateway timeouts for /admin/instances), the pages now load within a few seconds - 8 to 10 seconds for /admin/instances. I still think this is a bug... Anyway, now you know.
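
For reference, the oslo.cache-style keystone configuration this describes would look roughly like the sketch below. The exact option set depends on your release, so verify the names against your keystone's sample config; the `[cache]` section with `memcache_servers` is what replaced the old `[memcache]` block:

```ini
# Sketch of a keystone.conf [cache] section using oslo.cache (verify
# option names against your release's sample configuration).
[cache]
enabled = true
backend = dogpile.cache.memcached
memcache_servers = 127.0.0.1:11211
```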

[1] https://bugs.launchpad.net/openstack-manuals/+bug/1682439

eblock@nde.ag (eblock) on 2017-04-13
Changed in horizon:
status: Expired → New
Akihiro Motoki (amotoki) wrote :

Is it a keystone issue now?

Akihiro Motoki (amotoki) wrote :

After reading through the comments above, this looks related to keystone, not horizon. Changing the project.

affects: horizon → keystone
eblock@nde.ag (eblock) wrote :

Thanks for looking into it, I believe you're right about changing it to keystone.

Lance Bragstad (lbragstad) wrote :

It looks like the configuration in comment #8 mitigated the performance issues. Is there anything else needed from a keystone perspective here? The [memcache] section was listed as deprecated in the Mitaka release notes [0].

[0] https://docs.openstack.org/releasenotes/keystone/mitaka.html#deprecation-notes

Lance Bragstad (lbragstad) wrote :

Marking this as invalid since it's been a while without an update. Please feel free to reopen this report if the issue resurfaces.

Changed in keystone:
status: New → Invalid
eblock@nde.ag (eblock) wrote :

I changed the status back to New because the docs for Ocata still don't contain the required information to configure memcached correctly; please see comment #8.

Changed in keystone:
status: Invalid → New
eblock@nde.ag (eblock) wrote :

I just noticed that there was already a fix for the obs guide, closing bug 1682439 [0]. I don't see the changes in the Ocata install guide, which is why I changed the status of this bug. Aren't these fixes applied to older releases?

[0] https://bugs.launchpad.net/openstack-manuals/+bug/1682439

Lance Bragstad (lbragstad) wrote :

I'm not sure how backports are applied in the case of the openstack-manuals project. That'd be a good question for Alex.

Incoming me!

Backports are applied manually. We use a backport line (see: https://docs.openstack.org/contributor-guide/additional-git-workflow/backport.html) to track that work.

I apologise, because it is my fault that this has not been backported. We do rely quite heavily on the "Milestone" function (up the top) to determine what release the bug is applicable to.

I have now backported to Newton and Ocata: https://review.openstack.org/#/c/486943/ and https://review.openstack.org/#/c/486942/

Morgan Fainberg (mdrnstm) wrote :

I am marking this bug closed as the two patches in #17 have merged (inc. the backport).

Changed in keystone:
status: New → Fix Released
importance: Undecided → Medium