Nova becomes slow after a controller shutdown because one of the memcached servers is unavailable

Bug #1498867 reported by Artem Panchenko
Affects: Mirantis OpenStack (status tracked in 10.0.x)

  Series   Status          Importance   Assigned to
  10.0.x   Fix Committed   Medium       MOS Nova
  8.0.x    Won't Fix       Medium       MOS Nova
  9.x      Fix Committed   Medium       MOS Nova

Bug Description

Fuel version info (7.0 build #298): http://paste.openstack.org/show/473444/

If some of the memcached instances used by Nova are unavailable, all Nova API operations take more than twice as long and both Horizon and the CLI become slow. For example, here are the results of Rally tests on a bare-metal lab:

# All (3) controller nodes online

+----------------------------------------------------------------------------------------+
| Response Times (sec) |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
| action | min | median | 90%ile | 95%ile | max | avg | success | count |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
| nova.list_servers | 0.239 | 0.283 | 0.319 | 0.361 | 1.084 | 0.297 | 100.0% | 601 |
| total | 0.24 | 0.283 | 0.319 | 0.361 | 1.084 | 0.297 | 100.0% | 601 |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
Load duration: 60.2184360027
Full duration: 70.9516170025

Opening the 'Instances' tab in Horizon takes ~2 seconds (1 instance is running).

# 1 controller node is down, 2 controllers are online

+----------------------------------------------------------------------------------------+
| Response Times (sec) |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
| action | min | median | 90%ile | 95%ile | max | avg | success | count |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
| nova.list_servers | 0.228 | 0.356 | 3.257 | 3.285 | 6.003 | 1.421 | 100.0% | 121 |
| total | 0.228 | 0.356 | 3.257 | 3.285 | 6.003 | 1.422 | 100.0% | 121 |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
Load duration: 60.1724669933
Full duration: 104.123668909

Opening the 'Instances' tab in Horizon takes ~12 seconds (1 instance is running). I extended logging in Apache and it shows that request handling took >8 seconds:

192.168.0.2 - - [22/Sep/2015:17:24:15 +0000] "GET /horizon/admin/instances/ HTTP/1.1" 200 4851 "https://172.16.39.201/horizon/admin/volumes/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36" **8/8206028**

# 1 controller node is down, memcached usage is disabled for the Nova services (I commented out the 'memcached_servers' setting in nova.conf on the alive controllers and restarted the Nova services)

+----------------------------------------------------------------------------------------+
| Response Times (sec) |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
| action | min | median | 90%ile | 95%ile | max | avg | success | count |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
| nova.list_servers | 0.217 | 0.253 | 0.55 | 1.277 | 2.669 | 0.39 | 100.0% | 460 |
| total | 0.217 | 0.253 | 0.55 | 1.277 | 2.669 | 0.391 | 100.0% | 460 |
+-------------------+-------+--------+--------+--------+-------+-------+---------+-------+
Load duration: 61.1688148975
Full duration: 89.9979431629

Opening the 'Instances' tab in Horizon takes ~6 seconds (1 instance is running):

192.168.0.2 - - [22/Sep/2015:22:44:08 +0000] "GET /horizon/admin/instances/ HTTP/1.1" 200 4858 "https://172.16.39.201/horizon/auth/login/?next=/horizon/admin/instances/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0" **5/5374409**

As you can see, the cloud API performance degradation is significant and it is not caused by Keystone (Keystone also works a little bit slower, but that looks like expected behaviour, see bug #1405549). I tried playing with different memcached options in nova.conf (e.g. memcache_pool_dead_retry and memcache_pool_socket_timeout) and the test results were the same, but maybe I missed something.
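
For reference, the memcached-related knobs mentioned above look roughly like this in nova.conf (a sketch with illustrative values only; which section the *_pool_* options belong to depends on the Nova/keystonemiddleware versions shipped with MOS 7.0):

memcached_servers = 192.168.0.5:11211,192.168.0.7:11211,192.168.0.11:11211
# options tuned while trying to work around the slowdown (values are examples):
memcache_pool_dead_retry = 30       # how long (seconds) a failed memcached host is skipped
memcache_pool_socket_timeout = 3    # per-connection socket timeout, in seconds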

Steps to reproduce:

1. Deploy cloud with 3 controllers
2. Run benchmark tests for Nova API
3. Shutdown 1 of controllers
4. Run benchmark tests for Nova API
5. Compare benchmark tests results

Expected result: there is no performance degradation after the controller shutdown, or the degradation is lower than 33%.

Actual result: there is an ~80% degradation.
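(Judging by the average response times above, 1 - 0.297/1.421 ≈ 0.79, i.e. roughly an 80% degradation with one controller down.)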

I also tested a case where 2 of 5 controller nodes are down and the results were even worse: about 90% performance degradation in the Nova API, and execution of the simple CLI command 'nova list' took up to 15 seconds.

Changed in fuel:
milestone: none → 8.0
Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
assignee: MOS Nova (mos-nova) → MOS Keystone (mos-keystone)
tags: added: long-haul-testing
Changed in fuel:
assignee: MOS Keystone (mos-keystone) → MOS Nova (mos-nova)
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Revision history for this message
Michael Korolyov (mkorolyov) wrote :

Summary:
1. Nova doesn't use Memcached in MOS 7.0 – it isn't configured in nova.conf to use memcached
2. Keystone uses a 1-second timeout and a 30-second dead_retry for memcached
3. dead_retry is set to 30 seconds due to Fuel requirements; Fuel doesn't like waiting for 300 seconds at installation time

What needs to be done:
1. Validate that Nova (the keystone client) doesn't use memcached
2. Validate that the long delay (40 sec) comes from interactions with Keystone and isn't internal to the Nova client
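
For reference, the Keystone-side memcache timeouts referred to in item 2 above correspond to cache settings along these lines in keystone.conf (a sketch only; exact option names and section depend on the Keystone/oslo.cache version used in MOS 7.0):

[cache]
memcache_servers = 192.168.0.5:11211,192.168.0.7:11211,192.168.0.11:11211
memcache_socket_timeout = 1   # the 1-second timeout mentioned above
memcache_dead_retry = 30      # the 30-second dead_retry mentioned above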

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

> 1. Nova doesn’t use Memcached in MOS 7.0 – it isn’t configured in nova.conf to use memcached

I'm not sure what memcached usage you mean, but AFAIK it's used by Nova at least for VNC console authentication:

root@node-1:~# grep ^memcache /etc/nova/nova.conf
memcached_servers=192.168.0.5:11211,192.168.0.7:11211,192.168.0.11:11211
root@node-1:~# netstat -atnp | awk '$5~/:11211/{split($7,pid,"/");print pid[1]}' | sort -u | xargs -n1 -i ps -p {} -o user= -o comm= -o pid=
keystone apache2 39076
keystone apache2 39077
keystone apache2 39078
keystone apache2 39079
keystone apache2 39080
keystone apache2 39081
keystone apache2 39082
keystone apache2 39083
keystone apache2 39084
keystone apache2 39085
keystone apache2 39086
keystone apache2 39087
nova nova-consoleaut 5739
nova nova-api 6609
nova nova-api 6610
nova nova-api 6611
nova nova-api 6612
nova nova-api 6613
nova nova-api 6614
nova nova-api 6615
nova nova-api 6616
nova nova-api 6637
nova nova-api 6638
nova nova-api 6639
nova nova-api 6640
nova nova-api 6641
nova nova-api 6642
nova nova-api 6643
nova nova-api 6644

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

No longer fixing Medium bugs in 8.0

affects: fuel → mos
Changed in mos:
milestone: 8.0 → none
milestone: none → 8.0
tags: added: area-nova
removed: area-mos
tags: removed: nova
Changed in mos:
status: Confirmed → Won't Fix
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Should have been fixed by https://review.openstack.org/#/c/274468/. We'll need to double-check that.

Revision history for this message
Michael Korolyov (mkorolyov) wrote :

I'd like to recommend configuring only ONE memcached server per controller node for every client, in this case in nova.conf.
It can just use "localhost" or "127.0.0.1".
If a controller goes down, the other two controllers / Nova services will not be impacted at all.
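
A sketch of what that suggestion would look like in nova.conf on each controller (not a tested configuration):

memcached_servers = 127.0.0.1:11211

The trade-off is that cached data (e.g. console auth entries used by nova-consoleauth) would no longer be shared between controllers, which may matter when requests are load-balanced across nodes.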

Revision history for this message
Dina Belova (dbelova) wrote :

Added the move-to-10.0 tag because the bug was moved from 9.0 to 10.0.

tags: added: move-to-10.0
Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

Please check whether this bug is reproducible on the 9.0 ISO.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I gave it a try on the 9.0 release ISO:

http://paste.openstack.org/show/526959/

^ nova list timings are pretty consistent (I tried turning off 1 of 3, 2 of 3 and 3 of 3 memcached daemons – response time only goes up when there are no memcached servers running at all; other than that there are just short spikes for one or two requests before things go back to normal).
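
For reference, that check boils down to something like the following on a controller node (a sketch; the memcached service name and the location of the credentials file are assumptions):

root@node-1:~# source openrc                                              # admin credentials, as placed by Fuel
root@node-1:~# service memcached stop                                     # take this node's memcached backend down
root@node-1:~# for i in $(seq 1 10); do time nova list > /dev/null; done  # watch for spikes vs. steady-state timings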

This must have been fixed in https://review.openstack.org/#/c/274468/

tags: removed: move-to-10.0
Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

Looks like it is not fixed in 9.0; I will double-check with the original scenario. Currently I'm running tests with 1 controller out of 3 online and no-quorum-policy=ignore set to allow all pcs/crm resources to run. Everything works, but it is significantly (>10x) slower compared to when all controllers are running. I tried removing the offline RabbitMQ hosts from nova.conf – it didn't help. The logs show the following timestamps for a given request:

[dsutyagin@dsutyagin-local var]$ grep req-33ba0db0-6072-429e-9245-9fc20462e260 log/nova/nova-api.log
2016-09-07 07:34:46.839 32065 DEBUG nova.api.openstack.wsgi [req-33ba0db0-6072-429e-9245-9fc20462e260 b30092137b424eeebb34d6e06b669f74 4ebb2779094a406db97e552a8d0f3ae2 - - -] Calling method '<bound method ServersController.detail of <nova.api.openstack.compute.servers.ServersController object at 0x7f6e7e163190>>' _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:699
2016-09-07 07:34:46.840 32065 DEBUG nova.compute.api [req-33ba0db0-6072-429e-9245-9fc20462e260 b30092137b424eeebb34d6e06b669f74 4ebb2779094a406db97e552a8d0f3ae2 - - -] Searching by: {'deleted': False} get_all /usr/lib/python2.7/dist-packages/nova/compute/api.py:2062
2016-09-07 07:34:53.244 32065 DEBUG oslo.cache._memcache_pool [req-33ba0db0-6072-429e-9245-9fc20462e260 b30092137b424eeebb34d6e06b669f74 4ebb2779094a406db97e552a8d0f3ae2 - - -] Memcached pool 140112431041168, thread 140112787171136: Marked host 10.20.124.196:11211 dead until 1473233993.09 _do_log /usr/lib/python2.7/dist-packages/oslo_cache/_memcache_pool.py:116
2016-09-07 07:34:53.245 32065 DEBUG oslo.cache._memcache_pool [req-33ba0db0-6072-429e-9245-9fc20462e260 b30092137b424eeebb34d6e06b669f74 4ebb2779094a406db97e552a8d0f3ae2 - - -] Memcached pool 140112431041168, thread 140112787171136: Marked host 10.20.124.195:11211 dead until 1473233990.09 _do_log /usr/lib/python2.7/dist-packages/oslo_cache/_memcache_pool.py:116
2016-09-07 07:34:53.347 32065 DEBUG nova.policy [req-33ba0db0-6072-429e-9245-9fc20462e260 b30092137b424eeebb34d6e06b669f74 4ebb2779094a406db97e552a8d0f3ae2 - - -] Policy check for os_compute_api:os-hide-server-addresses failed with credentials ...

There are other issues which slow things down as well – a call from Nova to Neutron takes 5 seconds, and Keystone is also slow; I'll look at these too.
But in this log section I can see that Nova starts get_all and then, roughly 7 seconds later (07:34:46.840 → 07:34:53.244), logs the memcached-pool messages, with nothing else in between for this request ID.

Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

I've done the test in 9.0; memcached is not causing the slowdown. However, if I completely power off one of the controllers, a visible latency increase appears. With two controllers out of 3 down, the latency increases even more. I have created a new bug for this issue: https://bugs.launchpad.net/mos/+bug/1621541.
