Keystone become extremely slow after controller shutdown

Bug #1405549 reported by Artem Panchenko
This bug affects 4 people
Affects                 Status         Importance   Assigned to        Milestone
Fuel for OpenStack      Fix Released   High         MOS Keystone
  6.0.x                 Fix Released   High         Bogdan Dobrelya
  6.1.x                 Invalid        High         MOS Maintenance
  7.0.x                 Fix Released   High         MOS Keystone

Bug Description

Fuel version (6.0 #56): http://paste.openstack.org/show/154432/

The problem described in https://bugs.launchpad.net/fuel/+bug/1340657 still exists; I reproduced the issue on a bare-metal lab. Here are my steps:

1. Create HA environment (Ubuntu+NeutronVlan+Ceph+ceilometer).
2. Add 3 controller+mongo and 4 compute+ceph nodes.
3. Deploy changes. Run OSTF. All tests passed.
4. Shut down one of the controllers (or just block traffic to the memcached instance running on it: `iptables -I INPUT -p tcp --dport 11211 -j DROP`).

Expected result:

 - OS services continue to work fine, environment passes health checks

Actual result:

 - lots of OSTF smoke tests fail with the timeout error "AssertionError: Failed to get to expected status. In error state."

When all 3 memcached instances are available, keystone needs 0.1-0.2 seconds to create a new token:

http://paste.openstack.org/show/154430/

and opening the '/horizon/admin/networks/' URL in a browser takes 2-3 seconds.

After destroying one of the controllers, keystone spends >3 seconds creating a token:

http://paste.openstack.org/show/154431/

Loading the '/horizon/admin/networks/' URL in a browser takes 10-15 seconds.

I observed the same behaviour on CentOS clouds.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

I guess we should provide a workaround for 6.0 and document it in the release notes. If one of the controllers in an HA environment is destroyed/deleted, all OS services work slowly, especially the Horizon dashboard, due to delays from keystone. The issue can be fixed manually in the following way (a consolidated sketch follows the steps):

1. determine the IP address assigned to the management interface of the unavailable controller node (it should be listed in /etc/hosts on the alive nodes);
2. remove this address from the list of memcached servers in keystone.conf on each controller (`DOWN_IP="192.168.0.5"; sed -r 's/'$DOWN_IP':11211,?//g' -i.bak /etc/keystone/keystone.conf`);
3. restart the keystone daemon on all controllers (`service openstack-keystone restart || restart keystone`).
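
For convenience, the same workaround as a small Python sketch. It only mirrors the sed one-liner above; the IP address and config path are examples, and keystone still has to be restarted on every controller afterwards:

    # Rough Python equivalent of the sed workaround above (a sketch, not an
    # official tool). Adjust DOWN_IP to the dead controller's management IP.
    import re
    import shutil

    DOWN_IP = "192.168.0.5"
    CONF = "/etc/keystone/keystone.conf"

    shutil.copy(CONF, CONF + ".bak")           # keep a backup, like sed -i.bak

    with open(CONF) as f:
        text = f.read()

    # Drop the dead server from any comma-separated memcached server list.
    text = re.sub(re.escape(DOWN_IP) + r":11211,?", "", text)

    with open(CONF, "w") as f:
        f.write(text)

    # Then on every controller:
    #   service openstack-keystone restart || restart keystone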

tags: added: release-notes
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This test case is somewhat invalid, as dropping packets does not trigger the same networking-subsystem behaviour as the actual loss of a controller. This issue shows that we need to improve the memcached connection pool algorithm.

Changed in mos:
milestone: none → 6.1
assignee: nobody → MOS Keystone (mos-keystone)
importance: Undecided → High
status: New → Triaged
no longer affects: fuel
summary: - Kyestone become extremely slow after controller shutdown
+ Keystone become extremely slow after controller shutdown
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

What is the status of this issue? Do we have plans to take it in progress?

Revision history for this message
Alexander Makarov (amakarov) wrote :

a) We need to check the behaviour of the 'iptables drop' vs 'host down' strategies.
b) A Redis token backend is on the way and will completely replace the memcached backend that is causing this bug.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Does (a) mean that the issue should be elaborated by the QA team?
Could you please also link the blueprint for the Redis token backend? I found only this one: https://blueprints.launchpad.net/fuel/+spec/ceilometer-central-agent-ha, but it does not cover the case in general.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Assigned this bug to the QA team as it requires additional details about which of the 'iptables drop' vs 'host down' testing methods we have to stick with.

Revision history for this message
Alexander Makarov (amakarov) wrote :

My blueprint is https://blueprints.launchpad.net/keystone/+spec/redis-storage-backend
Don't know how to actually _link_ it to the bug :)

Revision history for this message
Alexander Makarov (amakarov) wrote :

Confirmed different behaviour between a host without a memcached server running and a host with the memcached port blocked by iptables:
http://paste.openstack.org/show/172832/

Revision history for this message
Alexander Makarov (amakarov) wrote :

I suggest stopping memcached rather than blocking its port with iptables.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Testing with iptables -j DROP suits the "host down" scenario better, as DROP discards packets silently. If memcached is simply stopped, Linux will usually respond with a TCP RST immediately on connections to the closed port, so you won't get a real socket timeout (the same applies to an iptables -j REJECT rule). In a real scenario the timeout may be even bigger, for example because of ARP.
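
To illustrate the difference, here is a minimal Python check (a sketch; the host below is assumed to be the dead controller's management IP): with -j DROP the connect() call hangs for the full socket timeout, while against a stopped memcached the kernel answers with a TCP RST and the connection is refused almost immediately.

    # Minimal illustration of DROP vs. stopped-service behaviour (a sketch).
    import socket
    import time

    HOST, PORT = "192.168.0.5", 11211   # assumed dead controller, memcached port

    start = time.time()
    try:
        socket.create_connection((HOST, PORT), timeout=3).close()
        print("connected in %.3fs" % (time.time() - start))
    except socket.timeout:
        # typical for `iptables -j DROP`: packets silently discarded
        print("timed out after %.3fs" % (time.time() - start))
    except ConnectionRefusedError:
        # typical for a stopped memcached: immediate TCP RST
        print("refused after %.3fs" % (time.time() - start))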

Revision history for this message
Maksym Strukov (unbelll) wrote :

Env: ubuntu (3 controllers + 3 computes)

Steps:
1. ssh to one of the slave nodes
2. run curl against the management VIP

while :; do curl -s -w "%{http_code} %{time_total}" -i -X POST -H "Content-Type: application/json" \
  -d '{"auth":{"identity":{"methods":["password"],"password":{"user":{"name":"admin","domain":{"id":"default"},"password":"admin"}}}}}' \
  http://10.109.2.2:5000/v3/auth/tokens -o /dev/null; echo; sleep 1; done

sample result:
201 0.108
201 0.110
201 0.093
201 0.102
201 0.103

3. shut down one of the controllers (shutdown -P now)

sample result from curl:
201 1.080
201 0.391
201 0.102
201 0.110
201 0.303
201 0.123
201 0.107
502 0.689
201 0.133
201 0.158
503 3.005
201 0.101
201 0.093
503 40.004
201 0.205
201 0.127
201 3.188
201 3.135
201 0.304
201 0.110
201 3.186
201 0.122
201 3.098
201 0.105
201 0.102
201 0.177
201 0.094
201 3.108
201 3.101
201 3.124
201 3.105
201 3.128
201 0.148
201 0.214
201 0.086
201 0.101
201 0.090
201 0.157
201 0.092
201 3.118
201 0.095
201 3.112
201 3.141
201 3.173

Expected result:
No change in delays if one of the controllers is shut down

Actual result:
Delay increased from 0.1s to 3s (for ~30% of requests)

Revision history for this message
Alexander Makarov (amakarov) wrote :

It is a configuration issue: the 3-second delay is the configured timeout value.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Background:

The memcache connection pool is a queue of memcache clients.
Every memcache client has an array of memcache server connections to shard data across.
Every memcache connection carries status information, such as a 'deaduntil' value indicating the time until which the connection is considered down.
When one of the remote memcache hosts is down, any attempt to connect to it fails with a timeout configured by [cache]memcache_socket_timeout; the corresponding connection is marked with deaduntil = now + [cache]memcache_dead_retry.
Every time a client is popped from the pool, it is populated with the pool's information about dead servers.
Every time a client is returned to the pool, the pool's information about dead servers is updated with the client's.

Problem description:

Clients are not populated with the information that a remote memcache host is down until the first client to experience the problem is returned to the pool, so clients will keep trying to connect to a dead memcache server for roughly twice the socket timeout. A simplified model of this behaviour is sketched below.
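
Here is a simplified, self-contained model of that behaviour. It is only an illustration, not the actual keystone/oslo pool code; the timeout values are the config options mentioned above:

    # Simplified model of the memcache pool behaviour described above (sketch).
    import time
    from collections import deque

    SOCKET_TIMEOUT = 3    # [cache]memcache_socket_timeout
    DEAD_RETRY = 300      # [cache]memcache_dead_retry

    class Client(object):
        def __init__(self):
            self.deaduntil = {}                  # host -> "down until" timestamp

        def connect(self, host, host_is_down):
            if self.deaduntil.get(host, 0) > time.time():
                return "skipped dead host"       # no timeout paid
            if host_is_down:
                time.sleep(SOCKET_TIMEOUT)       # pay the full socket timeout
                self.deaduntil[host] = time.time() + DEAD_RETRY
                return "timed out, marked dead"
            return "ok"

    class Pool(object):
        def __init__(self, size):
            self.dead = {}
            self.clients = deque(Client() for _ in range(size))

        def get(self):
            client = self.clients.popleft()
            client.deaduntil.update(self.dead)   # populated on checkout
            return client

        def put(self, client):
            self.dead.update(client.deaduntil)   # pool learns only on return
            self.clients.append(client)

    # Two clients checked out concurrently both pay the timeout, because the
    # first one has not yet returned its dead-host knowledge to the pool.
    pool = Pool(2)
    a, b = pool.get(), pool.get()
    print(a.connect("192.168.0.5:11211", host_is_down=True))
    print(b.connect("192.168.0.5:11211", host_is_down=True))
    pool.put(a); pool.put(b)
    print(pool.get().connect("192.168.0.5:11211", host_is_down=True))  # skipped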

Revision history for this message
Boris Bobrov (bbobrov) wrote :

One of our suspects is the large number of keystone processes.

The information about dead hosts is not shared between processes. This results in the following workflow:

We have 3 controllers, and each controller has, say, 6 keystone processes: 18 keystone processes overall.

1. A user goes to Horizon, provides their credentials and presses "sign in";
2. Horizon sends ~10 queries to keystone (this can be seen in the Keystone logs), each query depending on the previous one;
3. Query 1 goes to keystone-1;
4. keystone-1 doesn't know yet about the dead memcache host. It probes that host, times out after 3 seconds and marks the host as dead. Overall time spent on sign-in is 3 seconds by now;
5. Horizon receives the reply from keystone and makes another query;
6. Most likely, this request goes to keystone-2. keystone-2, being a separate process, doesn't know yet about the dead memcache host. It probes that host, times out after 3 seconds and marks the host as dead. This adds 3 more seconds to the sign-in time, 6 seconds overall;
7. Horizon receives the reply from keystone-2 and makes another query;
8. Most likely, this request goes to keystone-3. keystone-3, being a separate process, doesn't know yet about the dead memcache host. [...] 9 seconds overall.

So signing in takes 10 queries to keystone, each taking 3 seconds: 30 seconds overall (see the sketch below).
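
A back-of-the-envelope check of those numbers, as a sketch under the assumption that every one of the ~10 sequential keystone calls lands on a worker that has not yet discovered the dead memcached host:

    # Worst-case extra sign-in delay under the assumptions stated above (sketch).
    SOCKET_TIMEOUT = 3          # seconds, [cache]memcache_socket_timeout
    QUERIES_PER_SIGNIN = 10     # sequential horizon -> keystone requests
    WORKERS_TOTAL = 3 * 6       # 3 controllers x 6 keystone processes

    # Each request that hits a "fresh" worker pays one full socket timeout.
    penalized = min(QUERIES_PER_SIGNIN, WORKERS_TOTAL)
    print("worst-case extra delay: %d s" % (penalized * SOCKET_TIMEOUT))  # 30 s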

What do you folks think?

Revision history for this message
Alexander Makarov (amakarov) wrote :

A solution may consist of 2 parts:
1) info shared across threads in a process;
2) info shared across processes;
2a) info shared across servers.

The 2nd part may be done by storing the info replicated in memcached (see the sketch below).
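
One possible shape of (2)/(2a) would be to replicate the dead-host set through the surviving memcached servers so that every process and server can read it. The sketch below only illustrates the idea and is not the fix that was eventually merged; the server list and key name are made up, and it assumes the python-memcached client:

    # Sketch: share dead-host info between processes/servers via memcached itself.
    import json
    import time
    import memcache   # python-memcached

    ALIVE_SERVERS = ["192.168.0.3:11211", "192.168.0.4:11211"]   # example servers
    KEY = "keystone_dead_memcache_hosts"                         # hypothetical key

    mc = memcache.Client(ALIVE_SERVERS)

    def mark_dead(host, dead_retry=300):
        dead = json.loads(mc.get(KEY) or "{}")
        dead[host] = time.time() + dead_retry        # "down until" timestamp
        mc.set(KEY, json.dumps(dead))

    def is_dead(host):
        dead = json.loads(mc.get(KEY) or "{}")
        return dead.get(host, 0) > time.time()

    mark_dead("192.168.0.5:11211")
    print(is_dead("192.168.0.5:11211"))              # True for every process now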

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Plan for fixing:

1. Try with a small socket_timeout: socket_timeout=1.
2. If the results are still not good enough, reduce the number of processes per controller.
3. If the results are still not good enough, implement a decimal socket_timeout and try with socket_timeout=0.5 (or less).
4. If the results are still not good enough, implement info sharing across processes.

If one of the steps is successful, the resulting code should be tested at scale.

QA folks, could you please try reproducing the bug with socket_timeout=1? You need to set [memcache]socket_timeout = 1 and [cache]memcache_socket_timeout = 1 (a sketch of applying this follows).
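
A sketch of applying those two settings to keystone.conf. It assumes a stock ini layout; interpolation is disabled because keystone.conf may contain '%' placeholders, ConfigParser.write() drops comments, and keystone must be restarted afterwards:

    # Sketch: set the requested socket timeouts in keystone.conf.
    import configparser

    CONF = "/etc/keystone/keystone.conf"

    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(CONF)

    for section in ("memcache", "cache"):
        if not cfg.has_section(section):
            cfg.add_section(section)

    cfg.set("memcache", "socket_timeout", "1")
    cfg.set("cache", "memcache_socket_timeout", "1")

    # Note: this rewrite discards comments in the file; fine for a lab test.
    with open(CONF, "w") as f:
        cfg.write(f)

    # Then: service openstack-keystone restart || restart keystone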

Revision history for this message
Boris Bobrov (bbobrov) wrote :

BTW, this might already be fixed by https://review.openstack.org/#/c/165203/1

Revision history for this message
Maksym Strukov (unbelll) wrote :

Env: ubuntu (neutron vlan, 3 controllers + 3 computes)

{
  "build_id": "2015-03-19_11-00-33",
  "ostf_sha": "9435ec9f0aaf5bb8b9773347af871ab281b9ae78",
  "build_number": "207",
  "release_versions": {
    "2014.2-6.1": {
      "VERSION": {
        "build_id": "2015-03-19_11-00-33",
        "ostf_sha": "9435ec9f0aaf5bb8b9773347af871ab281b9ae78",
        "build_number": "207",
        "api": "1.0",
        "nailgun_sha": "7b4350afb2a13eba11c710aa30e6441ec2993fd5",
        "production": "docker",
        "python-fuelclient_sha": "5260d08776c981d3bd96b0b1d57efaeafeb06326",
        "astute_sha": "f7a93309473ca7673a6593f26f80a9d4fc2bac24",
        "feature_groups": [
          "mirantis"
        ],
        "release": "6.1",
        "fuelmain_sha": "fbfe367850a7e902d67d5ba05bfb9ed937a38dae",
        "fuellib_sha": "4fd580ae876a9565266807a73a4d4d7f02df9e23"
      }
    }
  },
  "auth_required": true,
  "api": "1.0",
  "nailgun_sha": "7b4350afb2a13eba11c710aa30e6441ec2993fd5",
  "production": "docker",
  "python-fuelclient_sha": "5260d08776c981d3bd96b0b1d57efaeafeb06326",
  "astute_sha": "f7a93309473ca7673a6593f26f80a9d4fc2bac24",
  "feature_groups": [
    "mirantis"
  ],
  "release": "6.1",
  "fuelmain_sha": "fbfe367850a7e902d67d5ba05bfb9ed937a38dae",
  "fuellib_sha": "4fd580ae876a9565266807a73a4d4d7f02df9e23"
}

==

I don't see positive changes:
http://paste.openstack.org/show/193632/

Revision history for this message
Boris Bobrov (bbobrov) wrote :

I asked Maksym for ssh credentials and logged in to the env. I added some debug prints around the code, and it seems that anywhere memcache does something it may lead to lags. For example, http://paste.openstack.org/show/193705/ from keystone/common/kvs/backends/memcached.py leads to http://paste.openstack.org/show/193704/ (please note the times).

Sometimes there is a weird gap between releasing connections in the memcache pool and the start of acquiring some lock: http://paste.openstack.org/show/193706/ (note the times).

Anyway, there is not much sense in steps 3 and 4 now.

Revision history for this message
Boris Bobrov (bbobrov) wrote :

keystone.token.persistence.backends.memcache.Token should not be used as the token persistence driver; keystone.token.persistence.backends.memcache_pool.Token should be used instead.

Solution: change
[token]driver=keystone.token.persistence.backends.memcache.Token
to
[token]driver=keystone.token.persistence.backends.memcache_pool.Token

Revision history for this message
Boris Bobrov (bbobrov) wrote :

Typo alert:

Solution: change
[token]driver=keystone.token.backends.memcache.Token
to
[token]driver=keystone.token.persistence.backends.memcache_pool.Token

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Folks, I am not sure what we are talking about. Can we create a separate bug, or describe why we need to set these options? We already have cache/backend set to "keystone.cache.memcache_pool" for both 6.0 and 6.1. Isn't that enough? If it is, then let's close this bug for the Fuel project and leave it to the MOS team to fix the particular memcache_pool implementation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/167863

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Davanum Srinivas (DIMS) (dims-v)
status: New → In Progress
Revision history for this message
Boris Bobrov (bbobrov) wrote :

> Can we create a separate bug, or describe why we need to set these options? We already have cache/backend set to "keystone.cache.memcache_pool" for both 6.0 and 6.1. Isn't that enough?

You are talking about the cache backend. There are many different backends in Keystone; token persistence drivers are one set of them. Right now we use a bad driver, and we need to use a better one.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/167863
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ae7e0e1d3730ba8227c742efac3cd3e8b334d131
Submitter: Jenkins
Branch: master

commit ae7e0e1d3730ba8227c742efac3cd3e8b334d131
Author: Davanum Srinivas <email address hidden>
Date: Wed Mar 25 22:17:35 2015 -0400

    Specify memcache_pool.Token as keystone token driver

    Keystone team has a strong warning in their documentation:
    http://docs.openstack.org/developer/keystone/configuration.html

    to use keystone.token.persistence.backends.memcache_pool.Token
    instead of keystone.token.persistence.backends.memcache.Token
    specifically when deploying Keystone under eventlet.

    Change-Id: I674ee3440bb26af942db2859ed6f8512cbc22fee
    Partial-Bug: #1405549

Revision history for this message
Maksym Strukov (unbelll) wrote :

I applied https://review.openstack.org/#/c/167863/ to iso #234

Results:

201 0.113 -- three controllers active
201 0.171
201 0.119
201 0.165
201 0.108
201 0.092
201 0.099
201 0.097
201 0.106
201 0.201
201 0.113
201 3.125 -- master controller shutting down
201 4.114
503 40.006
201 1.103
201 1.094
201 4.104
201 4.102
201 0.118 -- normalization of delays after 20-30 seconds (only two controllers active)
201 0.091
201 0.103
201 0.135
201 0.092
201 0.143
201 0.102

It seems the patch works for me.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/169248

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I suggest making an HCF exception for this patch to get it into 6.0.1.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.0)

Reviewed: https://review.openstack.org/169248
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ce8bb14fa1da0a964712826d743a3f2afd89b853
Submitter: Jenkins
Branch: stable/6.0

commit ce8bb14fa1da0a964712826d743a3f2afd89b853
Author: Davanum Srinivas <email address hidden>
Date: Wed Mar 25 22:17:35 2015 -0400

    Specify memcache_pool.Token as keystone token driver

    Keystone team has a strong warning in their documentation:
    http://docs.openstack.org/developer/keystone/configuration.html

    to use keystone.token.persistence.backends.memcache_pool.Token
    instead of keystone.token.persistence.backends.memcache.Token
    specifically when deploying Keystone under eventlet.

    Change-Id: I674ee3440bb26af942db2859ed6f8512cbc22fee
    Partial-Bug: #1405549
    (cherry picked from commit ae7e0e1d3730ba8227c742efac3cd3e8b334d131)

Revision history for this message
Alexander Makarov (amakarov) wrote :

We need to cherry-pick the puppet changes ^^ from upstream.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I removed the MOS project target as the issue was on the configuration side, which is Fuel.

no longer affects: mos
no longer affects: mos/6.1.x
no longer affects: mos/6.0.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: there is a backporting guide for the recent memcached fixes under review at https://review.openstack.org/#/c/171181/

tags: added: release-notes-done
removed: release-notes
Revision history for this message
Mike Scherbakov (mihgen) wrote :

According to a conversation with Andrey Grebennikov, it looks like it's not fixed, at least for a large number of CPUs. Andrey will provide more data shortly.

Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

The problem seems not to have been resolved if the controllers have many CPUs (as mentioned above).
Right now we have 82 keystone processes running on every controller. If we reboot a controller, the time to get a token grows to up to 4 seconds. We measure this time (using the script above) while the controller is out, and we keep seeing these numbers the whole time.
If we set the number of keystone public workers to 2 on every controller and do the reboot again, the times normalize within 30 seconds.
I attach the outputs of the script here.

Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

The last file shows the behaviour when the controller goes down and, at the end, when it comes back. In fact, during the whole reboot the time to get a token is very long, about 4 seconds.

Revision history for this message
Alexander Makarov (amakarov) wrote :

Fernet tokens will effectively remove the cause of the problem: the token storage.

Revision history for this message
Alexander Makarov (amakarov) wrote :
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi,

we tested OpenStack clouds under load with different numbers of API workers and different numbers of OpenStack controller nodes, and we found the following numbers to work well:

We need to run 10-30 workers on each controller (the exact number depends on the controller node configuration, but it should be no more than 30; beyond that we see performance degradation because workers get blocked by simultaneous operations on the database/RabbitMQ/other resources).

To scale the cloud we need to change the number of OpenStack controller nodes. For example, if we have 3 controllers and the cloud can handle 300 users in parallel, we can add 2 controllers (to get 5) and the cloud will support more users working in parallel.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Fernet tokens in Kilo make this bug invalid for 7.0.

tags: added: customer-found
Revision history for this message
Andrew Woodward (xarses) wrote :

Are we already using Fernet tokens in 7.0? If not, this needs to be solved for 7.0.

Revision history for this message
Roman Alekseenkov (ralekseenkov) wrote :

Moving back to Confirmed, as Fernet tokens haven't yet landed in 7.0

Revision history for this message
Roman Alekseenkov (ralekseenkov) wrote :

I'd like us to be more mindful regarding bugs, especially customer-found.

We are closing bugs, "hoping" that they will be resolved. This is not the way to go.

I'd like this issue to be resolved only after:
1) Fernet tokens land in 7.0
2) Development verifies the original scenario (Keystone become extremely slow after controller shutdown) and confirms that it's not reproducible in 7.0

Revision history for this message
Boris Bobrov (bbobrov) wrote :

This should be fixed by the patch for https://bugs.launchpad.net/mos/+bug/1471318 . Let's wait for the QA verification that Vitaly Sedelnik has asked to perform.

Revision history for this message
Yaroslav Lobankov (ylobankov) wrote :

It looks like the issue is fixed. I have deployed the following cluster:

Ubuntu, Neutron with VLAN segmentation, Ceph for volumes, Ceph for images, Ceph for ephemeral volumes, Ceilometer enabled, 3 (controller + mongo), 1 (compute + ceph-osd).

Deployment has passed successfully. What I have done:

1. I have run all Sanity, Functional and HA tests. All tests passed.
2. I have gone to one of the controllers and executed the command from the bug description. The result is here: http://paste.openstack.org/show/406576/ . As we can see, token creation took 0.1-0.2 seconds on average. Opening the "/horizon/admin/networks/" tab in a browser took 2-3 seconds on average.
3. I have destroyed one of the controllers and waited for 10 minutes just in case.
4. I have repeated step 1. All tests passed except the "Check that required services are running" test and the "Check network connectivity from instance via floating IP" test (the traceback is http://paste.openstack.org/show/406578/).
5. I have repeated step 2. The result is here: http://paste.openstack.org/show/406579/ . As we can see, token creation again took 0.1-0.2 seconds on average, and opening the "/horizon/admin/networks/" tab in a browser again took 2-3 seconds on average.

Revision history for this message
Yaroslav Lobankov (ylobankov) wrote :

I forgot to mention that all controllers were VMs and the compute node was a bare-metal server.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Thank you. I'm setting this bug to Fix Committed then, so our QA team can verify it.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Setting to Incomplete for 6.0-updates and 6.1 as it's not clear which patch fixed the original issue.

Boris, Bogdan, could you help identify which patch needs to be backported from 7.0 to 6.0-updates and 6.1 to fix the issue?

tags: added: on-verification
Revision history for this message
Vasiliy Pleshakov (vpleshakov) wrote :

Verified with fuel-7.0-256-2015-08-30_09-35-52.iso

tags: removed: on-verification
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

I have tried to reproduce the test results provided by Maksym Strukov (unbelll) on MOS 6.1 and found that the request timings are identical before and after a controller shutdown. Please provide steps to reproduce for MOS 6.1.

All controllers are working:
# while :; do curl -s -w "%{http_code} %{time_total}" -i -X POST -H "Content-Type: application/json" \
  -d '{"auth":{"identity":{"methods":["password"],"password":{"user":{"name":"admin","domain":{"id":"default"},"password":"admin"}}}}}' \
  http://10.88.0.82:5000/v3/auth/tokens -o /dev/null; echo; sleep 1; done
201 0.162
201 0.117
201 0.173
201 0.110
201 0.140
201 0.114
201 0.207
....

Shut down one controller:
# while :; do curl -s -w "%{http_code} %{time_total}" -i -X POST -H "Content-Type: application/json" \
  -d '{"auth":{"identity":{"methods":["password"],"password":{"user":{"name":"admin","domain":{"id":"default"},"password":"admin"}}}}}' \
  http://10.88.0.82:5000/v3/auth/tokens -o /dev/null; echo; sleep 1; done
201 0.147
201 0.137
201 0.127
201 0.151
201 0.149
201 0.115
201 0.117
...

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@astupnikov

>Please provide steps-to-reproduce for MOS 6.1.

The steps are the same. If you can't reproduce this issue in 6.1, then it means the bug was fixed in 6.1.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing this as Invalid for 6.1-updates as per comments #50 and #51.
