[library] 'keystone' delays lead to unstable MOS operations

Bug #1340657 reported by Dennis Dmitriev
This bug affects 9 people
Affects              Status        Importance  Assigned to         Milestone
Fuel for OpenStack   Fix Released  Critical    Aleksandr Didenko
  5.0.x              Won't Fix     High        Sergii Golovatiuk
  5.1.x              Fix Released  Critical    Aleksandr Didenko
  6.0.x              Fix Released  Critical    Aleksandr Didenko

Bug Description

Upstream bug: https://bugs.launchpad.net/keystone/+bug/1332058

Confirmed for OpenStack HA environment.

When several OpenStack controller nodes are used with 'memcached' installed on each of them, each 'keystone' instance is configured to use all of the 'memcached' instances.
Thus, if one of the controller nodes becomes inaccessible, the whole cluster may stop working because of the long delays introduced by the memcached backend.

There is currently no acceptable workaround that would allow using all of the available 'memcached' instances without this kind of issue.
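
A quick way to observe the failure mode (illustrative only; assumes the primary controller from the steps below has been shut down, and uses the default memcached port): a TCP connect to a dead memcached hangs until the client-side timeout expires, and 'keystone' pays that price on its internal calls.

# With 192.168.10.3 down, this waits out the full 3-second timeout
# instead of returning immediately:
time bash -c 'echo version | nc -w 3 192.168.10.3 11211'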

Changed in mos:
assignee: nobody → MOS Keystone (mos-keystone)
Changed in mos:
milestone: none → 5.0.1
Changed in fuel:
assignee: nobody → Fuel for Openstack (fuel)
Igor Marnat (imarnat)
Changed in mos:
assignee: MOS Keystone (mos-keystone) → Yuriy Taraday (yorik-sar)
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

May be similar to bug: https://bugs.launchpad.net/fuel/+bug/1337923

Steps to reproduce:
1. Create an HA cluster with CentOS and Neutron VLAN
2. Add 3 nodes with the controller role
3. Add 2 nodes with the compute role
4. Deploy the cluster
5. Run network verification
6. Run OSTF
7. Turn off the primary controller
8. Wait about 5 minutes
9. Run OSTF

Expected results:
Only "Check that required services are running" fails;
all "Functional tests" and "HA tests" pass.

Actual results:
"Functional tests" that launch an instance fail with "Timed out waiting to become ACTIVE..." errors.

To confirm that this OSTF behaviour is caused by 'keystone':

1. On the remaining operational controllers, remove the IP of the primary controller (192.168.10.3 in the following example) from the IP lists in the 'keystone' config file /etc/keystone/keystone.conf:
...
[cache]
backend_argument=url:192.168.10.3,192.168.10.4,192.168.10.5
...
[memcache]
servers=192.168.10.3:11221,192.168.10.4:11221,192.168.10.5:11221
...

2. Restart the openstack-keystone service on all controllers (see the sketch below)
3. Run OSTF
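
Steps 1-2 can be scripted; a minimal sketch, assuming the example IPs above (the sed patterns are illustrative, so check them against the actual config before running):

# Run on each surviving controller: drop the dead controller's IP
# from both memcached server lists, then restart keystone.
sed -i 's/192\.168\.10\.3,//g' /etc/keystone/keystone.conf
sed -i 's/192\.168\.10\.3:11221,//g' /etc/keystone/keystone.conf
service openstack-keystone restart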

Changed in mos:
status: New → Invalid
Revision history for this message
Yuriy Taraday (yorik-sar) wrote :

This issue illustrates the problem with using a cache as storage: you shouldn't do that.

There are two places where Keystone can use memcached: as a storage backend for tokens, and as a cache for everything else.

For caching, one shouldn't use all of the memcached instances: it would be more optimal to use only the instance that resides on the same controller. That would make each controller less dependent on its neighbours and remove unnecessary network interactions.

For tokens, one should just use PKI, which doesn't hit the storage backend constantly. With the introduction of revocation events, persistent token storage became obsolete.
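
For the caching case, that suggestion would presumably translate into pointing each controller's keystone at its own memcached only; a hedged keystone.conf sketch (address illustrative):

[cache]
backend_argument=url:127.0.0.1

This removes the cross-controller dependency for cached data, at the cost of each node warming its own cache.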

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Fuel configures memcached as a storage backend for keystone tokens only. The problem is that when you have several memcached servers and a round-robin mechanism for the API, the memcached client has to iterate over all memcached backends to find a key-value pair.

Changed in mos:
status: Invalid → Confirmed
Revision history for this message
Yuriy Taraday (yorik-sar) wrote :

As I said, you should use PKI instead.

Mike Scherbakov (mihgen)
tags: added: release-notes
Changed in mos:
milestone: 5.0.1 → 5.1
tags: added: keystone
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Perhaps we should try PKI mode once the puppet-keystone upstream sync patches are merged.

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Dmitry Ilyin (idv1985)
summary: - 'keystone' delays lead to unstable MOS operations
+ [library] 'keystone' delays lead to unstable MOS operations
no longer affects: mos
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

For the PKI tokens case we could also consider putting the memcached servers behind HAProxy and configuring the LB by specifying a single backend_argument. Makes sense?

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Bogdan, please add action items for repairing this issue.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Action items could be as follows:
* Merge the keystone upstream manifests once the other dependent modules (neutron, etc.) are synced as well,
OR just use the keystone module we have in the library now
* Configure keystone to be deployed in PKI mode instead of UUID (the chances of doing so are much better using the upstream manifests, though)
* Try to reproduce the bug again

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

According to the latest tests, this bug doesn't affect Fuel anymore.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

In order to fix the issue we should add the libmemcached and python-pylibmc packages.
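
A minimal sketch of pulling those in on a CentOS controller (package names as given above; the Ubuntu names are assumptions):

yum install -y libmemcached python-pylibmc
# On Ubuntu-based controllers, presumably:
# apt-get install -y libmemcached10 python-pylibmc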

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/112829

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

api: '1.0'
astute_sha: b52910642d6de941444901b0f20e95ebbcb2b2e9
auth_required: false
build_id: 2014-08-10_02-01-17
build_number: '420'
feature_groups:
- mirantis
fuellib_sha: d699fc178559e98cfd7d53b58478b46553ffe39e
fuelmain_sha: 9d4463400b4924159c978af43855e48bcf2a84b2
nailgun_sha: 2741cdc0f0615263db2f176899d406207ec4ac04
ostf_sha: acf52a59e04fa74d2ed2b68ea225f4d24403b264
production: docker
release: '5.1'

Can't confirm that this issue doesn't affect Fuel anymore. I just tested keystone performance on bare metal and got the following results when all controllers are online:

[root@node-6 ~]# time for i in {1..10}; do keystone user-list &>/dev/null; done

real 0m6.785s
user 0m3.902s
sys 0m0.513s

and after shutting down the primary controller:

[root@node-6 ~]# time for i in {1..10}; do keystone user-list &>/dev/null; done

real 1m31.950s
user 0m3.884s
sys 0m0.524s

As you can see, the difference is huge, and that's why all API calls become very slow if one of the controllers is down. When I redirect the packets sent to the inaccessible memcached instance to localhost on the running controllers (iptables -t nat -I OUTPUT -d 192.168.0.3 -p tcp --dport 11211 -j DNAT --to 127.0.0.1:11211), everything becomes fine.
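
That redirect as a repeatable check, with cleanup (IP and port taken from the comment above):

# Divert traffic aimed at the dead controller's memcached to the local instance:
iptables -t nat -I OUTPUT -d 192.168.0.3 -p tcp --dport 11211 -j DNAT --to 127.0.0.1:11211
# ...re-run the timing loop above, then remove the rule:
iptables -t nat -D OUTPUT -d 192.168.0.3 -p tcp --dport 11211 -j DNAT --to 127.0.0.1:11211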

Sergii, JFYI, I've tried to use the dogpile.cache.pylibmc driver for keystone (manually installed pylibmc 1.3.0 and the latest libmemcached on the controllers), and when the primary controller (192.168.0.3) is down I always get the following error from keystone:

http://paste.openstack.org/show/92992/

I also tried to use bmemcached as a backend, but then it always returns a 504 error if the old primary controller is down:

http://paste.openstack.org/show/92994/

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Artem,

I have identical results on my environment. Unfortunately, there is no easy fix. The MOS Keystone team is working on making the backend behaviors for pylibmc configurable. That will allow Fuel operators to specify behavior more optimal than the default one.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

I marked https://bugs.launchpad.net/mos/+bug/1353419 as a duplicate of this bug; I think this bug should also be marked as affecting mos.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Dmitry, on the MOS side this bug is tracked in https://bugs.launchpad.net/keystone/+bug/1332058. I know we decided not to use upstream bugs for MOS+Fuel development, but that one was created before our decision took effect and we keep using it.

Revision history for this message
XiaBing Yao (yao3690093-o) wrote :

Using haproxy can solve this problem.

This is my haproxy.cfg section for the memcache servers:
listen memcache
    bind controller:11221
    balance source
    mode tcp
    server controller1 controller1:11211 check inter 2000 rise 2 fall 3
    server controller2 controller2:11211 check inter 2000 rise 2 fall 3 backup

vim /etc/keystone/keystone.conf
[memcache]
servers = controller:11221

vim /etc/nova/nova.conf
[DEFAULT]
memcached_servers = controller:11221
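
Note that with 'balance source' and the second server marked 'backup', haproxy effectively runs memcached as active/passive: all traffic goes to controller1 until its health checks fail. A quick sanity check of the proxied endpoint (assumes nc is installed and the hostname resolves):

# Should print memcached stats, served via haproxy:
echo stats | nc -w 2 controller 11221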

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-main (master)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: master
Review: https://review.openstack.org/112829

Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

Guys, I think this is the root cause of all the problems in 5.0 and later releases of Fuel:
https://bugs.launchpad.net/keystone/+bug/1360446

And regarding memcached under HA: it wouldn't work under an LB, because the memcached servers are not synchronized with each other and each holds completely different data. Here are a couple of ways of solving that:
1. Spawn memcached using corosync in the haproxy/VIP address namespace on only one node, using the management VIP to access haproxy.
2. I hate putting haproxy in front of memcached, but the only way (as I see it) to have memcached under haproxy is to have one ACTIVE memcached and keep the remaining ones in BACKUP/PASSIVE mode.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

A VIP controlled by corosync for memcached is something we want to avoid. There are many problems, especially with horizontal scalability, as we'd be limited to one instance of memcached. We are trying to use pylibmc/libmemcached right now, which will allow us to detect "dead" memcached servers, spread load across all memcached servers, and get good horizontal scalability by distributing load across all controllers.

Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

What is the appropriate text for the Release Notes? The current draft is in https://review.openstack.org/#/c/117338/3/pages/release-notes/v5-1/038-resolved-ha-issues.rst -- what I have for this issue is clearly not correct.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Meg, the fix for the bug is currently in progress. If we fully fix it, we will not need to put info about it in the release notes for 5.1/5.0.2. For your reference, the same bug on the MOS side is tracked here: https://bugs.launchpad.net/keystone/+bug/1332058

Revision history for this message
Yuriy Taraday (yorik-sar) wrote :

I've prepared a patch in Keystone that should also fix this problem: https://gerrit.mirantis.com/21408
This patch requires some changes in the config (only changed/added config values are shown):

[cache]
backend=keystone.cache.memcache_pool # turn on the new cache backend
backend_argument=pool_maxsize:100 # add this one, and keep "backend_argument=url:<...>" there

Other options might be needed in the future as well (see the commit message for details), but these two should be enough.
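
Assembled with the earlier url argument, the resulting section would presumably read as follows (backend_argument is a multi-valued option, so it appears twice; addresses reuse the earlier example):

[cache]
backend=keystone.cache.memcache_pool
backend_argument=url:192.168.10.3,192.168.10.4,192.168.10.5
backend_argument=pool_maxsize:100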

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/118382

Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Aleksandr Didenko (adidenko)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/118382
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=203ef3179007cffe3236032e61ecbaf1cd20605f
Submitter: Jenkins
Branch: master

commit 203ef3179007cffe3236032e61ecbaf1cd20605f
Author: Vladimir Kuklin <email address hidden>
Date: Tue Sep 2 19:30:36 2014 +0400

    Change keystone memcache backend

    Use keystone memcache pool backend
    in order to fix scalability and
    failover problems

    Also fixing file_line "match" + "after" insertion logic.

    Change-Id: Idfe4b54caa0d96a93e93bfff12d8b6216f83e2f1
    Closes-bug: #1364401
    Closes-bug: #1340657

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Now that we've ported Yuriy's fix to 5.0.2, we need to adjust the 5.0.2 puppet manifests as well.

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/122230

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/5.0)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/122230

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

This change requires the keystone package from 5.1, where Yuriy Taraday added a new eventlet-safe memcache driver for keystone.

no longer affects: fuel/6.1.x
no longer affects: fuel/5.1.x
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135629

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

OK, I've checked 5.1.1. It still has the older version of keystone.cache.memcache_pool with the old arguments (backend and backend_argument), so no changes are needed for 5.1.1 and the fix is already released in 5.1.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/135629
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=4900833f4d355fe729f20cc57aef341b4fb248c2
Submitter: Jenkins
Branch: master

commit 4900833f4d355fe729f20cc57aef341b4fb248c2
Author: Aleksandr Didenko <email address hidden>
Date: Wed Nov 19 17:32:00 2014 +0200

    Restore keystone cache.memcache_pool backend

    Restore lost change https://review.openstack.org/#/c/118382/
    Plus refactor to setup valid keystone.cache.memcache_pool params.

    Closes-bug: #1340657
    Change-Id: Id30ad74afe8cef17a6c42f6dfc5c9c581348bc1f

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #49

"build_id": "2014-12-09_22-41-06", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "49", "auth_required": true, "api": "1.0", "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1", "production": "docker", "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-09_22-41-06", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "49", "api": "1.0", "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1", "production": "docker", "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"}}}, "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"

tags: removed: on-verification
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/143084

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Guys, it seems the problem still exists; I reproduced this issue on a bare metal lab. Here are my steps:

1. Create an HA environment (Ubuntu + Neutron VLAN + Ceph + Ceilometer).
2. Add 3 controller+mongo and 4 compute+ceph nodes.
3. Deploy changes. Run OSTF. All tests passed.
4. Shut down one of the controllers (or just block traffic to the memcached running on it: `iptables -I INPUT -p tcp --dport 11211 -j DROP`)

Expected result:

 - OS services continue to work fine, environment passes health checks

Actual result:

 - lots of OSTF smoke tests fail with the timeout error "AssertionError: Failed to get to expected status. In error state."

When all (3) memcached instances are available, keystone needs 0.1-0.2 seconds to create a new token:

http://paste.openstack.org/show/154430/

and opening the '/horizon/admin/networks/' URL in a browser takes 2-3 seconds.

After destroying one of the controllers, keystone spends more than 3 seconds creating a token:

http://paste.openstack.org/show/154431/

Loading the '/horizon/admin/networks/' URL in a browser takes 10-15 seconds.

Fuel version (6.0 #56): http://paste.openstack.org/show/154432/

Unfortunately, I can't provide a diagnostic snapshot now because it's too huge, but I will try to reproduce the issue on another fresh environment and attach logs.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Artem, please create another issue; it looks like we have incorrect keystone behaviour here. Discuss it with Vova K., he can provide more information.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/5.0)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/143084
