fuel-ccp

Lack of free connections to memcached cause keystone middleware to stall

Bug #1653071 reported by Ilya Shakhat on 2016-12-29

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	fuel-ccp	Fix Released	High	Fuel CCP Bug Team

Bug Description

Deployment: bare-metal, 200 nodes (http://docs.openstack.org/developer/performance-docs/labs/intel_lab_1.html)
Software: Ubuntu 16.04, k8s by Kargo, services deployed in HA configuration (3 replicas), memcached on a separate node.

********************

It's observed that duration of requests to API services is distributed in following manner: 2/3 last 10-20ms, 1/3 last slightly more than 3 seconds (see chart attached). Even if the same request repeated many times it may finish fast or last long. The issue even affects requests that do not do any DB calls:

vagrant@node1:~$ time curl -g -iv -X GET http://neutron-server.ccp:9696/v2.0/ -H "User-Agent: fpython-neutronclient" -H "Accept: application/json" -H "X-Auth-Token: gAAAAABYYr8WzhWMk68aqk89WyIE71Z-dMoIb5_4Hl3PMZ3e2tEAsNjpRGWuzPKO6YL_OoIwM3Hh_A6gM6WifVMGMRQzy_dqTMA6b55okJ1SgdmnlCODiuJWQvrg0Yrn09C29eQQ17DR2z6RQq_Zi3vQiiX5Jk7-Uj3Swv3zh82zfEOHMSWdddk"
Note: Unnecessary use of -X or --request, GET is already inferred.
* Trying 10.224.233.196...
* Connected to neutron-server.ccp (10.224.233.196) port 9696 (#0)
> GET /v2.0/ HTTP/1.1
> Host: neutron-server.ccp:9696
> User-Agent: fpython-neutronclient
> Accept: application/json
> X-Auth-Token: gAAAAABYYr8WzhWMk68aqk89WyIE71Z-dMoIb5_4Hl3PMZ3e2tEAsNjpRGWuzPKO6YL_OoIwM3Hh_A6gM6WifVMGMRQzy_dqTMA6b55okJ1SgdmnlCODiuJWQvrg0Yrn09C29eQQ17DR2z6RQq_Zi3vQiiX5Jk7-Uj3Swv3zh82zfEOHMSWdddk
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Content-Type: application/json
Content-Type: application/json
< Content-Length: 536
Content-Length: 536
< X-Openstack-Request-Id: req-7d402b7b-3ddb-4619-ab8c-c0a7ea573068
X-Openstack-Request-Id: req-7d402b7b-3ddb-4619-ab8c-c0a7ea573068
< Date: Tue, 27 Dec 2016 20:10:51 GMT
Date: Tue, 27 Dec 2016 20:10:51 GMT

<
* Connection #0 to host neutron-server.ccp left intact
{"resources": [{"links": [{"href": "http://neutron-server.ccp:9696/v2.0/subnets", "rel": "self"}], "name": "subnet", "collection": "subnets"}, {"links": [{"href": "http://neutron-server.ccp:9696/v2.0/subnetpools", "rel": "self"}], "name": "subnetpool", "collection": "subnetpools"}, {"links": [{"href": "http://neutron-server.ccp:9696/v2.0/networks", "rel": "self"}], "name": "network", "collection": "networks"}, {"links": [{"href": "http://neutron-server.ccp:9696/v2.0/ports", "rel": "self"}], "name": "port", "collection": "ports"}]}
real 0m3.069s
user 0m0.012s
sys 0m0.004s

********************

In the logs there is a gap in 3 seconds:

2016-12-27 20:11:04.488 37 DEBUG keystonemiddleware.auth_token [-] Authenticating user token process_request /var/lib/microservices/venv/local/lib/python2.7/site-packages/keystonemi
ddleware/auth_token/__init__.py:346
2016-12-27 20:11:07.516 37 DEBUG keystoneauth.session [-] REQ: curl -g -i -X GET http://keystone.ccp.svc.cluster.local:35357/v3/auth/tokens -H "X-Subject-Token: {SHA1}96b8508fa9dd86
7b4c3e0bb65cb1f5f9406248f8" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}e7e72c5b27ec8b68521d05ac933835a17d9b0de3" _http_log_request
/var/lib/microservices/venv/local/lib/python2.7/site-packages/keystoneauth1/session.py:337

The issue affects not only Neutron but any projects using Keystone middleware.

********************

Part of API service config:

[keystone_authtoken]
auth_uri = http://keystone.ccp.svc.cluster.local:5000
auth_url = http://keystone.ccp.svc.cluster.local:35357
auth_type = password
project_domain_id = default
user_domain_id = default
project_name = service
username = neutron
password = password
memcached_servers = memcached.ccp.svc.cluster.local:11211

Tags:

Revision history for this message

Ilya Shakhat (shakhat) wrote on 2016-12-29:

Distribution of request duration Edit (9.9 KiB, image/png)

Revision history for this message

Ilya Shakhat (shakhat) wrote on 2016-12-29:

Keystone middleware uses memached-based cache. However memcached client is threading.local, meaning it will be not shared between different greenlets. Client pool size is not limited (https://github.com/openstack/keystonemiddleware/blob/stable/newton/keystonemiddleware/auth_token/_cache.py#L71-L86) and grow infinitely. Taking into account that there are many workers (default to number of cores =48) and number of greenlets it is very easy to go beyond default connection limit of memcached (=1024 https://github.com/memcached/memcached/wiki/ConfiguringServer#connection-limit).

When the connection limit is reached memcached stops accepting new connections. From client side the attempt lasts for 3 seconds (https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L106), after that the connection fails, token is not retrieved and middleware goes directly to Keystone (this is observed in logs).

Memcached monitors number of rejected connections (the number if far beyond the limit):

root@memcached-288280083-gadd7:/# echo "stats" | nc localhost 11211
...
STAT listen_disabled_num 38200

Revision history for this message

Ilya Shakhat (shakhat) wrote on 2016-12-29:

Keystone middleware has option "use_advanced_pool" which is stated as "An advanced memcached client pool that is eventlet safe". However with this option enabled the following error observed: http://paste.openstack.org/show/593554/

tags:

added: scale

Sergey Reshetnyak (sreshetniak) on 2016-12-29