Lack of free connections to memcached cause keystone middleware to stall

Bug #1653071 reported by Ilya Shakhat
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
fuel-ccp
Fix Released
High
Fuel CCP Bug Team

Bug Description

Deployment: bare-metal, 200 nodes (http://docs.openstack.org/developer/performance-docs/labs/intel_lab_1.html)
Software: Ubuntu 16.04, k8s by Kargo, services deployed in HA configuration (3 replicas), memcached on a separate node.

********************

It's observed that duration of requests to API services is distributed in following manner: 2/3 last 10-20ms, 1/3 last slightly more than 3 seconds (see chart attached). Even if the same request repeated many times it may finish fast or last long. The issue even affects requests that do not do any DB calls:

vagrant@node1:~$ time curl -g -iv -X GET http://neutron-server.ccp:9696/v2.0/ -H "User-Agent: fpython-neutronclient" -H "Accept: application/json" -H "X-Auth-Token: gAAAAABYYr8WzhWMk68aqk89WyIE71Z-dMoIb5_4Hl3PMZ3e2tEAsNjpRGWuzPKO6YL_OoIwM3Hh_A6gM6WifVMGMRQzy_dqTMA6b55okJ1SgdmnlCODiuJWQvrg0Yrn09C29eQQ17DR2z6RQq_Zi3vQiiX5Jk7-Uj3Swv3zh82zfEOHMSWdddk"
Note: Unnecessary use of -X or --request, GET is already inferred.
* Trying 10.224.233.196...
* Connected to neutron-server.ccp (10.224.233.196) port 9696 (#0)
> GET /v2.0/ HTTP/1.1
> Host: neutron-server.ccp:9696
> User-Agent: fpython-neutronclient
> Accept: application/json
> X-Auth-Token: gAAAAABYYr8WzhWMk68aqk89WyIE71Z-dMoIb5_4Hl3PMZ3e2tEAsNjpRGWuzPKO6YL_OoIwM3Hh_A6gM6WifVMGMRQzy_dqTMA6b55okJ1SgdmnlCODiuJWQvrg0Yrn09C29eQQ17DR2z6RQq_Zi3vQiiX5Jk7-Uj3Swv3zh82zfEOHMSWdddk
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Content-Type: application/json
Content-Type: application/json
< Content-Length: 536
Content-Length: 536
< X-Openstack-Request-Id: req-7d402b7b-3ddb-4619-ab8c-c0a7ea573068
X-Openstack-Request-Id: req-7d402b7b-3ddb-4619-ab8c-c0a7ea573068
< Date: Tue, 27 Dec 2016 20:10:51 GMT
Date: Tue, 27 Dec 2016 20:10:51 GMT

<
* Connection #0 to host neutron-server.ccp left intact
{"resources": [{"links": [{"href": "http://neutron-server.ccp:9696/v2.0/subnets", "rel": "self"}], "name": "subnet", "collection": "subnets"}, {"links": [{"href": "http://neutron-server.ccp:9696/v2.0/subnetpools", "rel": "self"}], "name": "subnetpool", "collection": "subnetpools"}, {"links": [{"href": "http://neutron-server.ccp:9696/v2.0/networks", "rel": "self"}], "name": "network", "collection": "networks"}, {"links": [{"href": "http://neutron-server.ccp:9696/v2.0/ports", "rel": "self"}], "name": "port", "collection": "ports"}]}
real 0m3.069s
user 0m0.012s
sys 0m0.004s

********************

In the logs there is a gap in 3 seconds:

2016-12-27 20:11:04.488 37 DEBUG keystonemiddleware.auth_token [-] Authenticating user token process_request /var/lib/microservices/venv/local/lib/python2.7/site-packages/keystonemi
ddleware/auth_token/__init__.py:346
2016-12-27 20:11:07.516 37 DEBUG keystoneauth.session [-] REQ: curl -g -i -X GET http://keystone.ccp.svc.cluster.local:35357/v3/auth/tokens -H "X-Subject-Token: {SHA1}96b8508fa9dd86
7b4c3e0bb65cb1f5f9406248f8" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}e7e72c5b27ec8b68521d05ac933835a17d9b0de3" _http_log_request
/var/lib/microservices/venv/local/lib/python2.7/site-packages/keystoneauth1/session.py:337

The issue affects not only Neutron but any projects using Keystone middleware.

********************

Part of API service config:

[keystone_authtoken]
auth_uri = http://keystone.ccp.svc.cluster.local:5000
auth_url = http://keystone.ccp.svc.cluster.local:35357
auth_type = password
project_domain_id = default
user_domain_id = default
project_name = service
username = neutron
password = password
memcached_servers = memcached.ccp.svc.cluster.local:11211

Tags: scale
Revision history for this message
Ilya Shakhat (shakhat) wrote :
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Keystone middleware uses memached-based cache. However memcached client is threading.local, meaning it will be not shared between different greenlets. Client pool size is not limited (https://github.com/openstack/keystonemiddleware/blob/stable/newton/keystonemiddleware/auth_token/_cache.py#L71-L86) and grow infinitely. Taking into account that there are many workers (default to number of cores =48) and number of greenlets it is very easy to go beyond default connection limit of memcached (=1024 https://github.com/memcached/memcached/wiki/ConfiguringServer#connection-limit).

When the connection limit is reached memcached stops accepting new connections. From client side the attempt lasts for 3 seconds (https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L106), after that the connection fails, token is not retrieved and middleware goes directly to Keystone (this is observed in logs).

Memcached monitors number of rejected connections (the number if far beyond the limit):

root@memcached-288280083-gadd7:/# echo "stats" | nc localhost 11211
...
STAT listen_disabled_num 38200

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Keystone middleware has option "use_advanced_pool" which is stated as "An advanced memcached client pool that is eventlet safe". However with this option enabled the following error observed: http://paste.openstack.org/show/593554/

tags: added: scale
Changed in fuel-ccp:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Fuel CCP Bug Team (fuel-ccp-bugs)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-ccp-memcached (master)

Reviewed: https://review.openstack.org/415661
Committed: https://git.openstack.org/cgit/openstack/fuel-ccp-memcached/commit/?id=08eec5545492e62685404ecb0cdd528b172841e6
Submitter: Jenkins
Branch: master

commit 08eec5545492e62685404ecb0cdd528b172841e6
Author: Proskurin Kirill <email address hidden>
Date: Thu Dec 29 09:56:02 2016 +0000

    Bump memcached defaults a bit

    Plus make max_conn configurable(default is 1024)

    Closes-Bug: #1653071
    Change-Id: Iccdf7a75b3f2c3400b49dce36571206a8c0d34d3

Changed in fuel-ccp:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.