glanceclient is not handling http connection pools properly with python-requests

Bug #1341777 reported by Matt Riedemann
Affects             Status        Importance  Assigned to     Milestone
Glance Client       Fix Released  High        Ian Cordasco
glance_store        Invalid       Medium      Ian Cordasco
python-swiftclient  Fix Released  Undecided   Ian Cordasco

Bug Description

This:

message:"HttpConnectionPool is full, discarding connection: 127.0.0.1" AND tags:"screen-g-api.txt"

shows up nearly 420K times in 7 days in Jenkins runs, on 100% successful jobs, so it's not a very good warning if it's expected to happen this much. It should be debug level if it's this pervasive, unless it's masking something else, but the logstash trends don't suggest that.

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSHR0cENvbm5lY3Rpb25Qb29sIGlzIGZ1bGwsIGRpc2NhcmRpbmcgY29ubmVjdGlvbjogMTI3LjAuMC4xXCIgQU5EIHRhZ3M6XCJzY3JlZW4tZy1hcGkudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MDUzNzE0NTU4NTZ9

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like it might be coming from oslo's log module for default log levels:

http://git.openstack.org/cgit/openstack/glance/tree/glance/openstack/common/log.py#n168
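
If that's where it's coming from, one consumer-side workaround would be to raise that logger's threshold, e.g. (the logger name below assumes the urllib3 vendored into requests at the time, so treat it as an assumption):

    import logging

    # drop the WARN-level "pool is full" messages from the vendored
    # urllib3; only ERROR and above get through
    logging.getLogger(
        'requests.packages.urllib3.connectionpool').setLevel(logging.ERROR)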

Revision history for this message
Matt Riedemann (mriedem) wrote :

Hmm, note this ML thread that sounds similar:

http://lists.openstack.org/pipermail/openstack-dev/2014-March/030763.html

And oslo bug 1297950.

Revision history for this message
Matt Riedemann (mriedem) wrote :

So the problem is just that glance isn't setting a higher default max pool size; it defaults to 1, so any multi-threaded environment is going to top that out quickly:

https://urllib3.readthedocs.org/en/latest/pools.html
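
For illustration, a minimal sketch of that urllib3 behavior (the host, port, and thread count are made up, not taken from the gate jobs):

    import concurrent.futures

    import urllib3

    # maxsize=1 is HTTPConnectionPool's default; concurrent requests
    # beyond that open extra sockets that can't be returned to the
    # pool and get dropped with "Connection pool is full, discarding
    # connection" warnings.
    pool = urllib3.HTTPConnectionPool('127.0.0.1', port=8080, maxsize=1)

    def fetch(_):
        return pool.request('GET', '/').status

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        print(list(executor.map(fetch, range(10))))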

For comparison, it looks like nova sets wsgi_default_pool_size=1000 by default.

http://docs.openstack.org/trunk/config-reference/content/list-of-compute-config-options.html

Neutron sets the request backlog to 4096:

http://docs.openstack.org/trunk/config-reference/content/networking-options-wsgi.html

heat-api sets the request backlog size to 4096 also:

http://docs.openstack.org/trunk/config-reference/content/orchestration-configuring-api.html

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1295812 might be related.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The wsgi server code in glance defaults the number of threads in the eventlet green pool to 1000 (like nova); it's not configurable:

http://git.openstack.org/cgit/openstack/glance/tree/glance/common/wsgi.py#n191

The self.threads variable in the wsgi server is only used if create_pool is called:

http://git.openstack.org/cgit/openstack/glance/tree/glance/common/wsgi.py#n254

That creates the eventlet thread pool:

    def create_pool(self):
        return eventlet.GreenPool(size=self.threads)

That is only called from start() if CONF.workers == 0, which is not the default (only set in functional tests from what I see):

http://git.openstack.org/cgit/openstack/glance/tree/glance/common/wsgi.py#n243

There is a run_server method that will create the eventlet thread pool regardless of CONF.workers, but that's not called from glance/cmd/api.py or glance/cmd/registry.py (those call Server.start() to start the daemon).

http://git.openstack.org/cgit/openstack/glance/tree/glance/common/wsgi.py#n311

It looks like Server.start() will call Server.run_child() for each worker set in CONF.workers, and run_child() will fork a process. If the pid is 0 (i.e. in the child process), it breaks out and calls run_server(), which will create the eventlet thread pool with 1000 threads.
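
A heavily abbreviated sketch of that start()/run_child()/run_server() flow (the method names follow glance/common/wsgi.py, the bodies here do not):

    import os

    import eventlet

    class Server(object):
        def __init__(self):
            self.threads = 1000  # hard-coded, not configurable

        def create_pool(self):
            return eventlet.GreenPool(size=self.threads)

        def start(self, workers):
            if workers == 0:
                # single-process mode, e.g. functional tests
                self.pool = self.create_pool()
            else:
                for _ in range(workers):
                    self.run_child()

        def run_child(self):
            pid = os.fork()
            if pid == 0:
                # child process: serves requests on its own
                # 1000-green-thread pool
                self.run_server()

        def run_server(self):
            self.pool = self.create_pool()
            # ... accept connections and self.pool.spawn_n(...) ...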

Revision history for this message
Matt Riedemann (mriedem) wrote :

From logstash this is also a problem in the n-cpu (nova-compute) logs:

http://goo.gl/AFUOvW

And cinder-volume logs:

http://goo.gl/UGr6LA

In the cinder-volume case, it really started showing up around 7/11, so we might want to see what clients cinder is using and what changed on 7/11. I'm not seeing anything particularly interesting though:

https://github.com/openstack/cinder/commits/master

Nova logs really started spiking on 7/11 too, so maybe there was an external dependency update on 7/11 that is causing the big spike in logging of http connection pool warnings.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Doing a diff of pip freeze output on the same gate job from 7/10 and 7/11 shows that pecan was updated from 0.6.0 to 0.6.1, and there was a single bug fix in that release:

https://github.com/stackforge/pecan/commit/1a2933e23f13a18de5c004327282c85334a1b135

pecan doesn't use requests or urllib3 directly, but I'm wondering if it's doing something with the response that is causing connections to not get put back in the pool; see the notes in the requests docs here:

http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow

http://docs.python-requests.org/en/latest/user/advanced/#keep-alive

Revision history for this message
Matt Riedemann (mriedem) wrote :

The n-cpu and c-vol issues are from this change to python-glanceclient on 7/11 to use requests:

https://github.com/openstack/python-glanceclient/commit/dbb242b776908ca50ed8557ebfe7cfcd879366c8

That lines up with what we're seeing in the g-api logs with swiftclient and bug 1295812, and the times line up with logstash for n-cpu/c-vol logs and the fact they both use python-glanceclient.

summary: - HTTPConnectionPool is full warning should be a debug message
         + glanceclient is not handling http connection pools properly with python-requests
no longer affects: glance
Matt Riedemann (mriedem)
Changed in python-glanceclient:
importance: Undecided → Critical
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

glanceclient is setting stream=True when making the request, which according to this could be bad:

http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow

http://docs.python-requests.org/en/latest/user/advanced/#keep-alive

But it looks like that might be necessary for doing image transfer requests, so we can't just set stream=False; we should also make sure the connection is closed properly, which according to the requests docs means calling response.close() or using a closing context manager on the response.
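
For example, something along these lines (the URL is just a placeholder):

    import contextlib

    import requests

    url = 'http://127.0.0.1:9292/v2/images/<image-id>/file'
    with contextlib.closing(requests.get(url, stream=True)) as resp:
        # consume the image data; closing() then releases the
        # connection back to the urllib3 pool
        for chunk in resp.iter_content(chunk_size=65536):
            pass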

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in python-glanceclient:
status: Confirmed → In Progress
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Nikhil Komawar (nikhil-komawar) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is a ~clean run of the glanceclient fix to see if the warning is still showing up:

http://goo.gl/vuXx00

There were 75 hits of the warning across 4 jobs, mostly in n-cpu logs but also in c-vol logs.

I compared that to the same kind of run with another glanceclient change, 82126, and there were 66 hits, so it doesn't appear that the stream change really fixes the problem. I think it's still a good fix, though, given what the requests docs say about the stream kwarg and how it can lead to pool/connection issues.

I think after this we'll want to POC a change where we adjust pool_size in HTTPAdapter beyond the default 10: maybe double it at first, then try 100 and 1000 (the default eventlet greenpool size is 1000 on the server side), or maybe nproc * 10, since most projects (e.g. nova) now run nproc API workers.

Once we can figure out where the warnings start to drop off, we can make the pool size configurable for the client and then cinder and nova can set that according to their load needs when creating the client.
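
As a rough sketch of that POC (pool_maxsize=100 is just one of the candidate values above, not a settled choice):

    import requests

    session = requests.Session()
    # pool_connections: how many per-host pools to cache;
    # pool_maxsize: max persistent connections kept per host pool
    adapter = requests.adapters.HTTPAdapter(pool_connections=10,
                                            pool_maxsize=100)
    session.mount('http://', adapter)
    session.mount('https://', adapter)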

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another idea on the max pool size is to match our backing db max_connections values: for postgresql in the gate right now that's 200 (the default of 100 wasn't cutting it) and for mysql it's 151 (the default). We could try sizing the http request connection pool around those numbers, because if we top out connections in the db we're already shooting ourselves in the foot.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-glanceclient (master)

Fix proposed to branch: master
Review: https://review.openstack.org/107495

Matt Riedemann (mriedem)
Changed in python-glanceclient:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-glanceclient (master)

Reviewed: https://review.openstack.org/107080
Committed: https://git.openstack.org/cgit/openstack/python-glanceclient/commit/?id=68c1d1fbc614d11b95f2fd6da26fd669e40f4186
Submitter: Jenkins
Branch: master

commit 68c1d1fbc614d11b95f2fd6da26fd669e40f4186
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 15 07:58:34 2014 -0700

    Don't stream non-binary requests

    Setting stream=True with requests can lead to issues
    with not closing the connection so the urllib3 connection
    pool is not freed up, so only set stream=True if making
    a request with application/octet-stream content-type.

    See the body-content-workflow and keep-alive sections
    in the requests docs here for more information:

    http://docs.python-requests.org/en/latest/user/advanced/

    Note that commit dbb242b changed the response body_iter
    code to potentially return a six.StringIO object rather
    than the old ResponseBodyIterator class and since the
    images client code is not converting the body_iter into
    a dict using json.loads, we have to do that directly
    in the _request method where the body_iter is
    constructed.

    Co-authored-by: Flavio Percoco <email address hidden>

    Change-Id: I86572b69c4511f933c9676108190271874346302
    Partial-Bug: #1341777
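
The gist of the change, as a simplified sketch (the real glanceclient code differs in detail):

    import requests

    def _request(method, url, headers, **kwargs):
        # only stream when raw image data is involved; non-binary
        # (JSON) responses are read eagerly, so requests returns the
        # connection to the pool itself
        content_type = headers.get('Content-Type', 'application/json')
        stream = content_type == 'application/octet-stream'
        return requests.request(method, url, headers=headers,
                                stream=stream, **kwargs)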

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-glanceclient (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/107495
Reason: This change doesn't seem to make a difference, so hopefully someone has other ideas.

Matt Riedemann (mriedem)
Changed in python-glanceclient:
assignee: Matt Riedemann (mriedem) → nobody
status: In Progress → Confirmed
Revision history for this message
Flavio Percoco (flaper87) wrote :

Is this still happening?

Ian Cordasco (icordasc)
Changed in python-glanceclient:
assignee: nobody → Ian Cordasco (icordasc)
Revision history for this message
Ian Cordasco (icordasc) wrote :
Changed in python-glanceclient:
status: Confirmed → In Progress
Ian Cordasco (icordasc)
Changed in python-glanceclient:
milestone: none → v0.16.0
Revision history for this message
Ian Cordasco (icordasc) wrote :

I've marked this as affecting glance-store as well. See my comment on patchset 4: https://review.openstack.org/#/c/147678/

Changed in glance-store:
assignee: nobody → Ian Cordasco (icordasc)
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Ian Cordasco (icordasc) wrote :

Looking into this further, it doesn't appear to be the swift driver for glance-store causing this. The lines immediately following the warning come from swiftclient, so I'm going to mark this as invalid for glance-store but add python-swiftclient and dig into that right now.

Changed in glance-store:
status: Triaged → Invalid
Changed in python-swiftclient:
assignee: nobody → Ian Cordasco (icordasc)
Ian Cordasco (icordasc)
Changed in python-swiftclient:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-swiftclient (master)

Fix proposed to branch: master
Review: https://review.openstack.org/149043

Revision history for this message
Ian Cordasco (icordasc) wrote :

swiftclient had a similar issue when streaming response downloads. The proposed fix in https://review.openstack.org/#/c/149043/ should take care of this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-glanceclient (master)

Reviewed: https://review.openstack.org/147678
Committed: https://git.openstack.org/cgit/openstack/python-glanceclient/commit/?id=f2107512ee4be7bfbc1aaefdef12e1cba1147777
Submitter: Jenkins
Branch: master

commit f2107512ee4be7bfbc1aaefdef12e1cba1147777
Author: Ian Cordasco <email address hidden>
Date: Thu Jan 15 16:09:23 2015 -0600

    Close streamed requests explicitly

    If we don't explicitly close a response after streaming its download,
    then we can run into HTTPConnectionPool full warnings. It also will hurt
    performance if we have to continuously create new sockets for new
    responses. Calling close will return the connection to the pool so it
    can be reused. Note this is only necessary when streaming a response. If
    we don't stream it, then requests will return the connection to the pool
    for us.

    Change-Id: I803bd4dd0e769c233501d5e5ff07a19705fbe233
    Closes-bug: 1341777

Changed in python-glanceclient:
status: In Progress → Fix Committed
Changed in python-glanceclient:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-swiftclient (master)

Reviewed: https://review.openstack.org/149043
Committed: https://git.openstack.org/cgit/openstack/python-swiftclient/commit/?id=8756591b0ad24320a0b6fa3b2f7513e17ec46b6c
Submitter: Jenkins
Branch: master

commit 8756591b0ad24320a0b6fa3b2f7513e17ec46b6c
Author: Ian Cordasco <email address hidden>
Date: Wed Jan 21 14:38:02 2015 -0600

    Release connection after consuming the content

    When using stream=True, we need to make sure we release a connection
    back to the connection pool to avoid warnings clogging up the logs. We
    can do this by releasing the connection once we've read all the content
    from the response. In situations where stream=False, requests already
    does this for us.

    Related-bug: 1424732
    Closes-bug: 1341777
    Change-Id: Id1c72ad86135a49d4b985992d788736b65f7dbda
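
The pattern, roughly (names here are illustrative rather than swiftclient's actual code):

    def _iter_response(resp, chunk_size=65536):
        # stream the body to the caller, then close the response so
        # urllib3 gets the socket back instead of discarding it
        for chunk in resp.iter_content(chunk_size):
            yield chunk
        resp.close()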

Changed in python-swiftclient:
status: In Progress → Fix Committed
Changed in python-swiftclient:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-swiftclient 2.4.0

This issue was fixed in the openstack/python-swiftclient 2.4.0 release.
