OpenStack Object Storage (swift)

Client may hold socket open after ChunkWriteTimeout

Bug #1572719 reported by Chris Auston on 2016-04-20

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Object Storage (swift)	Confirmed	High	Unassigned
	OpenStack Security Advisory	Won't Fix	Undecided	Unassigned

Bug Description

You can reproduce this by issuing a GET request for a few hundred MB file and never consuming the response, but keep the client socket open. Swift will log a 499 but the socket does not always close.

ChunkWriteTimeout is meant to protect the proxy from a slow reading client:
https://github.com/openstack/swift/blob/master/swift/proxy/controllers/base.py#L889-L905

Sometimes when this exception is thrown there is still data in the process socket buffer, so when eventlet tries to close the socket it first flushes it:
https://github.com/eventlet/eventlet/blob/master/eventlet/wsgi.py#L631
https://hg.python.org/cpython/file/v2.7.11/Lib/SocketServer.py#l711

The problem is that if the client is not consuming the socket buffer then that flush will wait forever; it's trying to write on a socket that just threw a timeout trying to write! The flush write is not protected by any timeout.

As far as I can tell, this WRITE_TIMEOUT does nothing:
https://github.com/openstack/swift/blob/master/swift/common/wsgi.py#L407

wsgi.server() takes a socket_timeout that might be what we're after?
https://github.com/openstack/swift/blob/master/swift/common/wsgi.py#L437-L440

Even with socket_timeout, eventlet needs to be patched. This should be in a finally block:
https://github.com/eventlet/eventlet/blob/master/eventlet/wsgi.py#L636-L637

All of this is probably mitigated by most operators setting an idle timeout in their load balancers, but I wanted to report it. Going directly to a proxy I was able to hold sockets open for long periods of time.

I did the initial research on version 2.2.2 but I was able to reproduce on 2.7.0. I'm trying to translate links to master branch on github. I apologize in advance if it's not quite right.

See original description

Tags:

Revision history for this message

Jeremy Stanley (fungi) wrote on 2016-04-20:

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

description:	updated
Changed in ossa:
status:	New → Incomplete

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-04-30:

I've subscribed Victor to help us investigate/confirm the eventlet issue. If this can't be entirely fixed with-in swift source code, we'll need an eventlet fix and ideally wait for new releases of the eventlet library.

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-05-02:

Note that it is similar to bug 1466549, but instead of closing the connection quickly you keep it open without reading it... Though it's not obvious if this need to be fixed in swift source code...

@swift-coresec please triage this bug report.

Revision history for this message

clayg (clay-gerrard) wrote on 2016-05-02:

It seems likely the fix is to move the connection.close into the finally block in eventlet.wsgi.HttpProtocol.finish

We should be able to test this by implementing the fix in a subclass since HttpProtocol is already over-ride-able in wsgi.server:

https://github.com/eventlet/eventlet/blob/fb067b63c705c5bc345047f545361a6fad53bbfc/eventlet/wsgi.py#L770

If we fix it in our code it's a) backportable but b) disclosing the security issue in eventlet

Does anyone know how to reach out to the eventlet team regarding some sort of shared disclosure?

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-05-16:

Clayg, it seems like eventlet is using github to track issues, and afaik it doesn't support private report. Thus we could either contact the main contributors by mail to coordinate a new release of eventlet, or we could also make this bug report public and figure this out in the open.

Revision history for this message

Victor Stinner (vstinner) wrote on 2016-05-16:

"Thus we could either contact the main contributors by mail to coordinate a new release of eventlet, or we could also make this bug report public and figure this out in the open."

Having an embargo makes contribution harder, I would prefer to use a regular and public eventlet issue to benefit of their Travis CI.

Revision history for this message

Grant Murphy (gmurphy) wrote on 2016-05-16:

It would be nice to have a concrete mitigation before fixing in the open. Has anybody had a chance to explore this statement a little more?

"All of this is probably mitigated by most operators setting an idle timeout in their load balancers.."

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-05-30:

I subscribed eventlet top contributors (Sergey Shepelev and Jakub Stasiak). It seems like eventlet is unable to close a socket if it has un-read data and this can be used to cause resources exhaustion for daemon like swift.

Revision history for this message

Sergey Shepelev (temoto) wrote on 2016-06-01:

Hello, Eventlet here.
Thanks for contacting.
I'm writing a public issue about this in our Github project, without mentioning Swift or any other projects.
Fix is ready, it's a bit hard to make a test, then I'll upload new release with this fix.

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-06-01:

#10

That's very appreciated Sergey, thank you!

@swift-coresec, do we want to implement the swift work-around as a swift subclass or should we wait for a requierements bump to benefit from the eventlet fix ?

Anyway, once eventlet is patched, do you mind if we make that bug public ?

Revision history for this message

clayg (clay-gerrard) wrote on 2016-06-01:

#11

I think this is medium impact at best - it's very much like many other slow resource exhaustion from ill behaved clients or bad requests we've squashed in the past. Generally these are mitigated in public environments with monitoring and alerting - although I'm sure they can be annoying!

Once the eventlet issue is public - and especially following a release - this should of course become public!

I'd prefer to eat the upstream version rev than to try and implement a hacky work-around in our tree.

... IMHO

Revision history for this message

Sergey Shepelev (temoto) wrote on 2016-06-01:

#12

I tried and then I tried again but was unable to construct a test case that would actually leak file descriptor. It's all this silly line to blame: https://github.com/eventlet/eventlet/commit/30fd49f44b5e6edeff5fa45e569e91b8ad1a5841 And while moving that `shutdown+close` under `finally` actually seems a good idea, but is blurred by the fact that there is no internal use case to explain that change.

So my guess: somewhere somehow like `class ChunkWriteError(Timeout)`, the actual `socket.timeout` mutates to another exception which is not expected by Eventlet wsgi server.

Taking this long proves that these problems are at the limit of my cognitive capacity, so if more clever people see that this issue could actually be cured at Eventlet level, I'm glad to release fix.

Otherwise, I suggest adding another parent class: `socket.timeout` to your chunk errors or explicitly catch chunk errors and force socket close.

Revision history for this message

Chris Auston (cfauston) wrote on 2016-06-01:

#13

It's been awhile since I looked at this but IIRC the ChunkWriteTimeout triggers a GeneratorExit or two in proxy_logging middleware and probably _get_response_parts_iter. I can't quite remember the call sequence, but that leads to wsgi trying to close the client socket and triggering the flush.

When I used wsgi.server(..., socket_timeout=30, ...) a socket.timeout was thrown but still didn't close the socket which was why I suggested the finally block.

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-06-13:

#14

Based on comment#12, it seems like this bug can be safely made public. If nobody object, let's do this by the end of the week.

Thank you Sergey for the investigation!

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-06-27:

#15

Opening this bug based on above comments

description:	updated
tags:	added: security
information type:	Private Security → Public

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2016-10-05:

#16

I was able to reproduce that issue on Mitaka (openstack-swift-2.7.0-2.el7.noarch and python2-eventlet-0.17.4-4.el7.noarch) using python socket module to open many GET connections without consuming the response.

Extracts from swift logs:
proxy-server[14729]: ERROR with Object server 192.168.240.13:6000/swiftloopback re: Trying to GET /v1/AUTH_e90d6d7d952f4a6d9933047b461015ff/test/large_file: Timeout (10.0s) (txn: tx706ead3ee0d64af2ba382-0057f4691c)
proxy-server[14729]: Object GET returning 503 for [] (txn: tx706ead3ee0d64af2ba382-0057f4691c) (client_ip: 192.168.240.14)

After enough connections have failed with this 503 errors (about 8000 for my all-in-one instance), all swift proxy-server processes are unresponsive until the offending client is terminated, or the service restarted.

Revision history for this message

clayg (clay-gerrard) wrote on 2016-12-01:

#17

this is probably *related* to lp bug #1568650 - but not exactly the same.

The same repo script [1] with a significantly slow enough client (> 60 seconds by default) will trigger the ChunkWriteTimeout. The script as written just simulates a really *slow* client; not necessarily bad behaved and trying to wreck things - but with high enough sleep you could simulate a Slowloris/DOS if that's what you're going for.

In my most recent test the proxy seemed to square it's backend connections, but the socket to the client remained established.

1. https://gist.github.com/clayg/2458a4a7e1451c75fbc5a63fcae11635

Changed in swift:
status:	New → Confirmed
importance:	Undecided → High

Revision history for this message

Tristan Cacqueray (tristan-cacqueray) wrote on 2017-01-18:

#18

Having another look at that issue, it sounds like slow client shouldn't be handled by OpenStack services but rather with a load balancer, especially if the service is Internet facing.

I initially thought the problem would happen between the proxy and the object server even after the client got disconnected, but this is not the case since all the sockets are effectively released when the client is disconnected.

Since there is already a Security Note in the process to cover load balancers usage in front of OpenStack service, I suggest we close the ossa task and triage this report as a class D according to the VMT taxonomy ( https://security.openstack.org/vmt-process.html#incident-report-taxonomy )

Revision history for this message

Jeremy Stanley (fungi) wrote on 2017-01-18:

#19

I agree with this being class D (security hardening opportunity). Since the report is already public, we can just go ahead and switch the security advisory task to won't fix, then change it back later if there are objections. I've done so now.

Changed in ossa:
status:	Incomplete → Won't Fix

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.