Better support for new twisted.web.client.Agent in s3

Bug #767205 reported by Drew Smathers
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
txAWS
In Progress
Medium
Drew Smathers

Bug Description

s3 currently uses getPage() for all operations. This can be detrimental to performance for doing things like uploading large files - the entire file has to be loaded into memory and written to the stream at once. getPage() also makes it harder (impossible) to do multi-part upload or chunked transfer encoding on a receiving end.

I would like to deprecate getPage() and write an adaptor layer to prefer the new Agent if available and downgrade to getPage() otherwise.

The initial implementation would provide simple String-based and File-based IBodyProducer implementations.

Related branches

Revision history for this message
Jamu Kakar (jkakar) wrote :

This sounds like a nice plan to me.

Changed in txaws:
importance: Undecided → Medium
milestone: none → 0.2
Revision history for this message
Duncan McGreggor (oubiwann) wrote :

Drew, it's been ages! And nice to see your bug report -- I'm with Jamu, this sounds great!

Revision history for this message
Drew Smathers (djfroofy) wrote :

Cool. So I will proceed. :)

Thomas Herve (therve)
Changed in txaws:
milestone: 0.2 → 0.3
Revision history for this message
Zooko Wilcox-O'Hearn (zooko) wrote :

I think this might allow me to do something that I want for Tahoe-LAFS, which is to upload an object to S3 which is too large to fit into RAM, where I don't know the end of the object when I upload the beginning of the object, and where I don't even know the size of the object when I upload the beginning of the object. Yes, please!

Revision history for this message
Daira Hopwood (daira) wrote :

This is very much desired for Tahoe-LAFS' S3 backend.

Revision history for this message
Drew Smathers (djfroofy) wrote :

Yes, I plan on resuming work on this week since I'd like to get this working for a project at work which is still uploading files through txaws only by dent of some dirty monkey patches.

I began on this branch (https://code.launchpad.net/~djfroofy/txaws/agent-767205) but hit a wall b/c my approach smelled like ripe garbage for some reason.

In sum, it's really hard to needle this in while retaining backwards compatibility and still have a meaningful interface.

Problems:

1. Meaningfulness of existing APIs

Here's the current put_object API:

     def put_object(self, bucket, object_name, data, content_type=None,
                   content_length=None, metadata={}, amz_headers={}):
         ...

It's confusing from an API standpoint to overload "data" (which arguably just sounds "a str" to most devs). Possible solutions might be making a separate API. example:

    def produce_object(self, bucket, object_name, producer=None, content_type=None,
          content_md5=None, content_length=None, metadata=None, amz_headers=None)

Or optionally add more keyword arguments to put_object (seems to make things convoluted that way).

2. Custom Producers

Right now my work is hard coded to one producer but I don't like this. Who knows how the data gets produced? Filesystem? AMQP? 0MQ? pyaudio? Producers should be pluggable through either extension of the Query class or method parameterization.

3. Computing md5

Right now this internal to implementation - user can't control - and string based. I've done some weird stuff to handle both strings and maybe a producer - I think the user should be allowed to compute the md5 externally (caching for example with md5 file, for example) or a least provide a producer that also implements some known interface (IMD5Generator.generateMD5, or something)

4. _newclient.py, _oldclient.py makes me feel ill, somehow

The only logical route I could see to basing things on Agent was to essentially rewrite the base Query. The implicit upgrading (if user has Agent makes me slightly nervous). Should we default to using new Query for other calls that don't need streaming? Warn the user the eventually a modern version of twisted will be required and thus deprecate Query in _old_client?

Feedback Please

The above are all essentially my desiderata for the Agent-based s3 interface for uploading objects. I consider multipart uploads etc outside of the scope but would probably implement something for that soon after agreeing on interface for using Agent and implementing. I'd like to know other's thoughts on the ideal interface or safe ways to replacing he current getPage()-like implementation with Agent.

Revision history for this message
Drew Smathers (djfroofy) wrote :

Discussion on #twisted with idnar:

16:01 < idnar> anyhow, so, what I was thinking re: txAWS is
16:01 < idnar> it might be better to first replace getPage with Agent, without changing the txAWS API at all
16:02 < idnar> and then once that's done, look at extending the API to allow streaming uploads / downloads too
16:05 < djfroofy> idnar: ok, so first replace w/out fancy body producer, then add support for fancier producers (maybe just FS based) and corresponding apis as separate
                  feat?
16:05 < idnar> djfroofy: yeah, I think so
16:05 < djfroofy> idnar: but "optionally" replace at runtime is what we should do right? Agent support is pretty modern.
16:05 < idnar> hmm
16:06 < idnar> I'd lean toward just requiring a newish Twisted
16:06 < itamar> it's been in since ... 10.1 at least?
16:06 < itamar> since 9.0, in fact
16:06 < idnar> since 9.0, but a lot of the APIs are newer than that
16:06 < itamar> right
16:06 < idnar> eg. FileBodyProducer or whatever
16:07 < idnar> anyhow, you might want to raise the issue of backward compatibility on the mailing list, but I'd vote for just requiring 10.1 or 11.0 or whatever
16:07 < idnar> I can't remember, does Agent have 100-continue support yet?
16:07 < idnar> anyhow, that's an example of something we could take advantage with without changing the txAWS APIs at all
16:08 < idnar> and smaller branches are better than bigger branches, of course
16:08 < idnar> although I gather that for your purposes, using Agent without streaming support wouldn't help you much
16:08 < djfroofy> idnar: yes, this was turning into a bigger branch than i liked
16:09 < djfroofy> do we want to change the BaseQuery in txAWS to use Agent or just the extension used by s3?
16:10 < idnar> change BaseQuery, I think

Revision history for this message
Daira Hopwood (daira) wrote :

I mildly prefer adding new methods to overloading using keyword arguments.

It's OK from our perspective to depend on a version of Twisted that has twisted.web.client.Agent (that would be Twisted >= 9.0, released in December 2009, and it would be alright for it for be a bit later than that if there were bugs / missing features in the 9.0 implementation of Agent).

It's not good to be restricted to one producer implementation. I don't understand why supporting arbitrary producers would require extending Query, though. Shouldn't the new put method be returning an IConsumer, so that the client can call registerProducer on it with its own choice of producer? (http://twistedmatrix.com/documents/11.0.0/core/howto/producers.html)

In our case (Tahoe-LAFS), the client doesn't know the Content-MD5 before it starts sending the data. Do you know if there is a way to tell S3 what the Content-MD5 should have been at the end of the transfer? (For Tahoe-LAFS this is not an integrity issue because we have other hashes protecting integrity, but it is a data preservation issue if we end up thinking that we've stored the data correctly on a particular server when we haven't.)

Revision history for this message
Daira Hopwood (daira) wrote :
Revision history for this message
Tristan Seligmann (mithrandi) wrote :

FileSender implements IProducer, not IBodyProducer (the latter has an additional length attribute; see the interface definition for details).

Revision history for this message
Drew Smathers (djfroofy) wrote :

Also twisted.web.client now has a FileBodyProducer which did not exist when I started my branch. I do not plan on continuing with agent-767205 but starting fresh (using FileBodyProducer from twisted if available). Concerning MD5, since this is optional to s3 so far as I an tell, I'd like the interface also to not require this - but allow it to be passed optionally in cases where it is known upfront.

Revision history for this message
Drew Smathers (djfroofy) wrote :

Hey ya'll. Had a really big project I was working on the past few months which took some time away from this. This has come up again at work though since we need this for a file uploader daemon.

Looking back at the branch I started I think I took the wrong approach. I also think we need a ticket first to do something more low-level: essentially modernize the client portion (txaws.client.base) to use new client features.

I'll open a new ticket to cover this. Look forward to discussing with anyone interested.

Revision history for this message
Duncan McGreggor (oubiwann) wrote :

On 31 Jan 2012 - 16:06, Drew Smathers wrote:
> Hey ya'll. Had a really big project I was working on the past few months
> which took some time away from this. This has come up again at work
> though since we need this for a file uploader daemon.
>
> Looking back at the branch I started I think I took the wrong approach.
> I also think we need a ticket first to do something more low-level:
> essentially modernize the client portion (txaws.client.base) to use new
> client features.

Sweet.

> I'll open a new ticket to cover this. Look forward to discussing with
> anyone interested.

Looking forward to it, Drew!

Drew Smathers (djfroofy)
Changed in txaws:
assignee: nobody → Drew Smathers (djfroofy)
status: New → In Progress
Revision history for this message
Daira Hopwood (daira) wrote :

Hi Drew. Has the new ticket been opened?

Support for streaming S3 GETs and PUTs has just increased in urgency for us (Tahoe-LAFS and Least Authority Enterprises) -- within the next few days we'll need to make a decision about whether to help with your branches or work on our own implementation. (If we do the latter then I'd like to contribute it back to txaws if possible, but can't guarantee that will happen.) Can you summarise the current state of your branches?

Revision history for this message
Drew Smathers (djfroofy) wrote :

Hi David-Sarah. The new ticket has indeed been opened and there's a branch there ready for review.

https://bugs.launchpad.net/txaws/+bug/924459

Tangentially, I chatted with Zooko and he mentioned you might not be interested multipart upload ops anymore, but it you are, I also have numerous branches related to MP that builds on this. Also ready for review. ;-)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.