HTML5 File API uploader that calculates content hash client side

Bug #719740 reported by Jason Gerard DeRose
This bug affects 1 person
Affects: Dmedia
Status: Fix Released
Importance: High
Assigned to: Jason Gerard DeRose

Bug Description

Large file uploads (or downloads) over HTTP are very likely to incur data errors that TCP doesn't detect (from personal experience uploading to S3, I know this all too well).

I'm pretty sure browsers aren't calculating an md5sum for uploaded files (need to confirm). Regardless, it would be much nicer if the actual dmedia-style content-hash (sha1 hash-list) could be calculated client side, so that uploads can be verified with 8 MiB granularity and robustly resumed.
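To make the hash-list idea concrete, here is a minimal Python sketch. It assumes the top-level hash is simply the sha1 of the concatenated leaf digests; that is an illustrative choice, not necessarily the dmedia encoding (see the filestore.py link further down for the real thing). The names `hash_list` and `LEAF_SIZE` are hypothetical.

```python
import hashlib
import io

LEAF_SIZE = 8 * 1024 * 1024  # 8 MiB, the granularity mentioned above


def hash_list(fp, leaf_size=LEAF_SIZE):
    """Return (top_hash_hex, leaf_digests) for a binary file object.

    Assumption: the top hash is sha1 over the concatenated leaf digests.
    The actual dmedia content-hash may be computed differently.
    """
    leaves = []
    while True:
        chunk = fp.read(leaf_size)
        if not chunk:
            break
        # One sha1 digest per leaf, so each leaf can be verified alone
        leaves.append(hashlib.sha1(chunk).digest())
    top = hashlib.sha1(b"".join(leaves)).hexdigest()
    return top, leaves
```

Because every leaf has its own digest, a corrupted leaf can be detected and re-sent without restarting the whole transfer, which is what makes resuming practical.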

Fortunately, the HTML5 File API lets us do this easily. Probably some good starting points:

https://developer.mozilla.org/en/using_files_from_web_applications

http://hacks.mozilla.org/category/fileapi/

This seems to be one of the nicer modern, maintained JavaScript sha1 implementations:

http://code.google.com/p/crypto-js/

(Mind blowing that ECMAScript 5 didn't add native md5/sha1.)

For details on how the dmedia content-hash is calculated, see:

http://bazaar.launchpad.net/~dmedia/dmedia/trunk/view/head:/dmedia/filestore.py?start_revid=161#L294

Changed in dmedia:
status: Triaged → In Progress
Jason Gerard DeRose (jderose) wrote :

Okay, little update... crypto-js apparently has serious bugs when hashing binary data and produces incorrect results.

Now I'm using the same sha1 implementation that CouchDB uses for Futon - http://pajhome.org.uk/crypt/md5/sha1.html

That one actually works.

Jason Gerard DeRose (jderose) wrote :

Another update... I was worried this was going to be crazy slow, but then I remembered that having Firebug enabled means no Firefox JIT. Performance isn't awesome, but it's probably bearable:

For a 20 MB file:

With Firebug enabled - 42.261 seconds
With Firebug disabled - 3.579 seconds
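For a sense of scale, those timings can be turned into throughput figures. This is only arithmetic on the numbers above, taking the file size as 20 MB exactly as stated; the helper name `throughput` is made up for the illustration.

```python
FILE_MB = 20            # file size as reported above
FIREBUG_ON_S = 42.261   # seconds with Firebug enabled
FIREBUG_OFF_S = 3.579   # seconds with Firebug disabled


def throughput(size_mb, seconds):
    """Hashing rate in MB per second."""
    return size_mb / seconds


rate_on = throughput(FILE_MB, FIREBUG_ON_S)
rate_off = throughput(FILE_MB, FIREBUG_OFF_S)
slowdown = FIREBUG_ON_S / FIREBUG_OFF_S
```

That works out to roughly 0.47 MB/s with Firebug enabled versus roughly 5.6 MB/s without it, a slowdown of about 12x.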

Changed in dmedia:
assignee: nobody → Jason Gerard DeRose (jderose)
Changed in dmedia:
milestone: 0.4 → 0.5
Changed in dmedia:
status: In Progress → Fix Committed
Jason Gerard DeRose (jderose) wrote :

Okay, all done. To complete the bug report, this is what I wrote for the merge request:

Okay, before this becomes even more epic, I'm calling this a good point to merge this.

As the sha1 hashing in JavaScript is quite slow, computing the full dmedia-style content hash prior to beginning an upload won't be fun for the user, so we compute leaf by leaf as we upload. The leaf_size is configurable and is supplied by the server... it need not match the dmedia leaf size.
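The leaf-by-leaf approach can be sketched roughly as follows. This is Python rather than the uploader's JavaScript, and `iter_leaves` is a hypothetical name; the point is only that each leaf is hashed right before it is sent, so nothing has to be hashed up front.

```python
import hashlib
import io


def iter_leaves(fp, leaf_size):
    """Yield (index, sha1_hex, data) one leaf at a time.

    Only one leaf is in memory at once, and hashing is interleaved with
    whatever the caller does with each leaf (for example, uploading it).
    `leaf_size` stands for the value supplied by the server.
    """
    index = 0
    while True:
        data = fp.read(leaf_size)
        if not data:
            break
        yield index, hashlib.sha1(data).hexdigest(), data
        index += 1
```

A caller would loop over `iter_leaves(fp, leaf_size)` and send each leaf together with its digest, so the wait for hashing is spread across the upload instead of front-loaded.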

So the original focus changed somewhat... rather than trying to replicate parts of the dmedia protocol in JavaScript, the focus became just building a resumable HTML5 file uploader with strong data integrity guarantees.

To test, start the test server:

  ./test-server.py

And then point your browser to:

  http://localhost:8000/

Choose a largish file (say 50 MB or so), and watch it go. If you have Firebug installed, disable it, as it will make the hashing painfully slow. But Firefox can be very fast... on my system (64-bit Natty), Firefox 4 is considerably faster than Chromium 10 (3 seconds vs 5 seconds for the file I keep testing with).

The test server doesn't do anything useful like actually save the file (that's what Martin Owens' code is for), but the test server can do one very important thing... simulate failure!

Run the test server with random failures enabled like this:

  ./test-server.py --fail

This will simulate corrupted leaf uploads (in which case a status of "412 Precondition Failed" is returned) and simulate the server losing track of your multipart upload altogether (in which case a "409 Conflict" is returned).
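One way a client might react to those two failures is sketched below. Everything here is an assumption made for illustration: `FakeServer` is an in-memory stand-in (not test-server.py), 200 is assumed as the success status, and the policy of resending the same leaf on 412 and starting over on 409 is a guess at sensible behaviour, not a description of the actual uploader.

```python
import hashlib

OK = 200                   # assumed success status
CONFLICT = 409             # server lost track of the upload
PRECONDITION_FAILED = 412  # leaf arrived corrupted


class FakeServer:
    """Hypothetical in-memory server with scripted failures.

    `script` lists statuses to return for successive requests; once it
    is exhausted the server just verifies and stores each leaf.
    """

    def __init__(self, script=()):
        self.script = list(script)
        self.leaves = {}

    def put_leaf(self, index, digest, data):
        if self.script:
            status = self.script.pop(0)
            if status == CONFLICT:
                self.leaves.clear()  # forget the whole upload
            if status != OK:
                return status
        if hashlib.sha1(data).hexdigest() != digest:
            return PRECONDITION_FAILED
        self.leaves[index] = data
        return OK


def upload(server, leaves, max_attempts=10):
    """Send every leaf and return the number of requests made."""
    attempts = 0
    index = 0
    while index < len(leaves):
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("too many failed attempts")
        data = leaves[index]
        digest = hashlib.sha1(data).hexdigest()
        status = server.put_leaf(index, digest, data)
        if status == OK:
            index += 1   # leaf verified, move on
        elif status == PRECONDITION_FAILED:
            continue     # resend the same leaf
        elif status == CONFLICT:
            index = 0    # start again from the first leaf
        else:
            raise RuntimeError("unexpected status %r" % status)
    return attempts
```

With three leaves and a script of one 412 followed later by one 409, the client still ends up with all three leaves stored; it just makes more requests than the three a clean run would need.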

dmedia is still highly broken because of a GObject Introspection issue in Natty, so you *cannot* run the full test suite... it will fail before making it to a single test. However, you can run just the tests related to the uploader like this:

  ./setup.py test --names=test_scripts

This is testing the JavaScript using my new magical JavaScript => Python tester.

Woot.

Changed in dmedia:
status: Fix Committed → Fix Released