Use Python 3.3 os.posix_fadvise()

Bug #898957 reported by Jason Gerard DeRose
This bug affects 1 person
Affects     Status          Importance   Assigned to           Milestone
FileStore   Fix Released    High         Jason Gerard DeRose

Bug Description

Python 3.3 is going to be awesome. Among the new awesomeness is os.posix_fadvise():

http://docs.python.org/dev/library/os.html#os.posix_fadvise

This is great for filestore, as mostly what it does is sequential passes through entire files (importing, copying from one store to another, verifying, etc). This should allow us to get similar (or even higher) read IO utilization without using such a deep queue, which will reduce memory usage during imports (always nice).

My hunch is we'll still get the best results with 2 threads, but perhaps the work should be reorganized a bit so that the producer thread reads and hashes, while the consumer thread writes and saves metadata to CouchDB.

Currently the producer just reads, while the consumer hashes, writes, and saves metadata. This was needed so that the producer would *always* be reading during an import so we could get the read IO utilization up very near the theoretical maximum (which we have).
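
A minimal sketch of the proposed split (producer reads and hashes, consumer writes), for illustration only. It assumes a plain SHA-1 from hashlib as a stand-in for filestore's actual hashing, and the names copy_and_hash, QUEUE_DEPTH, and CHUNK_SIZE are hypothetical, not taken from the filestore code. Saving metadata to CouchDB is left out.

```python
import hashlib
import queue
import threading

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB chunks (assumed for illustration)
QUEUE_DEPTH = 4               # hypothetical shallow queue depth


def producer(src_path, q, result):
    """Read and hash in one thread, handing each chunk to the consumer."""
    h = hashlib.sha1()  # stand-in hash, NOT the real filestore hashing
    try:
        with open(src_path, 'rb') as src:
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                h.update(chunk)
                q.put(chunk)  # blocks when the queue is full, bounding memory
        result['digest'] = h.hexdigest()
    finally:
        q.put(None)  # sentinel: always tell the consumer we are done


def consumer(dst_path, q):
    """Write chunks as they arrive; metadata saving would also go here."""
    with open(dst_path, 'wb') as dst:
        while True:
            chunk = q.get()
            if chunk is None:
                break
            dst.write(chunk)


def copy_and_hash(src_path, dst_path):
    """Run the producer in a second thread and the consumer in this one."""
    q = queue.Queue(maxsize=QUEUE_DEPTH)
    result = {}
    t = threading.Thread(target=producer, args=(src_path, q, result))
    t.start()
    consumer(dst_path, q)
    t.join()
    return result['digest']  # KeyError here means the producer failed
```

The bounded queue is what keeps memory usage down: at most QUEUE_DEPTH chunks are in flight at any time.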

Python 3.3 is due out August 2012, so we should be able to land this in the Novacut 12.08 release.

Revision history for this message
Jason Gerard DeRose (jderose) wrote :

I re-targeted this for 12.06 as Python 3.3 beta1 comes out in June, so that would be a good time to implement this and start serious testing with Python 3.3.

http://www.python.org/dev/peps/pep-0398/#id2

The fallback for Python 3.2 will be to not use fadvise, same as we do currently.
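
One way the fallback could look, as a rough sketch: check for the function at runtime and skip the hint when it is missing. The helper name advise_sequential is hypothetical, not something from the filestore code.

```python
import os


def advise_sequential(fd):
    """Hint sequential access when the platform supports it.

    Returns True if the advice was issued, False if we fell back.
    """
    if not hasattr(os, 'posix_fadvise'):
        return False  # Python 3.2, or a platform without fadvise: do nothing
    # offset=0 with len=0 means "from the start to the end of the file"
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return True
```

Because the advice is only a hint to the kernel, skipping it changes performance, not behavior.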

Changed in filestore:
milestone: none → 12.06
Changed in filestore:
milestone: 12.06 → 12.07
Changed in filestore:
milestone: 12.07 → 12.08
Changed in filestore:
milestone: 12.08 → 12.09
Changed in filestore:
milestone: 12.09 → 12.10
Changed in filestore:
milestone: 12.10 → 12.11
Revision history for this message
Jason Gerard DeRose (jderose) wrote :

Initial experiments with os.posix_fadvise() are very promising. My test script reads through files in 8 MiB chunks and does nothing else with them (no hashing, etc). So in theory, there's not a lot of "down-time", so to speak, when Python isn't reading, which is when you'd expect os.POSIX_FADV_SEQUENTIAL to provide a big benefit.

Even so, my test jumped from 26.4 MB/s to 28.9 MB/s. During an actual Dmedia import, I'd expect larger gains because Python is busy doing lots of other stuff.
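
The actual test script isn't attached to this report, but a read-only benchmark along these lines might look roughly like the following. The function name timed_read and its use_fadvise flag are assumptions made for illustration.

```python
import os
import time

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, as in the test described above


def timed_read(path, use_fadvise=True):
    """Read a file start to finish; return (bytes_read, seconds)."""
    start = time.monotonic()
    total = 0
    # buffering=0 gives unbuffered reads straight from the file descriptor
    with open(path, 'rb', buffering=0) as fp:
        if use_fadvise and hasattr(os, 'posix_fadvise'):
            # Tell the kernel we will read the whole file front to back
            os.posix_fadvise(fp.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = fp.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)  # discard the data; we only measure reading
    return total, time.monotonic() - start
```

Throughput in MB/s is then total / seconds / 1e6. The two runs (with and without the advice) are only comparable if the file is not already in the page cache.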

Revision history for this message
Jason Gerard DeRose (jderose) wrote :

Before you run the benchmark, be sure to clear the page cache like this:

sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches

Changed in filestore:
status: Triaged → In Progress
assignee: nobody → Jason Gerard DeRose (jderose)
Revision history for this message
Jason Gerard DeRose (jderose) wrote :

Okay, after some more careful benchmarking with batch_import_iter(), I'm getting about a 9% performance improvement, which is quite impressive for such a simple change.

Revision history for this message
Jason Gerard DeRose (jderose) wrote :

Wow, now using os.POSIX_FADV_WILLNEED, I'm up to a 17% performance improvement.

31.2 MB/s over USB2.
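
This comment doesn't say how os.POSIX_FADV_WILLNEED was applied. One plausible pattern, shown here purely as an assumption, is to ask the kernel to start loading the next chunk while the current one is being processed. The name read_with_prefetch is hypothetical.

```python
import os

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB


def read_with_prefetch(path, chunk_size=CHUNK_SIZE):
    """Yield chunks of a file, hinting the kernel to prefetch one chunk ahead."""
    can_advise = hasattr(os, 'posix_fadvise')
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            if can_advise:
                # Ask for the chunk AFTER the one we are about to read, so it
                # can load while the caller hashes or writes the current one.
                os.posix_fadvise(fd, offset + chunk_size, chunk_size,
                                 os.POSIX_FADV_WILLNEED)
            data = os.read(fd, chunk_size)
            if not data:
                break
            yield data
            offset += len(data)  # os.read may return fewer bytes than asked
    finally:
        os.close(fd)
```

A caller would simply iterate: for chunk in read_with_prefetch(path): ... and do its hashing or writing inside the loop.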

Changed in filestore:
status: In Progress → Fix Committed
Changed in filestore:
status: Fix Committed → Fix Released