Add location record, add locations and priority to file record

Bug #680467 reported by Jason Gerard DeRose on 2010-11-23
This bug affects 2 people
Affects: Dmedia
Importance: High
Assigned to: Jason Gerard DeRose

Bug Description

The metadata for an entire dmedia library is always available locally in CouchDB, but specific media files may not be. Tracking the locations where a given file is stored is probably the most important thing dmedia does. This information will be used to drive three types of dmedia background tasks:

1) Downloading - files that aren't available locally are downloaded from other locations
2) Uploading - newly imported files are uploaded to other locations
3) Reclaiming - files with sufficient durability (enough copies stored in other locations) or low priority (e.g. a proxy file that can be rendered again) can be deleted locally to free space

These operations can be explicitly requested, but the goal is also to have them happen automatically based on a prediction algorithm (think branch prediction). The prediction algorithm might take a long time to refine enough to be useful, but having it work well (or even at all) right now isn't a priority. However, having the schema in place is important. There will also be some automatic actions based on hard criteria (rather than fuzzy "this is probably what the user wants" prediction)... chiefly that when there are original, user-generated files with low durability (only one copy in the world), dmedia will automatically try to upload them to other locations.

First up, we need a new "location" record type. This record will be used to represent both native dmedia storage and storage on various 3rd-party services (S3, UbuntuOne, whatever). There will be a small amount of common schema in the location record, plus other information specific to the type of location. The common schema will look something like this:

{
  "_id": LOCATION_UUID,
  "record_type": "http://example.com/dmedia/location",
  "added": time.time(),
  "durability": 2, # The default durability of this location (per file may differ)
  "plugin": NAME_OF_LOCATION_PLUGIN,
}

"plugin" will be used to, you guessed it, find the correct dmedia location plugin that knows how to deal with this particular backend. There are 3 plugins that I want right off the bat: dmedia (for native dmedia stored on the permanent file system), dmedia_removable (for native dmedia stored on a removable drive), and s3 (for storing on Amazon S3).

In addition to the common schema above, a dmedia location record might have attributes like this:

{
  "plugin": "dmedia",
  "machine": MACHINE_UUID,
  "path": "/home/jderose/.dmedia",
}

Where MACHINE_UUID is a unique ID for a given computer. It should be the same one used in the import records (see lp:680379) and probably the same used by desktopcouch.
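
As a rough illustration of how the common schema and the plugin-specific attributes might be combined into one record, here is a minimal Python sketch. The helper name new_location and the use of uuid4 for the record ID are assumptions made for this example; the report does not say how LOCATION_UUID is generated.

```python
import time
import uuid


def new_location(plugin, durability, **extra):
    """Build a location record: common schema plus plugin-specific keys."""
    doc = {
        # Stand-in for LOCATION_UUID; the real ID format is not specified here.
        "_id": uuid.uuid4().hex,
        "record_type": "http://example.com/dmedia/location",
        "added": time.time(),
        "durability": durability,  # default durability of this location
        "plugin": plugin,
    }
    # Plugin-specific attributes (e.g. "machine" and "path" for "dmedia").
    doc.update(extra)
    return doc


# "MACHINE_UUID" is left as a placeholder string, as in the report.
home = new_location(
    "dmedia", 2, machine="MACHINE_UUID", path="/home/jderose/.dmedia"
)
```

The resulting dict could then be saved to CouchDB like any other document.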

Okay, second we need a "locations" attribute in each file record. It will be a dictionary, something like this:

{
  "_id": CONTENT_HASH,
  "record_type": "http://example.com/dmedia/file",
  "locations": {
    "location_uuid_foo": {
      "added": time.time(),
      "durability": 1,
    },
    "location_uuid_bar": {
      "added": time.time(),
      "durability": 2,
      "policy": "public-read",
    }
  }
}

So the key in the "locations" dict is the LOCATION_UUID of the location record. The value is itself a dict (to make it easily extensible). It should always have "added" (timestamp of when this file was added to this location) and "durability". Different locations might require other information in the dict in order to retrieve the file (for example, a key assigned by the storage service). On S3 it would be handy to track whether the file is publicly readable (e.g. my "policy": "public-read" above).
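
A small Python sketch of how entries in the "locations" dict might be maintained. The helper names add_location and total_durability are hypothetical, and summing the per-location durability values to get an overall figure is an assumption; the report does not define how durabilities combine.

```python
import time


def add_location(file_doc, location_id, durability, **extra):
    """Record that this file is now stored in the given location."""
    entry = {"added": time.time(), "durability": durability}
    entry.update(extra)  # location-specific info, e.g. policy="public-read"
    file_doc.setdefault("locations", {})[location_id] = entry
    return file_doc


def total_durability(file_doc):
    """Assumed rule: overall durability is the sum over all locations."""
    locations = file_doc.get("locations", {})
    return sum(loc["durability"] for loc in locations.values())


doc = {"_id": "CONTENT_HASH", "record_type": "http://example.com/dmedia/file"}
add_location(doc, "location_uuid_foo", 1)
add_location(doc, "location_uuid_bar", 2, policy="public-read")
```

Keeping each value a dict means a plugin can attach its own keys without touching the common "added" and "durability" fields.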

And the final thing we need is a "priority" attribute in the file record. In gross terms, all that matters is whether the priority is "original", which means this is user-generated content that cannot be replaced if lost... dmedia will strive to always maintain a (configurable) minimum durability for original files. Any other priority means the file is replaceable one way or another, so the file is basically fair game for reclamation without regard to durability. However, these non-original priorities won't necessarily be treated the same, and this is an area where a smart prediction algorithm would be super sweet.

So the file record with "priority" might look something like this:

{
  "_id": CONTENT_HASH,
  "record_type": "http://example.com/dmedia/file",
  "locations": {...},
  "priority": "original",
}
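
The hard criteria described above (upload originals with low durability, reclaim anything replaceable) could be sketched in Python roughly as follows. The function names, the MIN_DURABILITY value, and the rule of summing durability across locations are all assumptions for illustration; the report only says the minimum is configurable.

```python
MIN_DURABILITY = 2  # hypothetical default; the report only says "configurable"


def durability(file_doc, exclude=None):
    """Sum durability over all locations, optionally skipping one (assumed rule)."""
    return sum(
        loc["durability"]
        for loc_id, loc in file_doc.get("locations", {}).items()
        if loc_id != exclude
    )


def needs_upload(file_doc, minimum=MIN_DURABILITY):
    """Original files below the minimum durability should be uploaded elsewhere."""
    return file_doc.get("priority") == "original" and durability(file_doc) < minimum


def can_reclaim(file_doc, local_id, minimum=MIN_DURABILITY):
    """Non-original files are fair game; originals only if safe without the local copy."""
    if file_doc.get("priority") != "original":
        return True
    return durability(file_doc, exclude=local_id) >= minimum


lonely = {"priority": "original", "locations": {"local": {"durability": 1}}}
proxy = {"priority": "proxy", "locations": {"local": {"durability": 1}}}
safe = {
    "priority": "original",
    "locations": {"local": {"durability": 1}, "s3": {"durability": 2}},
}
```

Here lonely would be queued for upload and never reclaimed, proxy can be reclaimed at any time, and safe can be reclaimed because the remaining location alone meets the minimum.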

Some priorities that I think will be useful:

original - don't dare lose these files
downloaded - got it from the Internet, can download it again
paid - can be replaced, but I'd have to pay for it (eg iTunes)
proxy - low res version of an original file, can be transcoded again
cache - temporary files generated for performance reasons (say a pre-render of a node in the Novacut editing graph)
render - if the needed original media files and edit description are still available, can be re-rendered (like for Novacut)

There are some priorities (like cache) that don't make sense to track throughout the library. On the other hand, the locations of proxy files are pretty important, especially for Novacut.

This is probably the most critical point of the dmedia design, so I'd love feedback/guidance from stakeholders and anyone who thinks this sounds cool.

Changed in dmedia:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Jason Gerard DeRose (jderose)
milestone: none → 0.2
Changed in dmedia:
importance: High → Critical
Changed in dmedia:
milestone: 0.2 → 0.3
Changed in dmedia:
milestone: 0.3 → 0.4
Changed in dmedia:
assignee: Jason Gerard DeRose (jderose) → nobody
importance: Critical → High
Changed in dmedia:
assignee: nobody → Jason Gerard DeRose (jderose)
status: Triaged → In Progress
Jason Gerard DeRose (jderose) wrote :

Quick progress report... I've been documenting the rationale for some of the core schema design here:

http://bazaar.launchpad.net/~jderose/dmedia/stores/view/head:/dmedia/schema.py

This provides some much-needed documentation, but also gets the ideas clearer in my own head so I'm better prepared to tackle what is probably the most critical part of the dmedia schema.

I decided that "store" is a bit better/clearer than "location"... so what I call a "location" above in the bug report is what I call a "store" in schema.py.

Jason Gerard DeRose (jderose) wrote :

Okay, wrapping this up. Ended up becoming more involved than originally planned, but it was highly instructive to work on the very formal, test-driven schema definition. Compared to what was originally roughed out in the bug report, the schema is less verbose yet more intuitive. For example:

'durability' => 'copies'
'priority' => 'origin'
'original' => 'user'

So {'origin': 'user'} is the irreplaceable stuff we must protect so carefully. Much better than {'priority': 'original'}, I think. There are all sorts of design issues and additional features this bug has opened up, but they will be addressed in additional bugs as the work scoped in this bug is in fact complete. Some high priority issues to note:

  1) 'dmedia/store' records need to include the permanent id of the physical storage device they're located on

  2) We need to be careful not to overestimate durability when there are multiple FileStore on the same physical storage device - eg, if you have two FileStore on the same hard drive, and file X is stored in both FileStore, you still only have {'copies': 1}, not {'copies': 2}

  3) 'dmedia/file' records should have an 'atime' attribute storing last access time so that we can better guess what will be needed, what can be reclaimed. Probably shouldn't update at every access, but an algorithm like, "update atime if current atime was more than a day ago", should give enough precision without creating too many conflicts, too much traffic
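
The throttled update suggested in item 3 could look roughly like this in Python. The function name touch_atime and the one-day default are illustrative only; 'atime' is a proposed attribute, not part of the schema yet.

```python
import time

DAY = 24 * 60 * 60  # seconds


def touch_atime(file_doc, now=None, min_interval=DAY):
    """Update 'atime' only if the stored value is older than min_interval.

    Returns True when the document changed and needs to be saved.
    """
    now = time.time() if now is None else now
    if now - file_doc.get("atime", 0) > min_interval:
        file_doc["atime"] = now
        return True
    return False


doc = {"_id": "CONTENT_HASH", "atime": 1000.0}
changed_soon = touch_atime(doc, now=1000.0 + DAY - 1)   # within a day: skipped
changed_later = touch_atime(doc, now=1000.0 + DAY + 1)  # over a day: updated
```

Skipping the write in the common case is what keeps CouchDB conflicts and replication traffic down.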

Anyway, the essential required metadata for a dmedia file now looks like this:

{
   "_id": "ZJAJLVLG5FHKEGFL5VJPNM4UHGWLVGVQ",
   "type": "dmedia/file",
   "time": 1297061923.757858,
   "bytes": 2659418521,
   "ext": "mov",
   "origin": "user",
   "stored": {
       "FLKMHJL2E2WIV4FXAIWLKSTR": {
           "copies": 1,
           "time": 1297061923.757858
       }
   }
}
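
To illustrate issue 2 above against this record layout, here is a hedged Python sketch that avoids double-counting stores on the same physical device. The device_of mapping (store ID to device ID) is hypothetical, since 'dmedia/store' records do not yet carry a device ID, and taking the maximum 'copies' per device is an assumption.

```python
def estimate_copies(file_doc, device_of):
    """Estimate total copies, counting each physical device only once.

    device_of maps store ID -> physical device ID (hypothetical input).
    A store with no known device is treated as its own device.
    """
    per_device = {}
    for store_id, info in file_doc["stored"].items():
        device = device_of.get(store_id, store_id)
        # Several stores on one device still count as that device's best value.
        per_device[device] = max(per_device.get(device, 0), info["copies"])
    return sum(per_device.values())


doc = {
    "stored": {
        "STORE_A": {"copies": 1, "time": 1297061923.757858},
        "STORE_B": {"copies": 1, "time": 1297061923.757858},
    }
}
same_drive = estimate_copies(doc, {"STORE_A": "disk0", "STORE_B": "disk0"})
two_drives = estimate_copies(doc, {"STORE_A": "disk0", "STORE_B": "disk1"})
```

With both stores on one drive the estimate stays at 1; on separate drives it is 2.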

Changed in dmedia:
status: In Progress → Fix Committed
Jason Gerard DeRose (jderose) wrote :

Filed bugs for the above three issues:

1) lp:714941 "dbus/udisks guru? 'dmedia/store' records should store information about physical storage device"

2) lp:714955 "Don't overestimate durability when multiple FileStore are on same physical storage device"

3) lp:714994 "Add 'atime' to 'dmedia/file' records?"

Changed in dmedia:
status: Fix Committed → Fix Released