clean up file/archive access in deriver
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Deriver | Confirmed | Wishlist | Unassigned |
Bug Description
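Rethink some of the object classes to better encapsulate knowledge about names and formats of image files and image archives. We now have a lot of redundancy-with-slight-variations, and a disproportionate amount of code within each module dedicated to the task of finding, preparing, and stepping through the source files.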
If those tasks are abstracted into object classes (each class could know how to deal with different kinds of source images, but all would present the same interface to the module using it), modules won't have to worry about how the individual image files are named, whether an archive is tarred or zipped, etc. That stuff is ugly and easy to mess up, and is being done repeatedly in nearly every module.
The deriver could pass each module a file-handling object of the right class for the source format. The messages the object would need to handle might include:
"image count"
returns the total number of source images (needed for 250-chunking, etc.)
"give me the next image"
returns a pointer of some kind to the file (a string containing fully qualified file spec?)
or, for modules that merge multiple inputs, an array of pointers, one for each input stream
"at end of images?"
boolean, used as loop exit condition
"clean up"
called when entire derive finishes (whether successfully or due to a failure) or is interrupted
returns all resources it's been using, deletes extra formats of source files, etc.
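A rough sketch of what such objects might look like, to make the four messages concrete. It is written in Python purely for illustration (the deriver itself is PHP), and every name in it (ImageSource, DirectoryImageSource, ZipImageSource, count_pages, the extension list) is hypothetical rather than anything that exists in the deriver today. It assumes "a pointer to the file" means a fully qualified path string, and it leaves out the multi-input variant that would return an array of pointers.

```python
import os
import shutil
import tempfile
import zipfile
from abc import ABC, abstractmethod

# Assumed set of source image extensions; the real list would come from the deriver.
IMAGE_EXTENSIONS = (".jp2", ".jpg", ".tif", ".png")


class ImageSource(ABC):
    """Hypothetical common interface the deriver would hand to each module."""

    def __init__(self):
        self._paths = self._locate_images()
        self._pos = 0

    @abstractmethod
    def _locate_images(self):
        """Return a sorted list of fully qualified image paths."""

    def image_count(self):
        """'image count': total number of source images."""
        return len(self._paths)

    def at_end(self):
        """'at end of images?': boolean used as the loop exit condition."""
        return self._pos >= len(self._paths)

    def next_image(self):
        """'give me the next image': fully qualified path as a string."""
        if self.at_end():
            raise IndexError("no more images")
        path = self._paths[self._pos]
        self._pos += 1
        return path

    def clean_up(self):
        """'clean up': release resources; nothing to release by default."""


class DirectoryImageSource(ImageSource):
    """Flavor for images that already sit in a directory tree."""

    def __init__(self, directory):
        self._directory = os.path.abspath(directory)
        super().__init__()

    def _locate_images(self):
        found = []
        for root, _dirs, files in os.walk(self._directory):
            for name in files:
                if name.lower().endswith(IMAGE_EXTENSIONS):
                    found.append(os.path.join(root, name))
        return sorted(found)


class ZipImageSource(DirectoryImageSource):
    """Flavor for zipped images: unpack to scratch space, then behave like a directory."""

    def __init__(self, archive_path):
        self._scratch = tempfile.mkdtemp(prefix="deriver_")
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(self._scratch)
        super().__init__(self._scratch)

    def clean_up(self):
        # Delete the extracted copies, whether the derive succeeded or failed.
        shutil.rmtree(self._scratch, ignore_errors=True)


def count_pages(source):
    """Example module body: it never learns how the images are stored."""
    try:
        seen = 0
        while not source.at_end():
            source.next_image()
            seen += 1
        return seen
    finally:
        source.clean_up()
```

A module written like count_pages would run unchanged against either flavor; a tar flavor could presumably be added the same way without touching any module.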
The change could be implemented incrementally. Modules that are now fine could be left alone for the time being. Any new ones could use these image-access objects, and old ones could be retrofitted as time permits and/or as the need arises.
See my emails to book-processing dated Mon, 28 May 2007 22:36:35 -0400 (EDT) and Tue, 29 May 2007 02:25:10 -0400 (EDT).
Changed in deriver:
importance: Undecided → Wishlist
status: New → Confirmed
The two email messages referenced above (minus actual addresses):
Date: Mon, 28 May 2007 22:36:35 -0400 (EDT)
From: Hank Bromley <...>
To: Brewster Kahle <...>
Cc: Todd A Cass <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images
[...]
As a newcomer, I do find the variety of ways that modules access their
source files confusing. It's easy to understand how that would come about
as the deriver developed organically, with new modules added over time, but
my impression is that as a result we now have a lot of
redundancy-with-slight-variations, and a disproportionate amount of code
within each module dedicated to the task of finding, preparing, and stepping
through the source files.
I think it would be fine to run everything through some kind of preprocessor
that ensures everything is in canonical form. I do, though, have an
alternative suggestion: it seems to me this task that's largely shared
across modules but with some variation, and that involves messy interaction
with the outside world, is a good candidate for abstraction through object
classes. Say each module were passed an object that handled all the
interaction with source files, through standard messages like "how many
images are there?" and "give me the next one." The objects would come in
different flavors (I guess the term is "class" in PHP-world), each knowing
how to deal with different kinds of source material, but each would present
the same interface to the module using it. Modules, I think, really
shouldn't have to worry about how the individual image files are named,
whether an archive is tarred or zipped, or even whether the images are
stored in a Unix filesystem or something else entirely. Dealing with that
stuff is ugly and easy to mess up, and we're now having to figure it out
separately for each module.
It would indeed be a lot of work to change all the modules to use such a
mechanism, but the change could be done incrementally. Modules that are now
fine could be left alone for the time being. Any new ones could use these
image-access objects, and old ones could be retrofitted as time permits
and/or as the need arises.
-- Hank
Date: Tue, 29 May 2007 02:25:10 -0400 (EDT)
From: Hank Bromley <...>
To: Todd A Cass <...>
Cc: Brewster Kahle <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images
On Mon, 28 May 2007, Todd A. Cass wrote:
> I like the way you're thinking. I view that as steps down the path to (3)
> mentioned earlier: re-design the fundamentals of the book derivation. It
> can happen incrementally. In my expert opinion, it's less work than
> continuing to accrete changes on the existing thing...
I agree. If not less work in the short term, certainly less over the long
run.
> I'll help!
Cool. If Brewster's also okay with this approach, would it make sense to
start by trying to identify a minimal set of messages such "image access"
objects would need to handle, in order to serve all the current modules? We
could then begin to think about what kinds of classes/subclasses we need to
properly respond to those messages.
What ...