clean up file/archive access in deriver

Bug #174567 reported by Hank Bromley
Affects: Deriver
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone:

Bug Description

Rethink some of the object classes to better encapsulate knowledge about the names and formats of image files and image archives. We now have a lot of redundancy-with-slight-variations, and a disproportionate amount of code within each module dedicated to the task of finding, preparing, and stepping through the source files.

If those tasks are abstracted into object classes (each class could know how to deal with different kinds of source images, but all would present the same interface to the module using it), modules won't have to worry about how the individual image files are named, whether an archive is tarred or zipped, etc. That stuff is ugly and easy to mess up, and is being done repeatedly in nearly every module.

The deriver could pass each module a file-handling object of the right class for the source format. The messages the object would need to handle might include:

"image count"
  returns the total number of source images (needed for 250-chunking, etc.)

"give me the next image"
  returns a pointer of some kind to the file (a string containing fully qualified file spec?)
  or, for modules that merge multiple inputs, an array of pointers, one for each input stream

"at end of images?"
  boolean, used as loop exit condition

"clean up"
  called when entire derive finishes (whether successfully or due to a failure) or is interrupted
  releases all resources it's been using, deletes extra formats of source files, etc.
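
The four messages above could be sketched roughly as follows. This is only an illustration, written in Python for brevity (the thread suggests the deriver itself is PHP). Every name here (ImageSource, ListBackedSource, DirectorySource, ZipSource, process_all) is hypothetical and does not exist in the deriver. The sketch assumes one input stream per source and that sorting by path gives the right image order.

```python
import os
import shutil
import tempfile
import zipfile
from abc import ABC, abstractmethod


class ImageSource(ABC):
    """Hypothetical common interface seen by every module."""

    @abstractmethod
    def image_count(self):
        """'image count': total number of source images."""

    @abstractmethod
    def next_image(self):
        """'give me the next image': fully qualified path of the next file."""

    @abstractmethod
    def at_end(self):
        """'at end of images?': True once every image has been handed out."""

    @abstractmethod
    def clean_up(self):
        """'clean up': release resources and delete temporary files."""


class ListBackedSource(ImageSource):
    """Shared bookkeeping: a sorted list of paths plus a cursor."""

    def __init__(self, paths):
        self._paths = sorted(os.path.abspath(p) for p in paths)
        self._pos = 0

    def image_count(self):
        return len(self._paths)

    def next_image(self):
        path = self._paths[self._pos]  # raises IndexError past the end
        self._pos += 1
        return path

    def at_end(self):
        return self._pos >= len(self._paths)

    def clean_up(self):
        pass  # nothing temporary to remove by default


class DirectorySource(ListBackedSource):
    """Images that already sit as plain files in one directory."""

    def __init__(self, directory):
        names = os.listdir(directory)
        paths = [os.path.join(directory, n) for n in names]
        super().__init__(p for p in paths if os.path.isfile(p))


class ZipSource(ListBackedSource):
    """Images inside a zip archive, unpacked to a scratch directory."""

    def __init__(self, archive_path):
        self._workdir = tempfile.mkdtemp(prefix="imgsrc_")
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(self._workdir)
        found = []
        for root, _dirs, files in os.walk(self._workdir):
            found.extend(os.path.join(root, name) for name in files)
        super().__init__(found)

    def clean_up(self):
        # Remove the extracted copies; the original archive is untouched.
        shutil.rmtree(self._workdir, ignore_errors=True)


def process_all(source, handle):
    """Module-side loop: identical for every kind of source."""
    try:
        while not source.at_end():
            handle(source.next_image())
    finally:
        source.clean_up()  # runs on success, failure, or interruption
```

A module would then call something like process_all(ZipSource("book_images.zip"), convert_page), where the archive name and convert_page are placeholders. The module never learns whether the images came from a directory or an archive. A tar-backed class, or one returning an array of paths for modules that merge several inputs, would slot in behind the same four methods.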

The change could be implemented incrementally. Modules that are now fine could be left alone for the time being. Any new ones could use these image-access objects, and old ones could be retrofitted as time permits and/or as the need arises.

See my two emails to book-processing, dated Mon, 28 May 2007 22:36:35 -0400 (EDT) and Tue, 29 May 2007 02:25:10 -0400 (EDT).

Changed in deriver:
importance: Undecided → Wishlist
status: New → Confirmed
Hank Bromley (hank-archive) wrote :

The two email messages referenced above (minus actual addresses):

Date: Mon, 28 May 2007 22:36:35 -0400 (EDT)
From: Hank Bromley <...>
To: Brewster Kahle <...>
Cc: Todd A Cass <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images

[...]

As a newcomer, I do find the variety of ways that modules access their
source files confusing. It's easy to understand how that would come about
as the deriver developed organically, with new modules added over time, but
my impression is that as a result we now have a lot of
redundancy-with-slight-variations, and a disproportionate amount of code
within each module dedicated to the task of finding, preparing, and stepping
through the source files.

I think it would be fine to run everything through some kind of preprocessor
that ensures everything is in canonical form. I do, though, have an
alternative suggestion: it seems to me this task that's largely shared
across modules but with some variation, and that involves messy interaction
with the outside world, is a good candidate for abstraction through object
classes. Say each module were passed an object that handled all the
interaction with source files, through standard messages like "how many
images are there?" and "give me the next one." The objects would come in
different flavors (I guess the term is "class" in PHP-world), each knowing
how to deal with different kinds of source material, but each would present
the same interface to the module using it. Modules, I think, really
shouldn't have to worry about how the individual image files are named,
whether an archive is tarred or zipped, or even whether the images are
stored in a Unix filesystem or something else entirely. Dealing with that
stuff is ugly and easy to mess up, and we're now having to figure it out
separately for each module.

It would indeed be a lot of work to change all the modules to use such a
mechanism, but the change could be done incrementally. Modules that are now
fine could be left alone for the time being. Any new ones could use these
image-access objects, and old ones could be retrofitted as time permits
and/or as the need arises.

-- Hank

Date: Tue, 29 May 2007 02:25:10 -0400 (EDT)
From: Hank Bromley <...>
To: Todd A Cass <...>
Cc: Brewster Kahle <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images

On Mon, 28 May 2007, Todd A. Cass wrote:

> I like the way you're thinking. I view that as steps down the path to (3)
> mentioned earlier: re-design the fundamentals of the book derivation. It
> can happen incrementally. In my expert opinion, it's less work than
> continuing to accrete changes on the existing thing...

I agree. If not less work in the short term, certainly less over the long
run.

> I'll help!

Cool. If Brewster's also okay with this approach, would it make sense to
start by trying to identify a minimal set of messages such "image access"
objects would need to handle, in order to serve all the current modules? We
could then begin to think about what kinds of classes/subclasses we need to
properly respond to those messages.

What ...

