Deriver

Bug #174567
Comment #1

Hank Bromley (hank-archive) wrote on 2007-12-19:

The two email messages referenced above (minus actual addresses):

Date: Mon, 28 May 2007 22:36:35 -0400 (EDT)
From: Hank Bromley <...>
To: Brewster Kahle <...>
Cc: Todd A Cass <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images

[...]

As a newcomer, I do find the variety of ways that modules access their
source files confusing. It's easy to understand how that would come about
as the deriver developed organically, with new modules added over time, but
my impression is that as a result we now have a lot of
redundancy-with-slight-variations, and a disproportionate amount of code
within each module dedicated to the task of finding, preparing, and stepping
through the source files.

I think it would be fine to run everything through some kind of preprocessor
that ensures everything is in canonical form. I do, though, have an
alternative suggestion: it seems to me that this task, which is largely shared
across modules but with some variation and which involves messy interaction
with the outside world, is a good candidate for abstraction through object
classes. Say each module were passed an object that handled all the
interaction with source files, through standard messages like "how many
images are there?" and "give me the next one." The objects would come in
different flavors (I guess the term is "class" in PHP-world), each knowing
how to deal with different kinds of source material, but each would present
the same interface to the module using it. Modules, I think, really
shouldn't have to worry about how the individual image files are named,
whether an archive is tarred or zipped, or even whether the images are
stored in a Unix filesystem or something else entirely. Dealing with that
stuff is ugly and easy to mess up, and we're now having to figure it out
separately for each module.
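
A minimal sketch of the idea in the paragraph above, written in Python for
illustration rather than in the deriver's own language. Every name here
(ImageSource, DirectorySource, ZipSource, describe, build_demo_sources) is
hypothetical and not taken from the deriver; the point is only that the
calling code cannot tell a directory of images from a zipped archive.

```python
import os
import tempfile
import zipfile
from abc import ABC, abstractmethod


class ImageSource(ABC):
    """Hypothetical common interface; modules talk only to this."""

    @abstractmethod
    def image_names(self):
        """Return the source image names in a stable order."""

    def image_count(self):
        return len(self.image_names())


class DirectorySource(ImageSource):
    """Flavor that reads images from a plain directory."""

    def __init__(self, path):
        self.path = path

    def image_names(self):
        return sorted(os.listdir(self.path))


class ZipSource(ImageSource):
    """Flavor that reads images from a zip archive."""

    def __init__(self, path):
        self.path = path

    def image_names(self):
        with zipfile.ZipFile(self.path) as archive:
            return sorted(archive.namelist())


def describe(source):
    # A toy "module": it never learns which flavor it was handed.
    return f"{source.image_count()} images, first is {source.image_names()[0]}"


def build_demo_sources():
    """Create a temp directory of fake images plus a zip of the same files."""
    root = tempfile.mkdtemp()
    image_dir = os.path.join(root, "images")
    os.mkdir(image_dir)
    for name in ("0001.tif", "0002.tif", "0003.tif"):
        with open(os.path.join(image_dir, name), "wb") as handle:
            handle.write(b"not a real image")
    zip_path = os.path.join(root, "images.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        for name in sorted(os.listdir(image_dir)):
            archive.write(os.path.join(image_dir, name), arcname=name)
    return DirectorySource(image_dir), ZipSource(zip_path)
```

Calling describe() on either object returned by build_demo_sources() gives the
same answer, which is the property the proposal is after.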

It would indeed be a lot of work to change all the modules to use such a
mechanism, but the change could be done incrementally. Modules that are now
fine could be left alone for the time being. Any new ones could use these
image-access objects, and old ones could be retrofitted as time permits
and/or as the need arises.

-- Hank

Date: Tue, 29 May 2007 02:25:10 -0400 (EDT)
From: Hank Bromley <...>
To: Todd A Cass <...>
Cc: Brewster Kahle <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images

On Mon, 28 May 2007, Todd A. Cass wrote:

> I like the way you're thinking. I view that as steps down the path to (3)
> mentioned earlier: re-design the fundamentals of the book derivation. It
> can happen incrementally. In my expert opinion, it's less work than
> continuing to accrete changes on the existing thing...

I agree. If not less work in the short term, certainly less over the long
run.

> I'll help!

Cool. If Brewster's also okay with this approach, would it make sense to
start by trying to identify a minimal set of messages such "image access"
objects would need to handle, in order to serve all the current modules? We
could then begin to think about what kinds of classes/subclasses we need to
properly respond to those messages.

What I'm imagining is that at the beginning of each derive a single object
would be created (whose class might depend on what form the source files are
available in), and that object would be passed to each module in turn. The
advantage of having it be a single object is preserving certain state
information across modules, to reduce duplicated effort. For instance,
let's say the source images are tifs and a given module wants only to deal
with jp2s (like the LuraTechPDF module, which currently converts tifs to
jp2s). The image access object would do the conversions for the first
module that requested jp2s, but thereafter would provide the same jp2s
without recreating them, because it would know it had already made them for
a previous module. And because the same object would still be around at the
end of the derive, it could clean up any extra formats it had created.
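
The convert-once-then-reuse behavior described above might look roughly like
the following Python sketch. The class and method names are invented for
illustration, and the "conversion" is only a file copy standing in for a real
tif-to-jp2 step; the conversions_run counter exists purely to make the caching
visible.

```python
import os
import shutil
import tempfile


class CachingImageSource:
    """Hypothetical image-access object shared by every module in one derive."""

    def __init__(self, source_paths, work_dir):
        self.source_paths = list(source_paths)
        self.work_dir = work_dir
        self._converted = {}      # (source path, format) -> derived path
        self.conversions_run = 0  # only here to show the cache working

    def get(self, index, fmt=None):
        """Return the path of image `index`, converted to `fmt` if needed."""
        source = self.source_paths[index]
        current = os.path.splitext(source)[1].lstrip(".").lower()
        if fmt is None or fmt == current:
            return source
        key = (source, fmt)
        if key not in self._converted:
            self._converted[key] = self._convert(source, fmt)
        return self._converted[key]

    def _convert(self, source, fmt):
        # Stand-in for a real tif -> jp2 conversion: just copies the bytes.
        stem = os.path.splitext(os.path.basename(source))[0]
        target = os.path.join(self.work_dir, f"{stem}.{fmt}")
        shutil.copyfile(source, target)
        self.conversions_run += 1
        return target

    def clean_up(self):
        """Delete every extra format this object created."""
        for path in self._converted.values():
            if os.path.exists(path):
                os.remove(path)
        self._converted.clear()


def demo():
    """Two 'modules' ask for jp2s in turn; only the first triggers conversions."""
    src_dir, work_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
    paths = []
    for name in ("0001.tif", "0002.tif"):
        path = os.path.join(src_dir, name)
        with open(path, "wb") as handle:
            handle.write(b"fake tif bytes")
        paths.append(path)
    source = CachingImageSource(paths, work_dir)
    first = [source.get(i, "jp2") for i in range(2)]   # first module converts
    second = [source.get(i, "jp2") for i in range(2)]  # second module reuses
    made = source.conversions_run
    source.clean_up()
    leftovers = os.listdir(work_dir)
    return first == second, made, leftovers
```

Because the cache lives on the one object that survives the whole derive, the
same object also knows exactly which extra files to delete at the end.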

Here's a first pass on some messages we'll need to handle:

"image count"
  returns the total number of source images (needed for 250-chunking, etc.)

"give me the next image" (optionally "as jp2" or "as anything but tif", etc.)
  returns a pointer of some kind to the file (a string containing fully qualified file spec?)
  or, for modules that merge multiple inputs, an array of pointers, one for each input stream

"at end of images?"
  boolean, used as loop exit condition

"clean up"
  called when entire derive finishes (whether successfully or due to a failure) or is interrupted
  returns all resources it's been using, deletes extra formats of source files, etc.
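
As a rough illustration, the four messages listed above could map onto methods
like these. This is a Python sketch under assumptions: the names
ListImageSource and run_derive are hypothetical, the source is an in-memory
list rather than real files, and "250-chunking" is read here as grouping
images 250 at a time.

```python
class ListImageSource:
    """Hypothetical in-memory stand-in that answers the four messages."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._position = 0
        self.cleaned_up = False

    def image_count(self):
        return len(self._paths)

    def at_end(self):
        return self._position >= len(self._paths)

    def next_image(self):
        if self.at_end():
            raise IndexError("no more images")
        path = self._paths[self._position]
        self._position += 1
        return path

    def clean_up(self):
        # A real implementation would release resources and delete extra formats.
        self.cleaned_up = True


def run_derive(source, chunk_size=250):
    """Toy derive with one step: walk every image and group paths into chunks."""
    chunks = []
    try:
        while not source.at_end():
            if not chunks or len(chunks[-1]) == chunk_size:
                chunks.append([])
            chunks[-1].append(source.next_image())
    finally:
        # Runs whether the loop finishes, raises, or is interrupted.
        source.clean_up()
    return chunks
```

Putting clean_up() in a finally clause is one way to get the "whether
successfully or due to a failure" guarantee without every module having to
remember it.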

I'm still pondering how best to arrange for a single object to respond to
the same "give me the next image" message with either a single pointer or an
array of pointers, depending on which module is asking. Perhaps the first
thing each module does is call a "setup" method, indicating what kind and
how many inputs it wants, which could set some instance variables or create
a subsidiary, more specialized image-access object. Or perhaps the module
object is queried about its requirements before its derive method is called,
and the more specialized object (containing the state information from its
parent, or perhaps the parent itself) is passed in the derive call.
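
One possible shape for the first option above (a "setup" call that hands back a
more specialized object) is sketched below in Python. It is only an
illustration of the open question, not a settled design: ImageAccess,
ImageCursor, and the stream names used in the usage note are all hypothetical.

```python
class ImageCursor:
    """Hypothetical per-module view created by ImageAccess.setup()."""

    def __init__(self, streams):
        self._streams = streams  # one list of paths per requested input stream
        self._position = 0

    def at_end(self):
        return self._position >= min(len(stream) for stream in self._streams)

    def next_image(self):
        if self.at_end():
            raise IndexError("no more images")
        row = [stream[self._position] for stream in self._streams]
        self._position += 1
        # Single-input modules get a plain string, merging modules get a list.
        return row[0] if len(row) == 1 else row


class ImageAccess:
    """Hypothetical parent object that lives for the whole derive."""

    def __init__(self, streams_by_name):
        self._streams_by_name = streams_by_name

    def setup(self, *stream_names):
        """Called first by each module to declare which inputs it wants."""
        if not stream_names:
            raise ValueError("a module must request at least one input stream")
        return ImageCursor([self._streams_by_name[name] for name in stream_names])
```

With an ImageAccess built from two made-up streams, say "primary" and
"secondary", access.setup("primary") yields single paths while
access.setup("primary", "secondary") yields a list per step, and each cursor
keeps its own position so one module's iteration does not disturb another's.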

-- Hank
