The two email messages referenced above (minus actual addresses):

Date: Mon, 28 May 2007 22:36:35 -0400 (EDT)
From: Hank Bromley <...>
To: Brewster Kahle <...>
Cc: Todd A Cass <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images

[...]

As a newcomer, I do find the variety of ways that modules access their source files confusing. It's easy to understand how that would come about as the deriver developed organically, with new modules added over time, but my impression is that as a result we now have a lot of redundancy-with-slight-variations, and a disproportionate amount of code within each module dedicated to the task of finding, preparing, and stepping through the source files.

I think it would be fine to run everything through some kind of preprocessor that ensures everything is in canonical form. I do, though, have an alternative suggestion: it seems to me this task that's largely shared across modules but with some variation, and that involves messy interaction with the outside world, is a good candidate for abstraction through object classes.

Say each module were passed an object that handled all the interaction with source files, through standard messages like "how many images are there?" and "give me the next one." The objects would come in different flavors (I guess the term is "class" in PHP-world), each knowing how to deal with different kinds of source material, but each would present the same interface to the module using it.

Modules, I think, really shouldn't have to worry about how the individual image files are named, whether an archive is tarred or zipped, or even whether the images are stored in a Unix filesystem or something else entirely. Dealing with that stuff is ugly and easy to mess up, and we're now having to figure it out separately for each module.

It would indeed be a lot of work to change all the modules to use such a mechanism, but the change could be done incrementally.
Modules that are now fine could be left alone for the time being. Any new ones could use these image-access objects, and old ones could be retrofitted as time permits and/or as the need arises.

-- Hank


Date: Tue, 29 May 2007 02:25:10 -0400 (EDT)
From: Hank Bromley <...>
To: Todd A Cass <...>
Cc: Brewster Kahle <...>, Book Processing <...>
Subject: Re: [Book-processing] generalizing the deriver to handle directories of images

On Mon, 28 May 2007, Todd A. Cass wrote:

> I like the way you're thinking. I view that as steps down the path to (3)
> mentioned earlier: re-design the fundamentals of the book derivation. It
> can happen incrementally. In my expert opinion, it's less work than
> continuing to accrete changes on the existing thing...

I agree. If not less work in the short term, certainly less over the long run.

> I'll help!

Cool. If Brewster's also okay with this approach, would it make sense to start by trying to identify a minimal set of messages such "image access" objects would need to handle, in order to serve all the current modules? We could then begin to think about what kinds of classes/subclasses we need to properly respond to those messages.

What I'm imagining is that at the beginning of each derive a single object would be created (whose class might depend on what form the source files are available in), and that object would be passed to each module in turn. The advantage of having it be a single object is preserving certain state information across modules, to reduce duplicated effort.

For instance, let's say the source images are tifs and a given module wants only to deal with jp2s (like the LuraTechPDF module, which currently converts tifs to jp2s). The image access object would do the conversions for the first module that requested jp2s, but thereafter would provide the same jp2s without recreating them, because it would know it had already made them for a previous module.
And because the same object would still be around at the end of the derive, it could clean up any extra formats it had created.

Here's a first pass on some messages we'll need to handle:

"image count"
    returns the total number of source images (needed for 250-chunking, etc.)

"give me the next image" (optionally "as jp2" or "as anything but tif", etc.)
    returns a pointer of some kind to the file (a string containing
    fully qualified file spec?)
    or, for modules that merge multiple inputs, an array of pointers,
    one for each input stream

"at end of images?"
    boolean, used as loop exit condition

"clean up"
    called when entire derive finishes (whether successfully or due to
    a failure) or is interrupted
    returns all resources it's been using, deletes extra formats of
    source files, etc.

I'm still pondering how best to arrange for a single object to respond to the same "give me the next image" message with either a single pointer or an array of pointers, depending on which module is asking. Perhaps the first thing each module does is call a "setup" method, indicating what kind and how many inputs it wants, which could set some instance variables or create a subsidiary, more specialized image-access object. Or perhaps the module object is queried about its requirements before its derive method is called, and the more specialized object (containing the state information from its parent, or perhaps the parent itself) is passed in the derive call.

-- Hank
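

The four messages listed in the second email could be sketched roughly as below. This is an illustrative sketch in Python, not the deriver's code (the emails suggest that is PHP), and nothing here comes from the original messages beyond the four messages themselves. The names ImageSource, DirectoryImageSource, placeholder_convert, and run_module are hypothetical, and so is the rewind message, which is assumed here so that each module can start again from the first image. The placeholder converter only copies bytes; a real one would produce an actual jp2. The multiple-input-stream case that the email leaves open is not covered.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


class ImageSource(ABC):
    """Hypothetical interface: one object per derive, passed to every module."""

    @abstractmethod
    def image_count(self):
        """'image count': total number of source images."""

    @abstractmethod
    def next_image(self, as_format=None):
        """'give me the next image', optionally in a requested format."""

    @abstractmethod
    def at_end(self):
        """'at end of images?': True once every image has been handed out."""

    @abstractmethod
    def rewind(self):
        """Assumed extra message: restart iteration for the next module."""

    @abstractmethod
    def clean_up(self):
        """'clean up': release resources and delete extra formats."""


def placeholder_convert(src_path, dst_path):
    # Stand-in for a real converter (e.g. tif to jp2): it only copies bytes.
    shutil.copyfile(src_path, dst_path)


class DirectoryImageSource(ImageSource):
    """One possible flavor: source images sitting in a plain directory."""

    def __init__(self, directory, source_ext="tif", convert=placeholder_convert):
        self._paths = sorted(
            os.path.join(directory, name)
            for name in os.listdir(directory)
            if name.lower().endswith("." + source_ext)
        )
        self._source_ext = source_ext
        self._convert = convert
        self._position = 0
        self._derived = {}     # (source path, format) -> derived file path
        self._workdir = None   # scratch directory, created on first conversion

    def image_count(self):
        return len(self._paths)

    def at_end(self):
        return self._position >= len(self._paths)

    def rewind(self):
        self._position = 0

    def next_image(self, as_format=None):
        if self.at_end():
            raise IndexError("no more images")
        src = self._paths[self._position]
        self._position += 1
        if as_format is None or as_format == self._source_ext:
            return src
        key = (src, as_format)
        if key not in self._derived:
            # Convert only once; later modules get the cached file back.
            if self._workdir is None:
                self._workdir = tempfile.mkdtemp(prefix="derive-")
            stem = os.path.splitext(os.path.basename(src))[0]
            dst = os.path.join(self._workdir, stem + "." + as_format)
            self._convert(src, dst)
            self._derived[key] = dst
        return self._derived[key]

    def clean_up(self):
        # Delete every extra format created during this derive.
        if self._workdir is not None:
            shutil.rmtree(self._workdir, ignore_errors=True)
        self._workdir = None
        self._derived.clear()


def run_module(source, wanted_format=None):
    """Hypothetical module body: step through all images in one format."""
    source.rewind()
    handed_out = []
    while not source.at_end():
        handed_out.append(source.next_image(as_format=wanted_format))
    return handed_out
```

In this sketch, two modules that both call run_module(source, "jp2") on the same object trigger the conversion only once per image, and a single clean_up() at the end of the derive removes the scratch directory while leaving the original source files alone.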