Comment 1 for bug 215266

Hank Bromley (hank-archive) wrote : Re: [Bug 215266] [NEW] ocr skipped on *_jp2.tar (not *_jp2.zip)

Spent some time looking into this when Dan first mentioned it to me, and
there are actually two interrelated symptoms here: (1) the derivation
stops prematurely, after doing only ProcessJP2, and (2) if the derive is
restarted, ProcessJP2 gets repeated, and would in fact rerun on every
deriver loop if we fixed symptom (1) without addressing the underlying
cause.

The basic cause of both problems is that we've told the deriver to treat
processed jp2 zips and processed jp2 tars as two different formats. Here's
what happens:

Not only was the Abbyy module not run, but neither were AnimatedGIF or
JpgFlipBookZip, both of which the deriver should have tried making from
the processed jp2 tar, according to derivations.xml.

Furthermore, there was no derivation analysis shown in the log after
ProcessJP2 finished.

So clearly the deriver thought it was done and stopped looking for new
formats to make, even though there were more that it could have made.
There are two conditions under which the deriver thinks it's finished:
either it's done ten iterations of the derive loop (not the case here, as
it did only one iteration), or it completes a pass during which no new
files were created. Of course it did make a new file, but that file was a
processed jp2 tar (at 6 GB the output was too big for a zip), not the zip
the ProcessJP2 module was asked to make, and so by default it doesn't
count as a new file.
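
To make the stopping rule concrete, here is a rough Python model of the loop
as described above. It is not the deriver's actual code: the function
run_derive, the rule tuples, and the format name "source jp2" are all
hypothetical, and the only behavior modeled is "stop after ten passes, or
after a pass in which no requested format was produced".

```python
MAX_PASSES = 10  # the pass cap mentioned above


def run_derive(existing, rules, max_passes=MAX_PASSES):
    """Hypothetical model of the derive loop.

    existing: formats already in the item.
    rules: (source_format, requested_format, produced_format) tuples;
           produced_format is what the module actually wrote.
    Returns (passes_run, formats_present).
    """
    formats = set(existing)
    passes = 0
    while passes < max_passes:
        passes += 1
        # Decide what to run from what existed at the start of the pass.
        available = frozenset(formats)
        made_new = False
        for source, requested, produced in rules:
            if source in available and requested not in available:
                formats.add(produced)
                # Only the format that was asked for counts as "new".
                if produced == requested:
                    made_new = True
        if not made_new:
            break
    return passes, formats


SOURCE = "source jp2"  # hypothetical name for the input format
ZIP = "processed jp2 zip"
TAR = "processed jp2 tar"

# Small item: ProcessJP2 writes the zip it was asked for.
small_item = [(SOURCE, ZIP, ZIP), (ZIP, "Abbyy", "Abbyy")]
# Big item: ProcessJP2 was asked for a zip but wrote a tar instead.
big_item = [(SOURCE, ZIP, TAR), (TAR, "Abbyy", "Abbyy")]

small_result = run_derive({SOURCE}, small_item)
big_result = run_derive({SOURCE}, big_item)
```

In this model the small item runs to completion and picks up Abbyy, while the
big item stops after a single pass with the tar present and Abbyy never
attempted, which matches symptom (1).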

ProcessJP2 could pass the tar to the "extraTarget" function, which *would*
get the deriver to count it as a new file, and thus cause the derive to
continue beyond the first iteration (correcting symptom #1), but that
wouldn't help with symptom #2: because there is no processed jp2 zip, on
every iteration the deriver will try again to make one - and end up making
the processed jp2 tar again. And given that we'll have a new processed jp2
tar after every pass, all the other modules will repeat, too. Infinite
loop, if not for the 10-pass maximum.
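
A toy simulation of that scenario, under the assumption that reporting the tar
through extraTarget simply makes the pass count as having produced a new file.
The function name, the run counters, and the format name "source jp2" are made
up for illustration, and Abbyy stands in for every module downstream of the
tar.

```python
def simulate_extra_target(max_passes=10):
    """Hypothetical model: ProcessJP2 reports its tar as a new file."""
    formats = {"source jp2"}
    runs = {"ProcessJP2": 0, "Abbyy": 0}
    for _ in range(max_passes):
        available = set(formats)  # what existed when the pass began
        made_new = False
        # The zip never appears, so ProcessJP2 is retried on every pass
        # and rewrites the tar each time.
        if "processed jp2 zip" not in available:
            runs["ProcessJP2"] += 1
            formats.add("processed jp2 tar")
            made_new = True  # counted because it was reported as a target
        # The tar is fresh after every pass, so downstream work repeats.
        if "processed jp2 tar" in available:
            runs["Abbyy"] += 1
            made_new = True
        if not made_new:
            break
    return runs


runs = simulate_extra_target()
```

With the default cap this gives ten ProcessJP2 runs and nine Abbyy runs: the
loop only ends because it hits the pass limit.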

One ugly workaround would be to check, at the beginning of ProcessJP2,
whether there's a processed jp2 tar; if so, and if it's newer than the
sourceFile of the module, we exit immediately.
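
The check itself would be small. A minimal sketch in Python, using file
modification times; the function name and the two path arguments are
hypothetical, and the real module would get its paths from whatever the
deriver hands it.

```python
import os


def processed_tar_is_current(source_path, tar_path):
    """Return True if tar_path exists and is newer than source_path."""
    try:
        return os.path.getmtime(tar_path) > os.path.getmtime(source_path)
    except OSError:
        # A missing tar (or missing source) means there is nothing to skip.
        return False
```

ProcessJP2 would call this first and return immediately when it is true. It
stops the repeated work, but the deriver still believes a zip is missing.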

Perhaps cleaner, although more involved, would be combining processed jp2
zip and processed jp2 tar into a single format. I don't think the deriver
should care whether a file is a zip or a tar, just that it's a processed
jp2 archive. We could let the deriver treat them as one, and let modules
deal with figuring out which one they have. Actually the pack() and
unpack() functions in ImageArchive already deal with archives generically.
So modules would just need to deal with the tars and zips via the
ImageArchive abstraction.
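
As an illustration of the idea only (this is not the ImageArchive code, and
list_archive_members is a made-up name), here is how a caller can stay
indifferent to the container type using Python's standard zipfile and tarfile
modules:

```python
import tarfile
import zipfile


def list_archive_members(path):
    """Return the sorted file names in a zip or tar, whichever it is."""
    # Check tar first: a tar that happens to end with an embedded zip
    # may also pass zipfile.is_zipfile().
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            return sorted(m.name for m in tf.getmembers() if m.isfile())
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # Skip directory entries, which end with a slash.
            return sorted(n for n in zf.namelist() if not n.endswith("/"))
    raise ValueError("not a zip or tar archive: %s" % path)
```

A module written against a helper like this sees the same member list whether
the processed jp2 archive was written as a zip or as a tar.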