ocr skipped on *_jp2.tar (not *_jp2.zip)

Bug #215266 reported by danh
Affects   Status   Importance   Assigned to   Milestone
Deriver   New      Undecided    Unassigned

Bug Description

Sometimes OCR is simply not done even though *_jp2.tar files are available.

For example, see the item NYTimes-Nov-1902. Here's the record of its derive task:
http://www.us.archive.org/log_show.php?task_id=25299953

This may be related to Bug #174567, since it seems tied to how formats and file extensions interact and how modules interpret them.

Hank Bromley (hank-archive) wrote : Re: [Bug 215266] [NEW] ocr skipped on *_jp2.tar (not *_jp2.zip)

Spent some time looking into this when Dan first mentioned it to me, and
there are actually two interrelated symptoms here: (1) the derivation
stops prematurely, after doing only ProcessJP2, and (2) if the derive is
restarted, ProcessJP2 gets repeated, and would in fact rerun on every
deriver loop if we fixed symptom (1) without addressing the underlying
cause.

The basic cause of both problems is that we've told the deriver to treat
processed jp2 zips and processed jp2 tars as two different formats. Here's
what happens:

Not only was the Abbyy module not run, but neither were AnimatedGIF or
JpgFlipBookZip, both of which the deriver should have tried making from
the processed jp2 tar, according to derivations.xml.

Furthermore, there was no derivation analysis shown in the log after
ProcessJP2 finished.

So clearly the deriver thought it was done and stopped looking for new
formats to make, even though there were more that it could have made.
There are two conditions under which the deriver thinks it's finished:
either it's done ten iterations of the derive loop (not the case here, as
it did only one iteration), or it completes a pass during which no new
files were created. Of course it did make a new file, but the new file was
a processed jp2 tar (because at 6 GB it was too big for zip), not the zip
the ProcessJP2 module was asked to make, and thus by default it doesn't
count as a new file.

ProcessJP2 could pass the tar to the "extraTarget" function, which *would*
get the deriver to count it as a new file, and thus cause the derive to
continue beyond the first iteration (correcting symptom #1), but that
wouldn't help with symptom #2: because there is no processed jp2 zip, on
every iteration the deriver will try again to make one - and end up making
the processed jp2 tar again. And given that we'll have a new processed jp2
tar after every pass, all the other modules will repeat, too. Infinite
loop, if not for the 10-pass maximum.
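
To make the two symptoms concrete, here is a rough Python model of the
derive loop as described above. It is only a sketch, not the deriver's
actual code: the rule table, the format labels, the derive() function and
the count_extra flag (standing in for passing the tar to "extraTarget")
are all made up for illustration, and the rule that a target is remade
when it is missing or older than its source is my assumption.

```python
MAX_PASSES = 10  # the ten-iteration cap on the derive loop

# Hypothetical rule table: each module reads one format and is asked to make
# another. "produces" is what it really writes; for ProcessJP2 on a large
# item that is a tar instead of the requested zip.
RULES = [
    {"module": "ProcessJP2", "source": "jp2 tar",
     "target": "processed jp2 zip", "produces": "processed jp2 tar"},
    {"module": "Abbyy", "source": "processed jp2 tar",
     "target": "abbyy", "produces": "abbyy"},
]


def derive(rules, files, count_extra=False, max_passes=MAX_PASSES):
    """Simulate the loop; return a log of (pass number, module) pairs.

    files maps a format label to a modification tick (assumed model).
    count_extra=True mimics reporting the tar as an extra target, so that
    it counts as a new file even though it is not the requested format.
    """
    files = dict(files)
    clock = max(files.values(), default=0)
    log = []
    for pass_no in range(1, max_passes + 1):
        # Plan the pass from what exists at its start: a module runs if its
        # source exists and its target is missing or older than the source.
        todo = [r for r in rules
                if r["source"] in files
                and files.get(r["target"], -1) < files[r["source"]]]
        counted_new = False
        for rule in todo:
            clock += 1
            files[rule["produces"]] = clock
            log.append((pass_no, rule["module"]))
            if rule["produces"] == rule["target"] or count_extra:
                counted_new = True
        if not counted_new:
            break  # a pass with no counted new file ends the derive
    return log


if __name__ == "__main__":
    # Symptom 1: the tar does not count, so the derive stops after one pass.
    print(derive(RULES, {"jp2 tar": 0}))
    # Symptom 2: with the tar counted, ProcessJP2 reruns on every pass.
    print(derive(RULES, {"jp2 tar": 0}, count_extra=True))
```

In this model the first call runs ProcessJP2 once and stops, and the
second keeps going until the pass cap, remaking the tar each time and
dragging the downstream module along with it.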

One ugly workaround would be to check, at the beginning of ProcessJP2,
whether there's a processed jp2 tar; if so, and if it's newer than the
sourceFile of the module, we exit immediately.
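
The check itself would be small. Roughly, in Python (a sketch only: the
function name and the file paths are hypothetical, and the real module
would use whatever the deriver already provides for locating files):

```python
import os


def processed_tar_is_current(source_file, processed_tar):
    """Return True if processed_tar exists and is newer than source_file."""
    if not os.path.exists(processed_tar):
        return False
    return os.path.getmtime(processed_tar) > os.path.getmtime(source_file)
```

ProcessJP2 would call something like this first and return right away when
it gives True, so an existing up-to-date tar stops the module from
remaking it on every pass.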

Perhaps cleaner, although more involved, would be combining processed jp2
zip and processed jp2 tar into a single format. I don't think the deriver
should care whether a file is a zip or a tar, just that it's a processed
jp2 archive. We could let the deriver treat them as one, and let modules
deal with figuring out which one they have. Actually the pack() and
unpack() functions in ImageArchive already deal with archives generically.
So modules would just need to deal with the tars and zips via the
ImageArchive abstraction.
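
As an illustration of the idea (not of ImageArchive itself, whose pack()
and unpack() signatures I am not reproducing here), a module-side helper
that does not care which container it was handed might look like this in
Python. The function name archive_members is made up for the example; it
uses only the standard zipfile and tarfile modules.

```python
import tarfile
import zipfile


def archive_members(path):
    """Return the sorted file names inside a zip or tar, whichever it is."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # Skip directory entries, which zip lists with a trailing slash.
            return sorted(n for n in zf.namelist() if not n.endswith("/"))
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            return sorted(m.name for m in tf.getmembers() if m.isfile())
    raise ValueError(f"not a zip or tar archive: {path}")
```

A caller gets the same list of page images either way, which is the
property the deriver would rely on if zip and tar were one format.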
