Deriver

ocr skipped on *_jp2.tar (not *_jp2.zip)

Bug #215266 reported by danh on 2008-04-10

Affects		Status	Importance	Assigned to	Milestone
	Deriver	New	Undecided	Unassigned

Bug Description

Sometimes, ocr is simply not done even though there are *_jp2.tar files available.

For example, see the item NYTimes-Nov-1902. Here's the record of its derive task:
http://www.us.archive.org/log_show.php?task_id=25299953

This may be related to Bug #174567, because it seems to be somehow tied up with how formats and file extensions interact, and how modules interpret them.

Hank Bromley (hank-archive) wrote on 2008-04-10: Re: [Bug 215266] [NEW] ocr skipped on *_jp2.tar (not *_jp2.zip)

Spent some time looking into this when Dan first mentioned it to me, and
there are actually two interrelated symptoms here: (1) the derivation
stops prematurely, after doing only ProcessJP2, and (2) if the derive is
restarted, ProcessJP2 gets repeated, and would in fact rerun on every
deriver loop if we fixed symptom (1) without addressing the underlying
cause.

The basic cause of both problems is that we've told the deriver to treat
processed jp2 zips and processed jp2 tars as two different formats. Here's
what happens:

Not only was the Abbyy module not run, but neither were AnimatedGIF or
JpgFlipBookZip, both of which the deriver should have tried making from
the processed jp2 tar, according to derivations.xml.

Furthermore, there was no derivation analysis shown in the log after
ProcessJP2 finished.

So clearly the deriver thought it was done and stopped looking for new
formats to make, even though there were more that it could have made.
There are two conditions under which the deriver thinks it's finished:
either it's done ten iterations of the derive loop (not the case here, as
it did only one iteration), or it completes a pass during which no new
files were created. Of course it did make a new file, but the new file was
a processed jp2 tar (because at 6 GB it was too big for zip), not the zip
the ProcessJP2 module was asked to make, and thus by default doesn't
count.
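The stopping rule described above can be sketched roughly as follows. This is a toy model in Python, not the deriver's actual code: `run_derive`, `MAX_PASSES`, the file names, and the convention that a module returns the set of files it wrote are all assumptions made for illustration.

```python
MAX_PASSES = 10  # the ten-iteration cap described above


def run_derive(modules, files):
    """Toy derive loop (hypothetical). `modules` is a list of
    (target_suffix, func) pairs; each func takes the current file set and
    returns the set of files it wrote. Returns the number of passes made."""
    passes = 0
    while passes < MAX_PASSES:
        passes += 1
        made_new = False
        for target_suffix, func in modules:
            if any(f.endswith(target_suffix) for f in files):
                continue  # target format already exists, nothing to do
            produced = func(files)
            files |= produced
            # Only a file in the requested target format counts as "new".
            if any(f.endswith(target_suffix) for f in produced):
                made_new = True
        if not made_new:
            break  # a pass with no counted new files ends the derive
    return passes


def process_jp2_too_big_for_zip(files):
    # Hypothetical stand-in for ProcessJP2 falling back to a tar.
    return {"item_jp2.tar"}


files = {"item_source.tar"}  # made-up source file name
passes = run_derive([("_jp2.zip", process_jp2_too_big_for_zip)], files)
print(passes, sorted(files))
```

In this model the tar is written, but because the module was asked for a zip the pass records no new file, so the loop exits after a single pass with the tar sitting unused.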

ProcessJP2 could pass the tar to the "extraTarget" function, which *would*
get the deriver to count it as a new file, and thus cause the derive to
continue beyond the first iteration (correcting symptom #1), but that
wouldn't help with symptom #2: because there is no processed jp2 zip, on
every iteration the deriver will try again to make one - and end up making
the processed jp2 tar again. And given that we'll have a new processed jp2
tar after every pass, all the other modules will repeat, too. Infinite
loop, if not for the 10-pass maximum.
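To make the two outcomes concrete, here is a small simulation, again only a sketch under assumptions: `simulate` and the `count_tar_as_new` flag are hypothetical, and the flag merely stands in for the effect attributed to "extraTarget" above. It does not call or model the real function.

```python
MAX_PASSES = 10  # the pass limit mentioned above


def simulate(count_tar_as_new):
    """Return (passes, process_jp2_runs) for a toy derive in which
    ProcessJP2 is asked for a zip but always writes a tar instead."""
    files = set()
    passes = 0
    runs = 0
    while passes < MAX_PASSES:
        passes += 1
        made_new = False
        if "item_jp2.zip" not in files:  # the zip never appears, so always true
            runs += 1
            files.add("item_jp2.tar")  # the tar is rewritten on every run
            if count_tar_as_new:  # stands in for reporting the tar as an extra target
                made_new = True
        if not made_new:
            break
    return passes, runs


stops_early = simulate(count_tar_as_new=False)
hits_cap = simulate(count_tar_as_new=True)
print(stops_early, hits_cap)
```

With the tar uncounted the model stops after one pass (symptom 1); with the tar counted, ProcessJP2 runs on every pass until the cap is reached (symptom 2).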

One ugly workaround would be to check, at the beginning of ProcessJP2,
whether there's a processed jp2 tar; if so, and if it's newer than the
sourceFile of the module, we exit immediately.
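A minimal sketch of that check, assuming file modification times are an acceptable measure of "newer". The function name `should_skip_process_jp2` and the file names in the demo are hypothetical; the real module would use whatever paths the deriver hands it.

```python
import os
import tempfile


def should_skip_process_jp2(source_path, processed_tar_path):
    """Return True if a processed jp2 tar exists and is newer than the source."""
    if not os.path.exists(processed_tar_path):
        return False
    return os.path.getmtime(processed_tar_path) > os.path.getmtime(source_path)


# Demo with throwaway files; the names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "item_source.tar")
    processed = os.path.join(tmp, "item_jp2.tar")
    open(source, "wb").close()
    skip_before = should_skip_process_jp2(source, processed)  # no tar yet
    open(processed, "wb").close()
    os.utime(source, (1000, 1000))     # force an older source mtime
    os.utime(processed, (2000, 2000))  # and a newer processed tar
    skip_after = should_skip_process_jp2(source, processed)

print(skip_before, skip_after)
```

Note that a tar with exactly the same timestamp as the source is not treated as newer here, so the module would run again in that case.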

Perhaps cleaner, although more involved, would be combining processed jp2
zip and processed jp2 tar into a single format. I don't think the deriver
should care whether a file is a zip or a tar, just that it's a processed
jp2 archive. We could let the deriver treat them as one, and let modules
deal with figuring out which one they have. Actually the pack() and
unpack() functions in ImageArchive already deal with archives generically.
So modules would just need to deal with the tars and zips via the
ImageArchive abstraction.
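As a rough illustration of the single-format idea (not the deriver's real format table), both extensions could map to one format name, so that the presence of either file satisfies the "already made" check. The format name, `format_of`, and `has_format` are all hypothetical, and matching on the suffix alone is a simplification.

```python
# Both archive types are treated as the same (made-up) format name.
PROCESSED_JP2_SUFFIXES = ("_jp2.zip", "_jp2.tar")
PROCESSED_JP2_FORMAT = "Processed JP2 Archive"


def format_of(filename):
    """Classify a file by suffix; zip and tar map to the same format."""
    if filename.endswith(PROCESSED_JP2_SUFFIXES):
        return PROCESSED_JP2_FORMAT
    return None


def has_format(files, fmt):
    """True if any file in the item already provides the given format."""
    return any(format_of(f) == fmt for f in files)


with_tar = has_format({"item_jp2.tar"}, PROCESSED_JP2_FORMAT)
with_zip = has_format({"item_jp2.zip"}, PROCESSED_JP2_FORMAT)
with_neither = has_format({"item_meta.xml"}, PROCESSED_JP2_FORMAT)
print(with_tar, with_zip, with_neither)
```

Under this scheme the deriver would see the format as present whichever archive type ProcessJP2 wrote, and each module would open it through the generic archive layer rather than assuming a zip.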
