Speed up ProcessJP2 module

Bug #158435 reported by raj
Affects: Deriver
Status: Confirmed
Importance: Medium
Assigned to: danh

Bug Description

The jpg->cropped/skewed ppm->pgm->jp2 path is very slow.

raj (raj-archive)
Changed in deriver:
assignee: nobody → raj-archive
importance: Undecided → High
raj (raj-archive)
Changed in deriver:
status: New → In Progress
raj (raj-archive)
Changed in deriver:
assignee: raj-archive → danh-archive
importance: High → Medium
status: In Progress → Confirmed
Revision history for this message
danh (danh-archive) wrote :

The current plan is to write a binary which goes from jpg to jp2, does rotation (deskewing), cropping, and contrast enhancement along the way, and emits statistics of the distribution of values in the components.

Further, it may emit more than one jp2 (one being "OrigJP2" and another being some kind of "FinalJP2").

An input jpg can be read in through leptonica.

The rotation can be done by Leptonica's pixRotate() function as Raj currently does in mflm-rotate-am2.c (petabox/sw/books/microfilm).
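
Just to make the intent concrete, here's a minimal sketch of that rotation step in leptonica terms. The degrees-to-radians conversion and the choices of L_ROTATE_AREA_MAP and L_BRING_IN_WHITE are my illustrative assumptions here, not necessarily what mflm-rotate-am2.c or the final binary will do:
    #include <math.h>
    #include "allheaders.h"   /* leptonica */

    /* Deskew: rotate pixs by 'degrees' and return a new PIX. */
    static PIX *deskew_sketch(PIX *pixs, double degrees)
    {
        l_float32 radians = (l_float32)(degrees * M_PI / 180.0);
        /* area-map interpolation; fill exposed corners with white */
        return pixRotate(pixs, radians, L_ROTATE_AREA_MAP, L_BRING_IN_WHITE,
                         pixGetWidth(pixs), pixGetHeight(pixs));
    }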

The cropping can be done by Leptonica's pixRasterop() function, defined in rop.c. It would be desirable to do this in-place, but i don't see a way to do that short of hand-coding it. I believe that to keep everything maintainable we want to avoid hand-coding wherever possible, so we'll just use pixRasterop() as it stands (meaning we have to do one extra allocation of pixels for the cropped output).
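
A sketch of the crop along those lines (the pixCreate() here is the one extra allocation just mentioned; the function name and simplifications are only for illustration):
    /* Crop the rectangle (x, y, w, h) out of pixs into a new PIX. */
    static PIX *crop_sketch(PIX *pixs, l_int32 x, l_int32 y, l_int32 w, l_int32 h)
    {
        PIX *pixd = pixCreate(w, h, pixGetDepth(pixs));  /* the extra allocation */
        if (!pixd)
            return NULL;
        /* copy the source rectangle into the destination, upper-left at (0,0) */
        pixRasterop(pixd, 0, 0, w, h, PIX_SRC, pixs, x, y);
        return pixd;
    }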

The contrast enhancement can be done by Leptonica's pixTRCMap() function, defined in enhance.c. (This will do the same kind of contrast enhancement we currently do using pnmnorm with the bvalue and wvalue options. The way it works in Leptonica is that you supply a map [i.e., 256 values] which is applied to the source to produce the destination. So if you want 30 and below to go to 0, and 240 and above to go to 255, then you just populate the map that way [with 31 through 239 being linearly remapped using a linear map sending 30 to 0 and 240 to 255].)
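
A sketch of building that kind of map and handing it to pixTRCMap(); the black/white cut points are parameters (30 and 240 in the standard case), and the truncating linear ramp is just my reading of the description above:
    /* Build a 256-entry map sending [0..bval] -> 0 and [wval..255] -> 255,
     * with a linear ramp in between, then apply it in place to pixs. */
    static void contrast_enhance_sketch(PIX *pixs, l_int32 bval, l_int32 wval)
    {
        NUMA *na = numaCreate(256);
        l_int32 i, v;

        for (i = 0; i < 256; i++) {
            if (i <= bval)
                v = 0;
            else if (i >= wval)
                v = 255;
            else  /* linear ramp; truncation, no rounding */
                v = (255 * (i - bval)) / (wval - bval);
            numaAddNumber(na, v);
        }
        pixTRCMap(pixs, NULL, na);  /* NULL mask: apply everywhere */
        numaDestroy(&na);
    }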

Statistics can be gathered while the Leptonica output is sent to the back end (where they all have to be touched anyway while being permuted). The statistics, per Hank, include the min, max, mean, and standard deviation of each channel.
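
For illustration, here's a sketch of gathering those statistics for one channel, written with pixGetPixel()/extractRGBValues() for clarity rather than inside the permutation loop itself; the normalization to [0,1] and the struct name are my assumptions:
    #include <math.h>

    typedef struct { double min, max, mean, std; } ChanStat;

    /* Min, max, mean, std of one channel of a 32 bpp RGB pix,
     * normalized to [0,1].  chan: 0 = r, 1 = g, 2 = b. */
    static ChanStat channel_stats_sketch(PIX *pixs, int chan)
    {
        l_int32 w = pixGetWidth(pixs), h = pixGetHeight(pixs);
        l_int32 x, y, r, g, b;
        l_uint32 val;
        double n = (double)w * h, sum = 0.0, sumsq = 0.0;
        ChanStat s = { 1.0, 0.0, 0.0, 0.0 };

        for (y = 0; y < h; y++) {
            for (x = 0; x < w; x++) {
                pixGetPixel(pixs, x, y, &val);
                extractRGBValues(val, &r, &g, &b);
                double p = (chan == 0 ? r : chan == 1 ? g : b) / 255.0;
                if (p < s.min) s.min = p;
                if (p > s.max) s.max = p;
                sum += p;
                sumsq += p * p;
            }
        }
        s.mean = sum / n;
        s.std = sqrt(sumsq / n - s.mean * s.mean);
        return s;
    }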

Because we want multiple outputs, i plan for the program to take multiple sets of arguments, each set consisting of a rotation (if any), crop (if any), enhancement arguments (if any), request for statistics (if any), and output file name with backend arguments. So calling in general will be something like:
    program infile rotation/crop/enhancement/stats/output1 rotation/crop/enhancement/stats/output2 ...

All image-changing operations will be done in the order specified, and everything will be cumulative (to avoid any extra copy steps).

I think we'll probably end up using it like
    program infile output1 rotation/crop/enhancement/stats/output2
where output1 is our OrigJp2, and output2 is our processed jp2, but it's easier to program it more generally so that we use one chunk of code repeatedly.

If anybody sees any problems with this approach, or has any advice or changes, please let me know.

Revision history for this message
danh (danh-archive) wrote :

Regarding the contrast enhancement:

Leptonica and pnmnorm (which is the program currently used for contrast enhancement) seem to use the same algorithm, at least in the gray case. (The basic algorithm is that you create some map which sends [0, ... 255] to [0, ... 255] with the lower numbers getting pushed closer to 0 and the higher numbers getting pushed closer to 255.) The difference in implementation is that leptonica does not round, whereas pnmnorm does round. I enhanced an image using both leptonica and pnmnorm, with results in ~/recs/images/contrast_enhancement/. This directory also includes information on how to try out other images if desired. There are 7 remap values at which they differ in the particular map i tried (which is our standard 30-240/black-white map, as defined in petabox/www/common/OrigJp2ImOp.inc). I don't think i can tell the difference (except using software)---the gray values are 1 notch apart---and so i plan to just use the map leptonica provides to make the code easier to maintain.

But anybody please advise if this is too incautious (but for now the train is rolling ahead).

Revision history for this message
danh (danh-archive) wrote :

Here's a tentative plan for invocation of the program.

Please let me know of ways it could be more convenient (e.g., by environment variables, files, or any other mechanism) or just better (e.g., more consistent with other programs we use or have written, or more logical, or just easier to invoke for whatever reason).

The basic idea is that you give the program one input jpg file, and specify leptonica operations (zero, one, or more) as well as one or more output jp2 files. For every output file we can gather statistics on the distribution of the pixels which go into it (because we have to handle all the samples in the leptonica-to-back-end interface, so can count them while we handle them).

The overall command would be:
program {leptonica args} {outfile and backend args} {more leptonica args} {another outfile and backend args} ...

(Here the curly-braces are just for visibly grouping the args, but can't be used in practice since the shell would eat them.)

The leptonica args would be zero or more of
  --force_gray
  --rotate degrees
  --crop x y w h
  --contrast_enhance b w
  --stat outfile

The outfile and backend args would be of the form
  --start_jp2 outfile.jp2 {backend arguments} --end_jp2

The backend arguments would be passed through to the backend untouched.

So a typical call might be something like
program infile.jpg --start_jp2 outfile_orig.jp2 -rate 0.4 --end_jp2 \
  --rotate 1.2 --crop 10 12 4500 5000 --contrast_enhance 30 240 --stat statfile --start_jp2 processed.jp2 -rate 0.4 --end_jp2

Here, for reference, -rate 0.4 is a backend argument which the backend interprets as a target compression rate: 0.4 bits per pixel. This would produce two outputs, one of which was just a compressed form of the input, and one of which was deskewed, cropped, and enhanced (and then compressed).

For the format of the statfile, i'm tentatively planning to make it consist of key=value lines, where the keys are [rgb]{min,max,mean,std} and the values are normalized to lie in the interval [0,1]. So typical keys would be rmin, gstd, bmean, etc. For grayscale, we could just leave off the [rgb] in the key name.
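
So, under that scheme, a statfile for a color image might look something like this (the numbers are made up, purely to show the shape):
    rmin=0.011765
    rmax=1.000000
    rmean=0.613201
    rstd=0.204418
    gmin=0.007843
    gmax=1.000000
    gmean=0.598722
    gstd=0.211034
    bmin=0.000000
    bmax=0.996078
    bmean=0.571190
    bstd=0.219873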

But we could instead emit xml, or any other kind of output --- basically, whatever would be easiest for another program to parse. (I plan to do the stats last, so there's time to think about just what's easiest to read.)

(One more item for reference here: we can't quite use getopt() because we really want to pass the backend arguments to the backend just as they arrive to us [except for the name of the outfile, which will always be present, so we can just pack that up ourselves]. If we use getopt, it will consume arguments intended for the backend, which has its own scheme for arguments.)
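
A rough sketch of the kind of hand-rolled scan that implies: leptonica-side flags get interpreted directly, and everything between --start_jp2 <outfile> and --end_jp2 is collected untouched for the backend. Only two flags are shown, and the whole thing is illustrative rather than the actual parser:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void scan_args_sketch(int argc, char **argv)
    {
        for (int i = 2; i < argc; i++) {          /* argv[1] is the input jpg */
            if (strcmp(argv[i], "--rotate") == 0 && i + 1 < argc) {
                double degrees = atof(argv[++i]);
                printf("queue rotation of %g degrees\n", degrees);
            } else if (strcmp(argv[i], "--start_jp2") == 0 && i + 1 < argc) {
                const char *outfile = argv[++i];
                /* pass everything up to --end_jp2 to the backend verbatim */
                while (++i < argc && strcmp(argv[i], "--end_jp2") != 0)
                    printf("backend arg for %s: %s\n", outfile, argv[i]);
            }
            /* --crop, --contrast_enhance, --stat, --force_gray go similarly */
        }
    }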

Thanks in advance for any suggestions.

Revision history for this message
danh (danh-archive) wrote :

This is to note two things: (1) how to test that we have the right spatial resolution information stored, and (2) that we plan to put in exif data anyway, even though we'll have the resolution stored in a standard way.

(1) Testing the spatial resolution:
If you run the jhove program (http://en.wikipedia.org/wiki/JHOVE for a description; home page http://hul.harvard.edu/jhove/), in the form:
   jhove my_file.jp2
it will report a lot of metadata. If the jp2 box for the capture resolution is filled, part of this data will appear in
   JPEG2000Metadata > Codestreams > Codestream > NisoImageMetadata
in the fields SamplingFrequencyUnit, XSamplingFrequency, YSamplingFrequency.

For example, the sampling frequency unit might be centimeters, and the frequencies would in this case be in dots per centimeter.
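
(So, for instance, a scan captured at 300 dpi would show up there as roughly 300 / 2.54, i.e. about 118 dots per centimeter, assuming the writer converted correctly when it filled in the box.)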

If the jp2 box for the capture resolution is not filled, you won't see this. If the resolution is put in some other box, such as display resolution, you'll see something, but not the specific resolution numbers you put in.

The exif tool will also show the capture resolution info, but the "identify -verbose" command does not show this (ImageMagick version 6.2.4).

(Note that use of the capture box is called for in the National Digital Newspaper Program JPEG 2000 profile.)

(2) Now, even though we've already rigged up the software to store the resolution this way (for color images, but grayscale will call the same functions i think), we still plan on inserting exif data for compatibility with existing tools (and the data we put in the header will still be a tiny, tiny percentage of the total file size).

[As usual, if anybody has any suggestions, advice, cautions, etc, please let me know, and thanks in advance.]

Revision history for this message
danh (danh-archive) wrote :

Just to make sure i'm not on the wrong page here:

For handling exif data: we will put in stubs for an interface to do this, but will not do it yet (because the Mekel does not produce exif data in the jpgs it emits). If and when the program is deployed where it does get explicit exif data in its input, then we will fill in the stubs to handle it (trying to take advantage of Hank's expertise here).

However, we will copy the comments from the jpg to the jp2 (since these are the only jpg metadata that we don't yet capture).

Will also put in some provision to copy an xml file into an xml box in the jp2 (which is needed for conformance with the National Digital Newspaper Program's specs for jp2 files: a little bibliographic data is required). (And this xml file will be generated by the calling php code.)

Otherwise plan to continue largely as specified above.

Thanks in advance for any feedback, especially if something looks fishy about this.

Revision history for this message
danh (danh-archive) wrote :

For reference, i'm coding as outlined above, and anticipate getting the c portion done shortly (day or two).

When wrapping in php, we'll need to create an xml file according to the National Digital Newspaper Program profile (which will be saved as an xml box in the output jp2).

This file is described in part 5 of the JPEG 2000 Profile for the National Digital Newspaper Program (with an example in appendix F): http://www.loc.gov/ndnp/pdf/NDNP_JP2HistNewsProfile.pdf

I think the basic intent is to have embedded in each jp2 page image the:
    newspaper name
    location of publication
    date of publication
    page label
    Library of Congress (LOC) Catalog number

The profile makes it sound like this is all available from the marc information ("Dublin Core"). For some of this (e.g., the page label, which i think could be strings like "B2") i'll need some advice on how to proceed.

Just for reference, the details of the proposal from the LOC include these items:

   (1) Library of Congress catalog number for the serial ("normalized")
   (2) Date of publication (CCYY-MM-DD)
   (3) "Edition Order"
   (4) "Page Sequence Number"
   (5) Title
   (6) "Page label" (which i think would typically be a page number)
   (7) "Responsible Organization" (looks like sponsor)
   (8) Reel number
   (9) Reel sequence number

SO: i will need some advice on this a little later this week, and thanks in advance everybody for your suggestions or other help.

Revision history for this message
danh (danh-archive) wrote :

The c code is checked into cvs now. If you refresh your tree, you should see it in petabox/sw/books/microfilm/jpgtojp2. If you type "make" in that directory, it should build. If it doesn't, please let me know.

Next step is to patch the php layer to use it in case of microfilm newspapers. For that, please send me any advice or cautions or concerns you have.

(Just for the record, for the c code itself, i think the biggest deficiency is that it doesn't copy exif data. But for the mekel microfilm, there isn't any exif data, so my plan is to postpone dealing with that until after the microfilm loop is completely closed. Hope to discuss the issue with Hank then. And there are lots of other ways it could be improved. For example, some of the libraries it uses, like leptonica, aren't part of our standard distribution, so i just pointed to copies in my home area. It is statically linked so hopefully this won't be an issue. And it is in cvs so anybody who wants to can fix or update any part of the code, or the makefile --- which could be made shorter by summarizing this pile of rules, one for every .o file, into 2 or 3 general rules.)

Revision history for this message
danh (danh-archive) wrote :

Talked with Raj about how to patch the php layer for calling the c program (to produce the original jp2s and the processed [deskewed, cropped, contrast-enhanced] jp2s).

The idea is that we'll make a new subclass of BookItemOperation, i.e., a book op, but we'll put it in the pre-derive operations so it will actually get invoked from the deriver. (That way, per Tracey, we'll keep the heavy processing in derive.) For reference, putting it on the pre-derive list means adding it to the $ops array defined in function preDeriveOperations in petabox/www/common/BookItemOperation.inc.

We'll see if we can get by asking Marcus (or whoever the operator is) to just not run the RePublisherPostprocess at all. This may or may not be possible: if it is not, we'll modify RePublisherPostprocess to just produce some very minimal *_files.xml file. If it is possible, then we'll have the book op create and populate the *_files.xml file. In any event, the book op will clear out the *.jpg files produced by the Mekel microfilm scanner.

The name of the book op is tentatively DevelopMekel.

If anybody spots any problems with this plan, or has any advice or suggestions, please let me know.

Revision history for this message
danh (danh-archive) wrote :

The php patching is not yet done, but want to report a few details so that anybody who wishes can provide suggestions, warnings, advice or any other kind of feedback.

The module is in petabox/www/common/DevelopMekel.inc and it is checked into cvs. It is not yet finished, but i think the form is pretty clear so that hopefully the flaws are visible. (To actually do anything with it, you would need to check it out from cvs, as well as petabox/etc/petabox-sw-config-us.xml and the c code in petabox/sw/books/microfilm/jpgtojp2, which you can build by typing "make".) (Also note that the php passes the flag --interface_check to the c program which inhibits any actual computations from happening so that only dummy files are produced. That's for development only.)

Further, to use it, you might have to modify BookItemOperation.inc: i have not checked in my changes there because i don't want this to go live yet.

I believe what's left to do is to:
(1) get the statistics out: i'll need some info here on how this is to be packed up (tar/zip? naming?)
(2) write the *_files.xml file (Raj has a script to do this which i plan to use if at all possible)
(3) get the info from Marcus about just what Library of Congress catalog numbers we can get from our existing marc and meta xml.

Revision history for this message
danh (danh-archive) wrote :

Contrary to the earlier plan, per Raj, the best way to deal with *_files.xml is to handcraft it to include just the orig_jp2.tar, the jp2.zip, and the *_meta.xml. So i did that (via DOM), and will confirm with him that this is correct. [The idea is to give the deriver just enough info, and it can patch the *_files.xml further as needed, iiuc, but perhaps there are other details here that i'm missing.]

Also learned from Marcus (who read the spec more carefully than i did) that it is the control number we need, not a catalog number; this is indeed available in our marc for the NY Times. Per Marcus, other partners will have their own control numbers, not necessarily LOC. We'll try to stub that case reasonably for now, and flesh it out when we start processing their reels, depending on what they want.

The code as it stands is checked in as version 1.3 (petabox/www/common/DevelopMekel.inc).

Any feedback from anybody is welcome, especially if you see any problem with all of this.

Revision history for this message
danh (danh-archive) wrote :

Added the National Digital Newspaper Program (NDNP) metadata generation, and checked in the code (version 1.4 of DevelopMekel.inc). Also checked in the configuration file (petabox-sw-config-us.xml, version 1.196) and BookItemOperation.inc (version 1.43).

(Binary image is not yet checked in, will talk to Raj about that.)

Revision history for this message
danh (danh-archive) wrote :

Talked to Raj, and have one more fix to make, since this will get run every time derive runs:

(1) Check that the repub state is 4 and that one of the collections is microfilmreel. If either of these checks fails, do not error out, but just return cleanly.

(2) Modify the BookItemOperation so that the DevelopMekel is put on the list of pre-derive operations.

Revision history for this message
danh (danh-archive) wrote :

Just a little bit more we have to do to make the pipe work correctly:

(1) need to change repub state from 4 to 6, collection microfilmreel to collection microfilm, and mediatype to texts (and write out the metaxml)
(2) need to clean up area a little more (delete index.txt).

(This will, e.g., let the abbyy module see the item as microfilm and adjust its arguments appropriately.)

Revision history for this message
danh (danh-archive) wrote :

Finally ran an example NY Times item, NYTimes-Oct-Dec-1859, through the deriver, with DevelopMekel as a pre-derive operation.

At least two problems are visible:
(1) the Mekel jpgs don't get deleted
(2) the subsequent AnimatedGif module fails on an identify command.

The log is at: http://www.us.archive.org/log_show.php?task_id=23140030&full=1

For reference: the log says the identify command failed with exit status 6. I tried the command by hand on the node which did the work (i believe ia350609), and got an identify failure ("Aborted") but with status 6. On home, identify succeeds, and jhove validates the output (made by the kakadu code). On ia350609 the identify version is 6.2.3 (1/24/06), while on home the identify version is 6.2.4 (10/02/07).

I plan to work on (1) first, and would appreciate any advice on how to deal with (2) [also would appreciate any advice on how to deal with (1) for that matter, but hopefully rm/unlink is easier to deal with (?)].

Revision history for this message
danh (danh-archive) wrote :

Correction to that previous comment: when i tried identify on the node, it failed with status 134, not 6.

Revision history for this message
danh (danh-archive) wrote :

An update here: the lack of deletion of the files is due to the fact that the derive as a whole does not finish.

So making the Mekel and AnimatedGIF compatible should solve both problems (or at least remove one obstacle to solution of the first).

Per Raj, right now we're investigating whether we can make a small mod to kdu_expand to fit into the pam pipeline: kdu_expand doesn't write to stdout, and it also demands to be told in advance whether its input is gray or color, even though it already knows (it uses that knowledge to generate its own error messages).

Revision history for this message
danh (danh-archive) wrote :

Modified kdu_expand into a new program, kdutopam, per Raj, as noted above: it writes only to stdout. The -o argument triggers an error, but all other arguments are just as in kdu_expand.

It was possible to do this because of an ingenious suggestion by Raj: we just get around the kakadu suffix checking one way or another, and pass in the file name "/dev/stdout" for the output file. (This apparently is standard on ubuntu and maybe on linux generally, but i was unaware of it, being stuck in the old /dev/tty days. The only wrinkle is making sure that the regular stdout chatter doesn't also go there, but that turned out to be easy to fix: just replacing one cout with cerr.)

This works.
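
For reference, the trick in plain c terms (kakadu itself is c++, but the idea is the same); this is only an illustration of the /dev/stdout device, not the actual kdutopam change:
    #include <stdio.h>

    int main(void)
    {
        /* Open our own stdout as if it were a named output file.
         * Image bytes go here; any chatter must go to stderr instead. */
        FILE *out = fopen("/dev/stdout", "wb");
        if (!out) {
            perror("fopen /dev/stdout");
            return 1;
        }
        fputs("P5\n1 1\n255\n", out);   /* toy pgm header, just to show the idea */
        fputc(128, out);
        fclose(out);
        fprintf(stderr, "diagnostics go to stderr, not stdout\n");
        return 0;
    }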

The sources are now checked into the tree in petabox/sw/books/microfilm/kdutopam/. cd to that directory and type make to build the program. If it doesn't build for anybody, please let me know. (It's rigged up to compile on gutsy and breezy, but i haven't made libraries for the intermediates c..., edgy, feisty. Will do that if needed, but i think our practice is to build for breezy. The gutsy build is sort of just for convenience. Have tested on gutsy but not breezy yet.)

Will confer with Raj tomorrow about how to fit this in to get the animated gifs animated again for newspapers.

Revision history for this message
danh (danh-archive) wrote :

The next step is to make a zip of single page pdfs (for microfilm newspapers).

Per Hank, there are three ways this could be done:

(1) In AbbyyXML, where we do the ocr, we could set the $maxFiles parameter, which governs how many images are ocr'ed at a time, to 1. Then we'd just not concatenate them at the end, but put them in a tar file. We'd adjust the subsequent modules, DjvuXML, and LuraTechPDF, to deal with a tar (or zip) of abbyy rather than a single large one. (DjvuXML would also have to produce multiple files, to be tarred or zipped up in this version as well.) Then in the LuraTechPDF module we'd loop over the DjvuXML files and make a pdf for each one.

(2) Alternatively, we'd leave AbbyyXML and DjvuXML alone, but in the LuraTechPDF module, break up the djvuxml into individual pages, and form a pdf for each of them, using BuildPDF (itext) as it stands now (except to call it repeatedly, once per page).

(3) Or, similar to (2), we'd leave AbbyyXML and DjvuXML alone, and modify LuraTechPDF only slightly: we'd pass an argument in to BuildPDF to say that we want a zip of single page pdfs made, instead of just a single gigantic pdf.

The third option may not be so wacky because the java code already includes support for (e.g.) parsing scandata. But it may also increase the maintenance burden because it's another language to support (i.e., it may tend to split the codebase).

And of course option (1) involves touching a lot of code (and potentially losing products if we fork the paths poorly).

So for now, we're planning on going forward with option (2) unless we hit some roadblocks.

Thanks in advance for any feedback (and Hank if i've mangled up the options please post a correction, and thanks for going through the LuraTech with me).

Revision history for this message
danh (danh-archive) wrote :

Hank set up a meeting with Todd. Todd came by, showed us around the Build_PDF java code (main loop is in the BuildSearchablePDF class), fixed one bug which Hank had uncovered, and checked in all the sources to the petabox tree (root is in petabox/sw/books/buildpdf). (The bug-fixed library is not checked in yet; we want to try it out a little more on some pdf generation first.) He also showed us how to upgrade to point to the latest version of itext, which has built-in jp2 support (what we're currently using has his hand-built jp2 support).

For now the plan (per Todd) is to move the single-page zipping right into the java, because that code already steps through the djvu xml, so we should be able to skip most of the development effort there (will talk to Hank, Raj, and Steve about this).

As always, thanks for any feedback, advice, or cautions.

Revision history for this message
danh (danh-archive) wrote :

I'm summarizing the current issues so that everybody has a picture of what's going on.

The biggest problem is that the jp2s as we currently create them with DevelopMekel cause problems in the pdfs, as we currently create them with itext and read them with xpdf and acroread.

There are actually two problems: progressive jp2s are bad for xpdf (and i think evince), while embedded xml is bad for acroread. Note that the effects seem to be independent: that is, adobe can handle progressive as long as there is no xml, and xpdf can handle xml as long as the jp2 is not progressive.

(Note that the xml/progressive stuff is there in the first place to be more nearly NDNP conformant.)

We could try to solve this at the DevelopMekel end (by producing different or more products), or the itext end.

Raj says that there might be some way to do a lossless jp2-to-jp2 conversion to drop the progressive encoding, and of course we'd arrange to drop the xml at the same time if we did that: should be no need to have embedded xml inside an embedded jp2, because we'd just try to embed it directly. (Raj says that this can be done with jpg, for example.) Note also that i've tried to arrange for the backend to deposit the xml before the codestream, so hopefully it would be easy to remove.

On another matter:

For the words for which we get bad font size information, it is very easy to just drop them (we simply don't pass them along to itext). That's what i'm doing now, for my working copy of Todd's software. (But i haven't really gotten to the bottom of that problem: i wanted to see the bad characters in context, so i forced out the pdf, but when it came out, it was so awful that i had to deal with it first.)

So, my plan is for now to:
(1) fix up the single page generation and get the mechanics of that worked out, and then
(2) deal with the xml/progressive problems. (That way i can accumulate any xml/progressive feedback while i deal with the other problem.)

Thanks in advance for any feedback from anybody.

Revision history for this message
danh (danh-archive) wrote :

Talked with Brewster this morning. Looks like we can maybe fix this up by running the results through luratech compress --- at least we'll try. (We guess it will take about 40 seconds per page.)

Note that Luratech does not produce the same kind of inserted images as itext does. (For itext, these are represented as <</Filter/JPXDecode/Type/XObject/Length *****/Height ****/Subtype/Image/Width ****>>. For luratech, these are <</Linearized 1/L ****/H[**** ****]/O ****/E ****/N ***/T ******>>.) (This is based on looking at what luratech does to a reel we processed in December, which went through an intermediate jp2 stage.)

If anybody has other ideas or sees problems with this, please let me know.
