derived two microfilm reels with unexpected results

Bug #824878 reported by bell@archive.org
Affects: Internet Archive - Tech Support
Status: Confirmed
Importance: Undecided
Assigned to: danh

Bug Description

The following two microfilm items were loaded with the 'eng' language code. In our process, microfilm reels with 'eng' are uploaded but not derived until further action is taken.

1. Both of these items were uploaded.
2. Both items were derived using the 'Item Manager' link on each item's details page.

item #1:
http://www.archive.org/details/may7186316theh
- no PDF
- no Full Text link
- no DjVu

item #2
http://www.archive.org/details/dakotaconflictof00unse
- only Read Online and PDF links
- not OCR'ed
- additionally, at one point, this item's language code was set to 'English-Handwritten'

Here is an example of a microfilm item that derived properly:
http://www.archive.org/details/reportofannualme80

Jude Coelho (judec) wrote :

Dan, do you know anything about this?

Changed in ia-techsupport:
assignee: nobody → danh (danh-archive)
status: New → Confirmed
Hank Bromley (hank-archive) wrote :

On item #1, see my email of July 13, with subject "Re: auto_submit for eng newspaper reels," sent to Jesse, Jude, Paul, Venus and Dan. It includes this (note especially the part about "just putting items into the 'newspaper' collection") . . .

= = = = = =

whether we make the single-page PDFs instead of the whole-item PDF is determined by whether the item is in the "newspapers" collection

whether the item is in "newspaper" also affects the DevelopMekel in-derive book-op (this is dan's code, so he may remember more about it), which runs near the beginning of the derive *but only for items that don't yet have jp2s*. previous newspaper scans were uploaded as jpgs and converted to jp2 by this book-op; because we now upload jp2s, it doesn't run. when it does run on a newspaper, it sets pagination=true for the item (causing the "View PDFs" widget to be displayed on the details page) and inserts metadata into each jp2 it makes, to be compliant with the NDNP standard - we currently have no path for inserting those metadata into pre-existing jp2s

just putting items into the "newspaper" collection, while still uploading them as jp2s and skipping RePublisher, yields the result Jesse mentioned (http://www.archive.org/details/december20188702dulu): we make the single-page PDFs:

http://www.archive.org/download/december20188702dulu/december20188702dulu_pdf.zip/

but there's no information available on the issue dates, and initially the "View PDFs" widget doesn't appear, either - I manually added pagination=true to this item to make it appear, but you can see the date info isn't filled in. and the individual jp2s don't have the metadata that DevelopMekel would have inserted.

= = = = = =

This item is in the newspapers collection; if you check the derive log, you'll see that, exactly as described above, we made a _pdf.zip instead of a .pdf (and aren't displaying the "View PDFs" widget because the item was uploaded with jp2s rather than jpgs). The derive log also shows that we skipped making DjVu because the item is in the newspapers collection. Without DjVu, we don't make "full text."
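The rules described above can be condensed into a small sketch. This is only a model of the behavior as stated in this comment, not the actual derive code: the `Item` class, the `derive_outcome` function, and the result keys are hypothetical names, the collection name is assumed to be "newspapers", and OCR is assumed to succeed.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for an archive item (not a real API)."""
    collections: set = field(default_factory=set)
    has_jp2s: bool = True  # True when the item was uploaded as jp2s


def derive_outcome(item, newspaper_collection="newspapers"):
    """Model the derive results described in this thread.

    Assumption: membership in the newspapers collection is the only
    switch, and OCR itself succeeds.
    """
    is_newspaper = newspaper_collection in item.collections
    # DevelopMekel only runs when the item has no jp2s yet.
    develop_mekel_runs = not item.has_jp2s
    return {
        # Newspapers get single-page PDFs in a zip, not one whole-item PDF.
        "pdf_format": "_pdf.zip" if is_newspaper else ".pdf",
        # DjVu is skipped for newspapers; full text depends on DjVu.
        "djvu": not is_newspaper,
        "full_text": not is_newspaper,
        "develop_mekel_runs": develop_mekel_runs,
        # pagination=true (and so the "View PDFs" widget) is only set
        # when DevelopMekel runs on a newspaper item.
        "view_pdfs_widget": is_newspaper and develop_mekel_runs,
    }


# Item #1 as described: in the newspapers collection, uploaded as jp2s.
item1 = derive_outcome(Item(collections={"newspapers"}, has_jp2s=True))
```

Under these assumptions `item1` comes out with a `_pdf.zip`, no DjVu, no full text, and no "View PDFs" widget, which matches what the derive log shows for item #1.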

This is getting a little surreal. I keep saying that we're not set up to process newspapers, and it would require a serious engineering effort to become able to with our current microfilm workflow - and yet people keep trying, and being surprised when it doesn't work.

Hank Bromley (hank-archive) wrote :

Item #2 is also no mystery if you check the derive log. At the time of the first derive (task 75836973), AbbyyXML said "language not currently OCRable" - that was during the time that the item had language=English-handwritten. With no OCR results, we get no "Full Text."

Much later, the language was changed to "eng" (task 82120319). Another derive was attempted afterward, but it had no remove_derived value specified, so all the derivatives were left in place, and the derive did nothing. A rederive with remove_derived value of "*abbyy.gz" (or better, "{*abbyy.gz,*djvu.xml,*.pdf,*pdf.zip}") would obtain OCR results. But then you'll still end up in the same situation as item #1.
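To see which derivatives each remove_derived value would clear, here is a rough sketch. It assumes remove_derived behaves like a shell-style glob with one optional top-level brace group (the real matching rules in the derive code may differ), and the file names are hypothetical examples, not a listing of the actual item.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand a single top-level {a,b,c} group into a list of globs.

    Assumption: no nested braces and no text outside the braces.
    """
    match = re.fullmatch(r"\{(.*)\}", pattern)
    return match.group(1).split(",") if match else [pattern]


def files_to_remove(filenames, remove_derived):
    """Return the files matching any glob in the remove_derived value."""
    globs = expand_braces(remove_derived)
    return [
        name for name in filenames
        if any(fnmatch.fnmatchcase(name, glob) for glob in globs)
    ]


# Hypothetical file list for an item named "item".
files = [
    "item_jp2.zip",
    "item_abbyy.gz",
    "item_djvu.xml",
    "item.pdf",
    "item_pdf.zip",
]

ocr_only = files_to_remove(files, "*abbyy.gz")
everything = files_to_remove(files, "{*abbyy.gz,*djvu.xml,*.pdf,*pdf.zip}")
```

With the narrow value only the OCR output is cleared; with the longer value the downstream derivatives built from it go too, while the source jp2s are left alone in both cases.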

bell@archive.org (bell-archive) wrote :

Thanks for the response, Hank!

I was unaware that these items had 'newspaper' collection added to the collection string (this was done separately because of automatic billing). The 'newspaper' designation will be removed from these items.

For this collection of reels (Minnesota Historical Society), the partner is pleased with normal OCR of these items (meaning, one whole-item PDF per item).

My questions:
- when the 'newspaper' designation has been removed from the collection string, can these items be derived/OCR'ed?
- if all items are able to be derived/OCR'ed, how long would that take to complete for 2,855 items without causing a disruption?

bell@archive.org (bell-archive) wrote :

Sorry for the rush, but the partner would like four of these reels OCR'ed by Tuesday 8/23. Will this be possible?

Hank Bromley (hank-archive) wrote :

For item #1, change the collections to remove "newspapers", then rederive with remove_derived=*pdf.zip. It shouldn't take too long to run, as it has already done OCR.

For item #2, likewise fix the collections first, then rederive. As mentioned in a previous comment, because this one needs to redo OCR the remove_derived value has to include *abbyy.gz, and the best value to use is {*abbyy.gz,*djvu.xml,*.pdf,*pdf.zip}.

How long it will take to do more from scratch is hard to say. It depends on their size and on how hard they are to OCR. Item #1 (an issue of "The Hastings Conserver") took 7 minutes per page to OCR, so that can provide a rough guideline.
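As a back-of-the-envelope aid, the 7-minutes-per-page figure can be turned into a rough estimate. The function name and the page count below are hypothetical (page counts for these reels aren't given here), and the real rate will vary with size and OCR difficulty.

```python
def estimate_ocr_hours(pages, minutes_per_page=7.0):
    """Rough OCR time in hours, assuming a constant per-page rate."""
    return pages * minutes_per_page / 60.0


# Hypothetical reel of 600 pages at the rate observed for item #1.
hours = estimate_ocr_hours(600)
print(f"{hours:.1f} hours")  # prints "70.0 hours"
```

This covers a single item processed serially; any estimate for a whole batch also depends on how many derives can run in parallel, which isn't addressed here.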
