Right-to-left in Scribe scanning, for Yiddish, Korean, etc. text

Bug #340203 reported by robert-miller
Affects: Scribe2
Status: Fix Released
Importance: High
Assigned to: danh
Milestone: (none)

Bug Description

We need this for both Yiddish books and Korean texts. Scanning will start at the end of March.

Matt Work (mwork)
Changed in scribe2:
assignee: nobody → raj-archive
danh (danh-archive) wrote :

From conversation with Robert, our intention is for the books
to be scanned starting with the right cover and working left.

Hank has made some recommendations for this project,
which we intend to try out:

* add a checkbox for "right-to-left book" on the biblio item-loading screen at
http://www.us.archive.org/biblio; the effect should be to prepopulate the
Metaform with the appropriate value (in this case, page-progression = rl), just
as we already do for the other input fields on the biblio screen
(scanningcenter, collection, contributor, sponsor)

* at the outset of imaging, check for page-progression = rl in meta.xml, and if
present, set a flag; then on lines 225 and 236 of
petabox/www/datanode/scribe/ScriblioProcess.inc, we set the 'num' value
accordingly (currently the right image unconditionally gets the +1; if the rl
flag is set, we just +1 the left image instead)

He then says that he "think[s] the 'num' value then propagates through jpg and jp2 creation".
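
A minimal sketch of the second recommendation, written in Python rather than the PHP of ScriblioProcess.inc. The function name `leaf_nums`, the `meta` dictionary, and the assumption that spread k covers nums 2k and 2k+1 are all hypothetical; the actual code at lines 225 and 236 is not shown in this report.

```python
def leaf_nums(spread_index, right_to_left=False):
    """Return (left_num, right_num) for the two images of one spread.

    Assumption: spread k covers nums 2k and 2k+1. For left-to-right books
    the right image gets the +1; with the rl flag set, the left image
    gets it instead.
    """
    base = 2 * spread_index
    if right_to_left:
        return base + 1, base
    return base, base + 1


# Hypothetical stand-in for the page-progression value read from meta.xml.
meta = {"page-progression": "rl"}
rl_flag = meta.get("page-progression") == "rl"

for k in range(3):
    print(k, leaf_nums(k, rl_flag))
```

Whether 'num' then propagates through jpg and jp2 creation is exactly the open question Hank raises, so this only illustrates the numbering step.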

danh (danh-archive) wrote :

i think this is one i should be doing, but i can give it up if that's not right.

Changed in scribe2:
assignee: raj-archive → danh-archive
robert-miller (robert-archive) wrote : Re: [Bug 340203] Re: right to left in scribe scanning- for yiddish, korean, , ect text

thanks dan!

robert


raj (raj-archive)
Changed in scribe2:
importance: Undecided → High
danh (danh-archive) wrote :

Implemented the first half of Hank's program above, which is to put
in a checkbox in the loader's first screen of the biblio gui and write
that out to the meta xml.

The details involve adding the checkbox to the form (BiblioForms.inc),
capturing and carrying its value through the javascript (biblio.js), and
ultimately submitting it.

This has to work both for when you do a biblio search but also for
when you do a "roll-your-own": we anticipate that there may be
issues in locating bibliographics for some right-to-left books.

Finally, on the metaform page, the value has to be editable
as are the other values on that page.

The changed files are:
 petabox/www/common/BiblioForms.inc (1.39)
 petabox/www/common/BiblioSubmit.inc (1.8)
 petabox/www/datanode/biblio.php (1.55)
 petabox/www/petabox/includes/biblio.css (1.19)
 petabox/www/petabox/includes/biblio.js (1.19)

The second half of Hank's program is not yet implemented.

danh (danh-archive) wrote : two possible paths for right-to-left info to flow into image loop, 340203

Hi Book Processing,

We plan to scan some books right-to-left.

Hank wrote up what i think is a good plan for doing this;
i copied it into the bug report for this project:
    https://bugs.launchpad.net/scribe2/+bug/340203

The big picture is:
   (a) loader determines whether book is left-to-right
       (English) or right-to-left (Arabic, Hebrew, Korean, etc)
   (b) loader indicates this to biblio tool
   (c) scanner scans book either left-to-right or right-to-left
       as indicated by the biblio tool.

This means that the piece of info (one bit, indicating whether
it is a right-to-left book or not) has to make its way from
(b) [the loader station] to (c) [the scanner station].

There are two commonly used ways this could happen, as far as i know:
   (i) through the metadata table directly
  (ii) via the archive (as in, loader writes to the item, scanner
       reads it out).

As far as i know, all communication of this sort currently
goes through path (i).

However, to send this new information through path (i) would
take a new column in the metadata table (or a take-over of an
existing one).

Path (ii) has the problem of making a call on the item in the
archive, which adds a point of failure (although we'll do it
only once), and this point of failure affects all books
whether right-to-left or left-to-right.

However, i'm planning on doing that anyway, because adding columns
to the metadata table is so problematic.

To reduce the probable failure rate, i intend to add a configuration
flag so that this code will only execute at scancenters or on scribes
where we know right-to-left scanning will be done.

If anybody objects, or has other possibilities in mind, please
reply before about 2:00 Wednesday, 3/18/2009. (I have to be gone
Wednesday morning, but in the afternoon i plan to be here and
pursuing path (ii). The first half of Hank's plan is done, i.e.,
the pipeline is already laid from biblio to the metadata, and
getting it back out starts tomorrow afternoon i hope.)

dan

danh (danh-archive) wrote : right-to-left scandata, handedness

I've modified the scribe code to respond to right-to-left books, under
the assumption that they are scanned starting at the right "back" cover
of the book. Suppose we have a book whose pages and covers
read in left-to-right order:

blank cover 0 1 2 3 cover blank

I'll abbreviate this b c 0 1 2 3 c b.

Then in right-to-left scanning it would appear as:
   c b
   2 3
   0 1
   b c

I've set up the scandata to be written as:
  leaf  contents  hand   rotate
   0    b         left    90
   1    c         right  -90
   2    3         left    90
   3    2         right  -90
   4    1         left    90
   5    0         right  -90
   6    c         left    90
   7    b         right  -90

Note that in left-to-right books,
right goes with 90 and left with -90,
but this makes the images upside down for
right-to-left books.

Note also that the first two columns must be
as i've written them, but the handedness
column need not be. However, if i choose
the opposite handedness, then the republisher
splits spreads, because, after all, what we
call left should be on the left. That is,
for the example above we'd get 5 spreads
instead of 4, plus some warnings.

Although i have not done it yet, i intend
to also write a flag into the scandata that
this is a right-to-left book (and use the
same notation as what is written in the *_meta.xml).

Note that this works for the 1-up pdf, but the 2-up
pdf gets the pages flipped. That i think i can fix,
although i have not done so yet.

Here's an example book---note that i have not
tried at all to make it nice, i just want to get
the ideas out:

http://www.archive.org/details/zdanh_test_017n_rl

(And note also that there's more coding to do
and checking, per Eric, to make sure we can
insert, etc.)

So: i'd like feedback on the way i did the scandata,
or anything else. In particular, i'd like to know
if this is breaking any other tools.

I haven't checked anything in yet, so the situation
is still pretty fluid. That means that right now
we can take another convention on the handedness
or anything else.

But i do intend to do something about the pdf generation,
so i'd like to know of any problems, especially if
i should do the handedness in the opposite sense.

(I had to change several pieces of code because, naturally,
there were strong underlying assumptions about how leaf
numbers map to sides and angles.)

Hank Bromley (hank-archive) wrote :

I don't know specifically that it would break anything besides the pdf (which, as you say, you could fix), but it does seem undesirable to me to reverse the handedness tags in this way. If a page appears on the left half of a spread in the original book, I'd think we'd want the scandata to say "left", not "right."

How involved would it be to fix RePublisher, or otherwise work around the difficulties it has with right-to-left books? That seems a better solution than having to do the opposite of what scandata handedness indicates when rotating the image, when making the pdf, and when doing anything else that might depend on the handedness info (the bookreader? djvu?).

danh (danh-archive) wrote : Re: [Bug 340203] Re: right to left in scribe scanning- for yiddish, korean, , ect text

Hank, thanks very much for this reply.

I was and am having a lot of grave doubts about doing the
handedness this way.

However, i think the new book reader handles it in stride---
it opens right to the "back" [rightmost] page and
you can step through it flipping the pages leftward. And
the 1-up in the new book reader also works.

The republisher prefers that the first page have a certain
handedness, but it is software, and so we can fix that i'm sure.

I have the code snapshotted, so i guess what i will do is
try it out the other way and distribute something.

dan

PS: Thanks for replying directly through the bug report
mechanism so that we can capture all of this.


danh (danh-archive) wrote :

Per discussion with Hank, changed the scheme back so that what is physically
on the left side is marked LEFT in the scandata, and what is physically on the
right side is marked RIGHT in the scandata. (This still requires touching several
files because the code generally, and until now rightly, determined the
handedness solely on the basis of the page index.)

An example is http://www.archive.org/details/zdanh_test_017q_rl
The example is kind of crummy because i managed to cut off
the odd page numbers in scanning. However the covers show what's
going on i think.

The new book reader does fine on it.

The pdf still has the problem of putting pages on the wrong side, but i think
we can deal with this (i.e., fix it) without too much heartache.

Hank Bromley (hank-archive) wrote :

Are you sure the pdf isn't right? We had right-to-left pdfs working for the Yiddish books, and with your current version of handedness, the scandata should be pretty much the same as what we created for the Yiddish books.

Remember that, so far as we knew, only Adobe Reader respected the R2L directive, and in order for Reader to display properly, you need to have both "Two-Up" and "Show Cover Page During Two-Up" set (under View / Page Display).

I just viewed your new pdf with those settings, and it looks right to me.

mangtronix (mang) wrote :

Re your comments above (https://bugs.launchpad.net/scribe2/+bug/340203/comments/6) it looks like you're testing your RTL code with a book that's actually LTR? That's confusing enough to me that I can't verify your changes. You could use a RTL test book (or just make one with some paper, hand numbering the pages) to make it easier to follow what's going on.

The image captured on the left hand side of the bed should always be marked as "left" in the scandata. Similar for right. (This assumes that the book is not captured upside-down.)

The increasing pages indicated in the book (printed "Page 1", "Page 2", etc) should have increasing leaf numbers. This ensures that 1-up readers show the pages in sequential order.

  - mang

danh (danh-archive) wrote :

Mang, thanks for your assessment (https://bugs.launchpad.net/scribe2/+bug/340203/comments/11).

I did indeed test it with an LTR, and at first it was confusing because i thought that the book reader somehow had managed to detect that and reverse the order (but of course on reflection an LTR and an RTL should look the same except for where you start).

In any event you have a real good idea about just having numbered pages which i'll do for any future test.

Regarding the left/right scandata marking (i.e., left is physically left, right is physically right) what you (and Hank) say is probably right and certainly makes more sense in terms of any least-surprises principle. So that's what the code does now.

danh (danh-archive) wrote :

We had a meeting and adopted a scheme from Matt to deal with right-to-left
books by scanning them upside down, and then correcting the scandata during
republisher-checkin where we already modify it.

Raj wrote an outline of what the code should do, and this is it:
  (1) At processing outset write an element into the scandata
       indicating that this is a right-to-left book, and as part of
       this that the scandata modification has not yet taken place.
  (2) In RePublisher-checkin.php there already is a loop which
       goes over every page. We will write a second loop over
       every page which:
          swaps handside
          fixes the rotate angle
          fixes the crop box
          fixes the skew
       Note that the scandata only gives instructions for later
       processing so we won't actually have to do any processing
       at this point.
  (3) Clear the flag set in (1) indicating that scandata modification
        has not yet been done.
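
The flag handling in steps (1) to (3) can be sketched as follows, in Python rather than the PHP of RePublisher-checkin.php. The dictionary layout, the key `needs_rectification`, and the `fix_page` callback are hypothetical names chosen for illustration; the point is only that the flag makes the second loop safe to run more than once.

```python
def rectify_scandata(scandata, fix_page):
    """Apply fix_page to every page once, guarded by a flag."""
    if not scandata.get("needs_rectification"):
        # Not a right-to-left book, or the correction was already made.
        return scandata
    scandata["pages"] = [fix_page(p) for p in scandata["pages"]]
    scandata["needs_rectification"] = False  # step (3): clear the flag
    return scandata


def swap_hand(page):
    """Toy per-page fix: swap the hand side only."""
    flipped = "right" if page["hand"] == "left" else "left"
    return {**page, "hand": flipped}


scandata = {
    "needs_rectification": True,  # step (1): set at processing outset
    "pages": [{"hand": "left"}, {"hand": "right"}],
}
rectify_scandata(scandata, swap_hand)
rectify_scandata(scandata, swap_hand)  # second call changes nothing
print(scandata)
```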

We need to clean out or disable the changes we've made for right-to-left scanning.

For reference, here are the version numbers and files we've changed. We've
omitted the leading 1.

33 Bimp.inc
07 checkJp2.php
34 RePublisher-scribe.php
66 RePublisher.js
04 Scribe2Finish.inc
06 Scribe2.inc
06 ScriblioConfig.inc
47 Scriblio.inc
29 scriblio.js
16 ScriblioProcess.inc
27 ScriblioXML.inc

siznax (siznax) wrote : Re: [Bug 340203] two possible paths for right-to-left info to flow into image loop, 340203

On 3/17/09 6:04 PM, danh wrote:
 > However, i'm planning on doing that anyway, because
 > adding columns to the metadata table is so problematic.

i don't believe this is true.

/<email address hidden>

danh (danh-archive) wrote : i think skew angle should be unchanged for 180 degree rotation in plane

Hi Raj,

Thanks for coming up with a plan to implement Matt's upside-down
scanning design.

For reference, Matt's idea was to turn right-to-left books
upside down and then treat them as left-to-right, with
adjustments to be made in subsequent processing. At our
meeting we came to a consensus that we'd make these adjustments
before anything hit the cluster (due to concerns both you
and Hank had).

The algorithm you suggested to me was:
  (1) Prior to republishing activity, add a flag to the
      scandata file signifying that the contents of that
      file would have to be modified.
  (2) In RePublisher-checkin.php add code to look at that
      flag, and if it was set, loop over the per-page data
      in the scandata file and:
      (a) swap handsides (left -> right, right -> left)
      (b) fix the rotate degrees (basically negate the +/- 90
          degree angles)
      (c) fix the skew
      (d) fix the crop box
  (3) Clear the flag set in step (1).

You also said that skew was applied first, then cropping, and
that the scandata functioned, at least in part, as a set of
instructions for further processing.

For further context, i believe that you told me about a year
ago that the rotation we do takes place about the center
of the image.

So i've attempted to code up your implementation (twice
in fact, as the first time i think i got it wrong).

My belief is that the skew angle actually remains
exactly the same.

That's just because if you rotate the plane 180 degrees
about any point, the slope of every line remains
unchanged (although where it lies, where it crosses
the coordinate axes, etc will all change).
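
As a worked check of the claim above (not part of the original message): a 180 degree rotation about any point $(c_x, c_y)$ sends each point to its reflection through that point, so the slope between any two points on a non-vertical line is unchanged.

```latex
\[
(x, y) \;\mapsto\; (2c_x - x,\; 2c_y - y)
\]
For two points $(x_1, y_1)$ and $(x_2, y_2)$ with $x_1 \neq x_2$, the slope after rotation is
\[
m' = \frac{(2c_y - y_2) - (2c_y - y_1)}{(2c_x - x_2) - (2c_x - x_1)}
   = \frac{-(y_2 - y_1)}{-(x_2 - x_1)}
   = \frac{y_2 - y_1}{x_2 - x_1}
   = m .
\]
```

The center $(c_x, c_y)$ cancels out, which is why the result does not depend on the rotation being about the center of the image.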

Further, the coordinates of the corners of the
crop boxes all change according to the formulas
   x_new = total_width - (x_old + crop_width)
   y_new = total_height - (y_old + crop_height)
so i computed the new crop box corner according
to these formulas.

Since the scandata serves as a set of instructions
for image adjustment, if the original crop box was
correct (i.e., if it was originally correct to
rotate about the center by a certain angle, and then
crop to a certain box) the new crop box should also
be correct because it describes geometrically the
same sequence of instructions to be performed
on the image.
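
A minimal sketch of the per-page adjustment described above, in Python for illustration. The record keys (`hand`, `rotate`, `skew`, `crop`) and the function name are hypothetical and do not reflect the real scandata schema; the sketch also assumes the crop box is stored as (x, y, width, height) in the same coordinate frame as the total width and height.

```python
def rectify_page(page, total_width, total_height):
    """Return a corrected copy of one page record for an upside-down scan."""
    x, y, w, h = page["crop"]
    return {
        # (a) swap handsides
        "hand": "right" if page["hand"] == "left" else "left",
        # (b) negate the +/- 90 degree rotation
        "rotate": -page["rotate"],
        # (c) skew is left unchanged, per the slope argument
        "skew": page["skew"],
        # (d) x_new = total_width - (x_old + crop_width), same for y
        "crop": (total_width - (x + w), total_height - (y + h), w, h),
    }


page = {"hand": "left", "rotate": -90, "skew": 0.5,
        "crop": (100, 200, 1000, 1500)}
fixed = rectify_page(page, total_width=2000, total_height=3000)
print(fixed)

# Applying the same correction twice returns the original record,
# as expected for a 180 degree rotation.
assert rectify_page(fixed, 2000, 3000) == page
```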

Here's an example of a right-to-left book (hand-made
per Mang's suggestion) where the crops are all very
non-centered but seem to survive the processing:
   http://www.archive.org/details/zdanh_test_018h_rl

(I can show the original to anybody who wants to see it.)

Here's an example of a (fake!) right-to-left book,
scanned upside down, which i think shows that the text seems
to be deskewed from the originals:
   http://www.archive.org/details/zdanh_test_018f_rl

Thanks in advance for any corrections or advice or
criticism from anybody about this.

dan

danh (danh-archive) wrote : discussion summary, state modelled, scandata, status, deficiencies

1. Discussion summary:

    We had a meeting Monday (3/30) to decide how to scan right-to-left
    books. Brewster, Matt, Robert, Eric, Raj, Hank, and i participated.

    There were three proposals of how we could scan:
       left-to-right and right-side-up
       left-to-right and upside-down
       right-to-left and right-side-up

    We had been pursuing the last of these using a design Hank
    came up with and described in:
       https://bugs.launchpad.net/scribe2/+bug/340203/comments/1
    We had not yet modified republisher to be able to reshoot
    spreads (the right-to-left mechanism scanned right-to-left
    and right-side-up ends up grouping the images in pairs
    offset-by-one from the way left-to-right right-side-up does).

    To avoid doing this, the decision was made to scan
    right-to-left books by going left-to-right
    and upside-down, per Matt, as described in
       https://bugs.launchpad.net/scribe2/+bug/340203/comments/13

    Hank pointed out that one advantage of scanning right-side-up
    is that it is natural and extensible. Scanners whose native
    language is right-to-left and who are scanning works in
    their own language may be able to work more easily
    right-side-up.

    Subsequently i had a private discussion with Steve who advised
    trying not to seal off a general solution, and to keep enough
    state around so that it would be possible to add another
    scanning mode if the need arose. We also discussed what
    state would be necessary to describe the scanning process,
    as well as other related questions.

2. State Modelled:

    We've attempted to leave hooks in suitable places in the
    code to record these pieces of information:
    (a) the page-progression (left-to-right or right-to-left)
    (b) the direction of scanning (left-to-right or right-to-left)
    (c) the vertical orientation (right-side-up or upside-down)

    Note that (a) represents a property of the book alone,
    while (b) is a property of the scanning process alone.
    (c) is also a property of the scanning process, but to
    determine whether a book has been scanned upside down
    you have to know something about the book's contents.

    We do not now actually use all of this state. Rather,
    as soon as we detect that the page-progression is
    right-to-left, we set (or rather, leave) the direction of
    scanning as left-to-right and set the vertical orientation
    to be upside-down. But hopefully we've left enough
    hooks around that we can add another scanning scheme
    as needed or desired.

    On the scribe end this information is now basically handled only
    by putting up an alert to the scanner that the material is
    right-to-left and should be scanned upside-down---and
    also writing out the scandata.
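
The three pieces of state and the current rule for choosing them can be sketched like this, in Python for illustration. The class name, field names, and the value "lr" for left-to-right books are assumptions; only the value "rl" appears in this report.

```python
from dataclasses import dataclass


@dataclass
class ScanState:
    page_progression: str        # (a) property of the book: "lr" or "rl"
    scanned_right_to_left: bool  # (b) property of the scanning process
    scanned_upside_down: bool    # (c) property of the scanning process


def plan_scan(page_progression):
    """Current rule: rl books are scanned left-to-right and upside-down."""
    is_rl = page_progression == "rl"
    return ScanState(
        page_progression=page_progression,
        scanned_right_to_left=False,  # direction of scanning is left as-is
        scanned_upside_down=is_rl,
    )


print(plan_scan("rl"))
print(plan_scan("lr"))
```

Keeping (b) and (c) as separate fields, even though only one combination is used today, is what would leave room for another scanning scheme later.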

3. Scandata handling:

    The scandata produced by the scribe for a right-to-left book
    includes a fragment like
      <globalHandedness>
        <page-progression>rl</page-progression>
        <scanned-right-to-left>false</scanned-right-to-left>
        <scanned-upside-down>true</scanned-upside-down>
        <needs-rectification>true</needs-rectification>
      </globalHandedn...


danh (danh-archive) wrote :

This was fixed per the activity log about a year ago, so marking it as fix-released.

Changed in scribe2:
status: New → Fix Released