converting DJVU file containing text fails

Bug #1286771 reported by Kurt Bigler on 2014-03-02
This bug affects 1 person
Affects: calibre
Importance: Undecided
Assigned to: Unassigned

Bug Description

The conversion failure has DJVU input (with embedded text) and EPUB output. It is consistently repeatable.

I have repeated the attempt numerous times both on the Mac (10.6.8) and also tried on a linuxmint-12 boot that I have set up under Parallels on the Mac.

So far I have two DJVU files on this system and neither will convert. I am focusing on one of them, for which I have done extensive and careful testing.

Here is why I believe this DJVU file should convert successfully:

* The file can be searched for text, e.g. in DjVuLibre desktop reader.

* The friend who sent me the DJVU file was able to convert the file to EPUB with the expected result.

He just sent me the DJVU file again after having just confirmed again that he could convert that very file, sending me the input and the result of the conversion. His first conversion was done with version 1.22 and the one he just re-sent was with 1.25. Most of my attempts were with 1.25 and I just now confirmed again with 1.26.

When he resent the file I confirmed it was binary identical to the file he had sent earlier for which I have the conversion problems.

Testing I did on the Mac was done with calibre settings I had altered a bit. Testing I did on linuxmint was done with default settings, except for my specifying my device as a standard Sony.

The EPUB file resulting from my conversion, as viewed in the built-in calibre viewer, contains only the calibre-generated title page followed by a blank page. The EPUB file resulting from my friend's conversion contains the full text.

It seems pretty clear we are both converting from the same source file with some variation in options and host systems, and getting different results. His conversion does look like it is from an unformatted text source, due to the lack of any sensible page structure, etc.

I tried the conversion debug logging option. The several individual resulting index.html files are all empty. I do not know how to view the other formats present. Note that the conversion progress starts at 1% and stays there for several minutes until it suddenly goes to 100%. I did not watch continuously to see if there was an intervening percentage. But generally I am used to seeing only 1%, 67%, and 100% in other types of conversions.

I am attaching the DJVU source, the failed EPUB, the metadata stored by calibre with these files in the library (those 3 comprising the contents of the library folder) and also the successful EPUB (larger file) created by my friend, and a screenshot he made when he did the conversion.

Kurt Bigler (kkbshop) wrote :

Converting DJVU files uses the program djvutxt, which is assumed to be
located at /usr/bin/djvutxt.

If that program is not found, then it falls back on an internal
implementation, which I guess is failing on your djvu file.
Since I don't maintain that internal implementation, I
can't fix it. I suggest you install djvutxt; then you will be able to
convert that djvu file.

 status wontfix

Changed in calibre:
status: New → Won't Fix
Kovid Goyal (kovid) wrote :

Actually, I took a look at the internal djvu decoder, and fixing it
should be fairly simple, so the fix will be in the next calibre release.

Fixed in branch master. The fix will be in the next release. calibre is usually released every Friday.

 status fixreleased

Changed in calibre:
status: Won't Fix → Fix Released
Kurt Bigler (kkbshop) wrote :

Thanks, I see it is basically working now. (I will send a contribution.)

Your fix brings up a question. If the "internal implementation" now does what the /usr/bin/djvutxt implementation used to do, is there a reason the djvutxt implementation is still preferred, e.g. maybe for better potential future functionality?

Otherwise it would seem to add complexity to have two implementations. If the fallback implementation is less preferred then I should point out that, assuming djvutxt is easy to install (I never found out), it would have served my purpose just as well to have had an error message produced stating /usr/bin/djvutxt is required.

Secondly (and this might belong in a separate bug item if pursued), it appears from the ability to search (the fact that search results hilite individual words in their correct location) that as a result calibre has a "complete picture" of the original, any OCR inaccuracies aside, e.g. such that heuristic processing should be fully functional. Yet I am not seeing section headings recognized. Rather they are wrapped into the paragraph text. In fact paragraph breaks are not recognized. (They have an approximately 2 "n" indent in the original.) Page headings are also not recognized, but being new to calibre I'm not sure what's expected. In my test case I increased the unwrap factor from 0.4 to 0.9 and a heading line which is about 55% of page width gets flowed in with the rest of the text, the paragraph above and below it combined. For due diligence I did a search for a word in that heading, and it is found and hilited at the correct location.

Should this go to the forum or to another bug report? Or are my expectations just not realistic? For a quick look I am attaching screenshots of sections of the input and resulting output that include the heading "Taking a Speculative Philosophy Seriously?".

djvutxt is no longer used. And the djvu file has the section headings
wrapped into its main text, in fact the text in the file is all
completely undifferentiated plain text.

Kurt Bigler (kkbshop) wrote :

"in fact the text in the file is all completely undifferentiated plain text"

Plain text with at least each word referencing coordinates of a rectangle, or a list of glyph indices from which rectangles can be recovered, since it is possible as I pointed out for a search to hilite the matching words with the correct rectangular outline.

Therefore excluding unusual cases like overlapping text or text running at 90 degrees, it is possible, fairly simple I think, to (as one strategy) enhance the undifferentiated plain text so it is instead plain text with no markups but at least sensible line breaks that could be used for heuristics.

I'm fairly sure if I have the plain text with links to glyph rectangles that I could write the code to produce what I will call the enhanced plain text.

Kovid Goyal (kovid) wrote :

Again, *undifferentiated*.

Kurt Bigler (kkbshop) wrote :

Ok, so I think you are implying the problem is that the djvu text extraction is throwing away the information. Yet the information is there prior to the extraction, right? Undifferentiated text could not be used to search and hilite a rectangle in the djvu-rendered glyphs.

So maybe what I'm suggesting translates to an enhancement to the djvu text extraction. Remember I'm only seeing the system from the outside (so far) so I'm having to guess at how functionality is partitioned. That makes it a bit tricky to communicate the idea. Can you confirm that the information is there in the overall system somewhere and that some level of implementation is throwing away the relation to the glyph locations when the text is extracted?

Kovid Goyal (kovid) wrote :

I haven't a clue. All calibre does is extract text from the text
sections of a djvu file. I have no idea whether the file contains
information about text position on screen in other parts of the file,
all I can tell you is that the text sections do not contain this
information. And that is as far as my knowledge (or interest) in djvu
goes. Patches are welcome.

Kurt Bigler (kkbshop) wrote :

Thanks for the discussion.

I suspect there is something that can be done, but may still be missing some pieces of the picture.

I could go find where djvu development lives and make some initial inquiries there.

However I'm still naive about how your heuristics work and to some degree how you handle un-marked-up text (i.e. with more helpful line breaks) and especially whether there is some example of calibre handling it well, or whether "handling well" is reserved for formats that are marked up in some way.

I don't really know what the expected scope (in terms of input formats) of the "un-wrap factor" is, for example. I should probably get a better sense of those sorts of things before marching off to djvu-land with any expectations. The calibre docs on Line-unwrap factor at least sounds to me that the heuristic is expected to be usable with raw text containing hard line breaks which are to be interpreted as intentional vs the effect of flow having already been done. So I am thinking it could be helped if line breaks could be detected from glyph position and manifested as newlines in what still may be raw text. And likewise a blank line could indicate a paragraph break, as I think used to be the case in nroff/troff although it has been a couple decades since I used them. (OTOH, I might also do better by having a way to extract djvu text with some mark-up added to it.)
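The idea in the paragraph above (newlines taken from glyph geometry, with a blank line marking a paragraph break) might look something like this. It is a minimal sketch under stated assumptions: each line arrives as a hypothetical `(xmin, ymin, xmax, ymax, text)` tuple in reading order for one page, y grows upward as in the djvutxt output, and the thresholds `indent_ratio` and `gap_factor` are invented for illustration and would need tuning on real books.

```python
from statistics import median


def enhanced_text(lines, indent_ratio=0.02, gap_factor=1.5):
    """Turn positioned lines into plain text with paragraph breaks.

    A blank line is emitted before a line that is indented relative to the
    left margin, or that follows an unusually large vertical gap.
    """
    if not lines:
        return ""
    left = min(l[0] for l in lines)              # left margin of the page
    width = max(l[2] for l in lines) - left      # width of the text block
    height = median(l[3] - l[1] for l in lines)  # typical line height
    out = [lines[0][4]]
    for prev, cur in zip(lines, lines[1:]):
        # y grows upward, so the gap is previous bottom minus current top.
        gap = prev[1] - cur[3]
        indented = (cur[0] - left) > indent_ratio * width
        if indented or gap > gap_factor * height:
            out.append("")                       # blank line = paragraph break
        out.append(cur[4])
    return "\n".join(out)
```

Headings would typically come out as one-line paragraphs of their own, since they tend to be set off by extra vertical space.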

Does this seem reasonable? Is there anywhere I can read up to get a better sense of what calibre likes to see in its input to be most effective? It may be I need something a little beyond the user documentation.

What I would most like is to know that I have some basic sense of it, which maybe you can confirm in some way, and then just go do it (something with djvu) and not have a long belabored investigation. That way I could actually contribute something, would have time to. Otherwise there is risk I'd not get to anything useful.

Any comments appreciated. Maybe you know how it is to be an outsider on something and how hard it can be to get to the very most basics of a thing because it is all so implicit to those who know about it. I can hardly tell from out here whether I am making good guesses or not, or would need to immerse myself for a full month (which might never happen).

Kovid Goyal (kovid) wrote :

There are no heuristics applied to djvu input. You can apply the
general conversion heuristics, but those are designed for HTML, not
plain text.

If you want to do analysis of text based on rendering position, the
place to do that is in the DJVU input plugin, before calibre's
conversion pipeline is engaged.

In other words, you need to find out how to extract rendering
information from the djvu format and then modify the code in
plugins/djvu_input.py to use that information to generate the best
possible HTML from the raw text you can. The calibre conversion
pipeline will take care of converting the HTML into whatever output
format is specified.
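As a rough illustration of that last step, the following sketch turns plain text whose paragraphs are separated by blank lines into a small XHTML document. The function name `text_to_xhtml` is hypothetical, and this is not the code in plugins/djvu_input.py; it only shows the general shape of such a conversion.

```python
import re
from html import escape


def text_to_xhtml(text, title="Untitled"):
    """Wrap blank-line separated paragraphs in <p> tags inside XHTML."""
    paragraphs = []
    for chunk in re.split(r"\n\s*\n", text):
        joined = " ".join(chunk.split())  # unwrap the lines inside a paragraph
        if joined:
            paragraphs.append("<p>%s</p>" % escape(joined))
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head><title>%s</title></head>\n"
        "<body>\n%s\n</body>\n</html>\n"
    ) % (escape(title), "\n".join(paragraphs))
```

Escaping the text matters here: OCR output can easily contain `&` or `<`, which would otherwise make the result invalid.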

Kurt Bigler (kkbshop) wrote :

Aha, that makes it all fairly well-contained.

I don't know python at all, but can probably get around that.

Thanks.

Kovid Goyal (kovid) wrote :

You don't have to write it in python. If you can write it as
a standalone (C/C++) program that outputs the text as UTF-8 encoded HTML that is
sufficient (just be sure to use a portable subset of C/C++ as it has to be
compilable with gcc 4.2 and Visual Studio 2008). The program would have
to be licensed with a GPL-3 compatible license to be included with
calibre.

Although, IMO, writing text processing in C/C++ would be a lot more work
than just learning python.

Kurt Bigler (kkbshop) wrote :

Regarding the HTML required for the calibre pipeline, any particular flavor? Simple old html will do?

And do you have any representation for page breaks in the event that page headers/footers might ultimately be detected by heuristics? Or if I detect such, should I mark headers/footers with any particular styles, etc.?

***

Incidentally I see that in the example I've been playing with the geometry info is present at the word level via djvutxt -detail. I got that tip from an initial inquiry I made at the DjVuLibre project (discussion) on sourceforge.

Kurt-Biglers-iMac:~ kurt$ djvutxt -detail \[Isabelle_Stengers\]_thinking\ with\ whitehead.djvu | head -15
()
(page 0 0 2864 4937
  (line 132 4358 2724 4492 (word 132 4358 990 4492 "THINKING")
    (word 1118 4362 1556 4490 "WITH")
    (word 1684 4362 2724 4490 "WHITEHEAD") )
  (line 324 3936 2534 4052 (word 324 3962 410 4050 "A")
    (word 468 3960 702 4048 "Free")
    (word 740 3960 964 4052 "and")
    (word 1016 3958 1276 4052 "Wild")
    (word 1326 3958 1808 4050 "Creation")
    (word 1858 3936 2010 4050 "of")
    (word 2024 3936 2534 4050 "Concepts") )
  (line 342 3094 1630 3186 (word 342 3094 904 3186 "ISABELLE")
    (word 972 3094 1630 3186 "STENGERS") )
  (line 336 2814 1576 2906 (word 336 2834 786 2906 "Translated")
Kurt-Biglers-iMac:~ kurt$
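Output of this shape can be read with a very small S-expression parser. The sketch below assumes word-level detail as in the listing above; the names `parse_sexpr` and `iter_lines` are hypothetical, and only the `\"` and `\\` string escapes are decoded (any other escape sequences are left untouched).

```python
import re

# Parentheses, double-quoted strings, or bare atoms (symbols and numbers).
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse S-expressions into nested Python lists."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            stack[-1].append(done)
        elif tok.startswith('"'):
            stack[-1].append(re.sub(r'\\(["\\])', r"\1", tok[1:-1]))
        elif tok.lstrip("-").isdigit():
            stack[-1].append(int(tok))
        else:
            stack[-1].append(tok)
    return stack[0]


def iter_lines(node):
    """Yield (xmin, ymin, xmax, ymax, text) for every line node found."""
    if not isinstance(node, list) or not node:
        return
    if node[0] == "line":
        words = []
        for child in node[5:]:
            if isinstance(child, str):
                words.append(child)  # line-level detail: text sits directly here
            elif isinstance(child, list) and child[:1] == ["word"]:
                words.extend(c for c in child[5:] if isinstance(c, str))
        yield (*node[1:5], " ".join(words))
    else:
        for child in node:
            yield from iter_lines(child)
```

Applied to the listing above, the first tuple yielded would carry the rectangle 132 4358 2724 4492 and the text "THINKING WITH WHITEHEAD".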

Kovid Goyal (kovid) wrote :

It is always best to output valid XHTML; that way you can be sure that the
rest of the pipeline will parse the html correctly, but the pipeline is
perfectly capable of handling tag soup.

If you identify headers and footers, I suggest removing them entirely,
they have no place in ebook formats.

If you want to specify a page break in HTML, simply use the standard CSS
page-break-before/after properties.
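For illustration, a page break could be attached to the wrapper element of each source page after the first, as in this sketch. The function name `pages_to_body` and the choice of one `<div>` per page are assumptions made for the example, not something the pipeline requires.

```python
from html import escape


def pages_to_body(pages):
    """Render a list of pages, each a list of paragraph strings, as HTML.

    Every page after the first starts with a CSS page break.
    """
    parts = []
    for index, paragraphs in enumerate(pages):
        style = ' style="page-break-before: always"' if index else ""
        parts.append("<div%s>" % style)
        parts.extend("<p>%s</p>" % escape(p) for p in paragraphs)
        parts.append("</div>")
    return "\n".join(parts)
```

Breaking at every source page is probably only useful for things like chapter starts; for running text it would usually be better to emit no break at all.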

Kurt Bigler (kkbshop) wrote :

(Should this conversation continue in a different context?)

The results of what I do will end up in the public and your domain. Given your experience you may have some input on the following.

For extracting the geometry from a DjVu file it appears there are at least 3 possible basic approaches.

* Use command-line djvutxt -detail per my post above and then locate (rather than re-invent) an S-expression parser to get access to the page/line/word info.

* There are also DjVu XML tools which can also extract text and geometry. I have not investigated but it sounds like functionality is similar to djvutxt -detail. Public XML parsers and bindings for them are perhaps more readily available than S-expression parsers.

* There is also a DjVu API which provides similar functionality and apparently will allow gathering the same info in an (automatically) multi-threaded implementation. But I am not sure if it makes sense to have multiple threads running on different pages of the same DjVu input. I haven't checked whether there are python bindings for the API. If not, it probably means "good ol' C" and even if I'm permitted that approach, it may have implications for ultimate maintainability.

I can look into all that more but will appreciate preliminary thoughts. I also don't mind benefiting from your experience in ways that will make my life easier! But also the net performance of calibre (with DjVu input) may be affected by these choices.

Also I probably should read up on the calibre development scenario. But if I develop a C plugin on the Mac am I accepting responsibility for a certain set of ports (e.g. to Windows and Ubuntu)? You mentioned portability, but that is only theory until it becomes a fact. I'm not a seasoned open-source developer, or what I've done before was in a more contained scenario. I don't want to be lazy but the inputs help a lot! Thanks.

Kovid Goyal (kovid) wrote :

I suggest either of the two following approaches:

1) Add some code to calibre to directly read the position data from the
djvu file (from what little I have seen of djvu it seems to be a fairly
simple format)

2) Use the djvu xml producing tool to output XML and use the lxml
package (part of calibre) to parse it.

The only things you will have to check is that the djvu XML tool is not
too large and is, at least, nominally, compilable on all three platforms
and has a compatible license. I can take care of compiling it on the
other platforms as part of the calibre build process, provided that it
is nominally buildable and comes with some build scripts for the
different platforms.

1) is a bit more work, but is preferable, since I suspect that using
the djvu XML tool will mean adding a very large block of code into
calibre for what is a fairly simple job.

If I am wrong about the djvu format being relatively simple, then (2)
becomes preferable.

You can also start out with (2) and switch to (1) at the end, once you
have finished the layout detection work; that way you don't need to put in
much upfront work before implementing the most difficult part of the
process.

You can use either a C or python implementation for (1). Most likely
python will be fine, since I doubt this bit will be performance critical
(the current DJVU plugin has a C implementation for the decompressor,
since implementing that in python is very slow, but everything else is in
python).
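Approach (2) might look roughly like the sketch below. It rests on assumptions that should be checked against real output of the XML tool: that lines appear as `LINE` elements containing `WORD` elements, and that each word carries a comma-separated `coords` attribute. The embedded sample is hand-written for illustration, not captured from the tool. The standard library's `xml.etree.ElementTree` is used so the example runs anywhere; `lxml.etree` offers the same `fromstring` and `iter` calls.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the element and attribute names are assumptions.
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="132,4492,990,4358">THINKING</WORD>
<WORD coords="1118,4490,1556,4362">WITH</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def words_by_line(xml_text):
    """Return one list per LINE element, holding (coords, text) per WORD."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("LINE"):
        words = []
        for word in line.iter("WORD"):
            raw = word.get("coords", "")
            coords = tuple(int(v) for v in raw.split(",") if v)
            words.append((coords, (word.text or "").strip()))
        lines.append(words)
    return lines
```

The meaning and order of the numbers in `coords` is deliberately left uninterpreted here, since it would have to be confirmed from the tool's documentation.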

If you wish to move this discussion to email, you will find my email
address all over the calibre source code.

Kovid Goyal (kovid) wrote :

The heuristics that apply to HTML do not render the HTML. If you hover
your mouse over that option in the conversion dialog, it will tell you
that it does so using punctuation and other clues. Therefore, using
layout information has to be done in the input plugin stage.

Indeed, the unwrap heuristic transformation is only present at all to
deal with content that is already in HTML but has hard line breaks
because of a previous poor conversion or because the HTML was
deliberately formatted that way by some misguided person.
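To make the idea of unwrapping by "punctuation and other clues" concrete, here is a deliberately simplified sketch. It is not calibre's algorithm: the function `unwrap` and its rule (a paragraph ends at a line that is short relative to the longest line, or that ends in sentence punctuation) are assumptions chosen only to show why such a heuristic needs no layout information.

```python
import re

# A line "ends a sentence" if it finishes with . ! or ?, optionally
# followed by a closing quote or bracket.
SENTENCE_END = re.compile(r"[.!?][\"')\]]?$")


def unwrap(lines, factor=0.4):
    """Join hard-wrapped lines into paragraphs using length and punctuation."""
    stripped = [line.strip() for line in lines]
    longest = max((len(s) for s in stripped), default=0)
    threshold = factor * longest
    paragraphs, current = [], []
    for s in stripped:
        if not s:                      # blank line always ends a paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(s)
        if len(s) < threshold or SENTENCE_END.search(s):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A rule like this sees only characters and line lengths, which is why a heading whose words run straight into the body text, with no line break at all, cannot be recovered at this stage.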
