Comment 14 for bug 1081104

Chris Rossi (chris-archimedeanco) wrote :

Use of libreoffice at the command line is fairly straightforward. I used a small .docx file sent to me by my daughter's preschool to test, as well as an NDA sent to me by a client. LibreOffice is noticeably slower than doctotext. Here are some sample times, first for doctotext:

crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.012s
user 0m0.004s
sys 0m0.004s
crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.014s
user 0m0.012s
sys 0m0.000s
crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.012s
user 0m0.004s
sys 0m0.004s
crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.012s
user 0m0.000s
sys 0m0.008s

And now for LibreOffice:

crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.479s
user 0m0.376s
sys 0m0.072s
crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.464s
user 0m0.376s
sys 0m0.064s
crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.452s
user 0m0.388s
sys 0m0.040s
crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.476s
user 0m0.360s
sys 0m0.080s

As you can see, doctotext performance is very good, usually no more than about 15 milliseconds of real time. LibreOffice performance, by comparison, is marginal, adding about half a second per document for any request that needs it. Given the performance we already put up with, it's not horribly out of bounds, but it doesn't give me a warm fuzzy feeling either. I have a fairly zippy laptop with an SSD, so performance on one of Gocept's VMs would probably be worse. We should probably have Gocept install LibreOffice on a staging or dev server so we can look at what the real-world performance might be.
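To repeat this comparison on a staging or dev server, something like the following rough sketch could be used. It reuses the two command lines shown above unchanged; the function names (time_command, summarize, compare) and the choice of four runs are my own and purely illustrative. It measures wall-clock time only, so it corresponds to the "real" figure reported by time.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=4):
    """Run argv `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard converter output; check=True raises if the command fails.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def compare(doc, runs=4):
    """Time both converters on one document and print a summary for each."""
    commands = {
        # Same invocations as in the transcripts above.
        "doctotext": ["doctotext", doc],
        "libreoffice": ["libreoffice", "--convert-to", "txt:Text", "--headless", doc],
    }
    for name, argv in commands.items():
        stats = summarize(time_command(argv, runs))
        print(f"{name}: min={stats['min']:.3f}s "
              f"median={stats['median']:.3f}s max={stats['max']:.3f}s")
```

Calling compare("TLG Garden Rules.docx") from the directory containing the file would run each converter four times. Note that the LibreOffice invocation writes its .txt output next to where it is run, as in the transcript above.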

I also used diff to compare the quality of the output. LibreOffice seems to do better than doctotext, but not in any way that impacts search. LibreOffice does a better job of preserving whitespace formatting, and for some reason doctotext discards the numbers from enumerated lists. So for example, if I have an enumerated list in my document, LibreOffice gives:

1. One thing
2. Another thing
3. And a third thing

Where doctotext yields:

One thing
Another thing
And a third thing

Given that our goal is to extract text to feed to full text search, this is not a relevant difference. I did not notice any differences in output that seemed like they might be relevant for search.
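A plain diff flags the whitespace and list-number differences even though they don't matter for search. A minimal sketch of a comparison that ignores them is below. It assumes the search index cares only about case-insensitive word tokens, which is a simplification and may not match how our actual full text indexer tokenizes. The names search_tokens and token_diff are hypothetical helpers, not part of either tool.

```python
import re
from collections import Counter

# Leading enumeration markers such as "1. " or "2) " at the start of a line.
LIST_MARKER = re.compile(r"^\s*\d+[.)]\s+", re.MULTILINE)


def search_tokens(text):
    """Count lowercase word tokens, ignoring whitespace layout and list numbers."""
    without_markers = LIST_MARKER.sub("", text)
    return Counter(re.findall(r"\w+", without_markers.lower()))


def token_diff(text_a, text_b):
    """Return (tokens only in text_a, tokens only in text_b) as Counters."""
    tokens_a = search_tokens(text_a)
    tokens_b = search_tokens(text_b)
    return tokens_a - tokens_b, tokens_b - tokens_a


# The enumerated list example from above, as each tool renders it.
libreoffice_out = "1. One thing\n2. Another thing\n3. And a third thing\n"
doctotext_out = "One thing\nAnother thing\nAnd a third thing\n"

only_libreoffice, only_doctotext = token_diff(libreoffice_out, doctotext_out)
# Both Counters are empty here: the two outputs carry the same searchable words.
print(only_libreoffice, only_doctotext)
```

Running token_diff on the full outputs of both converters for a given document would list any words that one tool extracts and the other drops, which is the kind of difference that would actually matter for search.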