Investigate switching from doctotext to a supported extractor

Bug #1081104 reported by Alexander Bittner
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
KARL3
Fix Released
Medium
Tres Seaver

Bug Description

Hi!

we see a lot of errors of the following form in our KARL error monitor:

Tue Nov 20 05:32:47 2012 ERROR mailin Error converting file /tmp/tmp46IR1W

Error converting file /tmp/tmp46IR1W

Traceback (most recent call last):
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/content/models/adapters.py", line 116, in _extract_file_data
    mimetype=context.mimetype)
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/utilities/converters/doc.py", line 39, in convert
    return self.execute('doctotext "%s"' % filename), 'utf-8'
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/utilities/converters/baseconverter.py", line 54, in execute
    close_fds=True)
  File "/usr/lib/python2.6/subprocess.py", line 623, in __init__
    errread, errwrite)
  File "/usr/lib/python2.6/subprocess.py", line 1141, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

We see this in all sub-projects (Ariadne, Oxfam, Privacy International) and it always affects files written to /tmp/somefile.

Best regards,
Alex

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Duplicate of 1080642

Changed in karl3:
milestone: none → m999
status: New → Invalid
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Oops, not a duplicate. Carlos, I'll assign this one and a similar one to you.

Changed in karl3:
assignee: nobody → Carlos de la Guardia (cguardia)
importance: Undecided → Medium
milestone: m999 → m122
status: Invalid → New
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m122 → m123
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Unfortunately OSF's budget is gone, so no more bug fixing this year. I'll put this in the "M124" pile that means, things we'll do when we start up again.

Changed in karl3:
milestone: m123 → m124
Revision history for this message
Alexander Bittner (yb) wrote :

Hello,

we just want to let you know that this bug still affects us. Fortunately, it does not appear to affect user experience; at least, we haven't received any user complaints until now :-)

Our main problem with this one is that it makes us kind of "blind" w.r.t. our error monitor check, since this error occurs many times a day. So regardless of OSF's budget, we would like to have this bug fixed.

Chris, Paul, could you help us classifying this bug and determine whether it can be/needs to be fixed or at least be handled in a way that it is ignored by the error monitor? W.r.t billing, could you provide us with a quote? We then still can decide whether to wait for a new budget from OSF, pay it ourselves (gocept) or bill it to our customer(s), respectively.

Thanks in advance,
Alex

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Subscribing Chris to this.

I'd say one thing you could do to try and recreate it is, upload a Word file on that instance and see if it gets indexed and see if the error gets triggered. My guess is, this is a "Heisenbug", one that we won't be able to recreate. We could perhaps put a little more forensics in the exception logging to see more about the actual files that are triggering the issue, though that might not be too valuable.

At a minimum, we could change it from an error that triggers the "big" alarm, into something that just logs a warning.

Revision history for this message
Alexander Bittner (yb) wrote :

Hi Paul,

this one can be closed.

We found the casue: the baseconverter (in this case) tried to convert a file using doctotext. Unfortunately, doctotext was not callable although it was present on the machine. The reason for that was that it could not be found in $PATH since it was compiled in the service user's home directory.

So the real bug was not that the file (in /tmp) was not there but that the subprocess call did not find the executable (doctotext).

Best regards,
Alex

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Thanks for the detailed and helpful response, Alex.

Changed in karl3:
status: New → Invalid
Revision history for this message
Tres Seaver (tseaver) wrote :

This bug showed on OSF's production KARL over the weekend.

Changed in karl3:
status: Invalid → Confirmed
assignee: Carlos de la Guardia (cguardia) → Tres Seaver (tseaver)
Tres Seaver (tseaver)
Changed in karl3:
milestone: m124 → m125
Revision history for this message
Tres Seaver (tseaver) wrote :

Looks to me as though gocept reset the crontab for 'osfkarl' in
production on 2013-02-13, and wiped out whatever used to put
'/srv/osfkarl/bin' on the path for the cronjobs.

Note that these errors are clogging up OSF's error monitor.

Changed in karl3:
assignee: Tres Seaver (tseaver) → Alexander Bittner (yb)
milestone: m125 → none
milestone: none → m125
Tres Seaver (tseaver)
Changed in karl3:
importance: Medium → High
Revision history for this message
Tres Seaver (tseaver) wrote :

I looked into this yesterday. The 'osideploy' stage script tries
to jam the 'doctotext' executable into the service_user's 'bin'
directory. The process is fragile:

- It used to copy things into '~'; since Fabric does not run
  'sudo -i', the $HOME variable remained the real user's homedir,
  rather than the service_user's. I pushed a couple of commits
  fixing that today.

- It copies the 'doctotext' binary into '$service_user/bin' and
  adds that directory to 'PATH' in '$service_user/.bash_profile'.

- In addition to copying the 'doctotext' executable into the
  '$service_user/bin' directory, it copies some shared libs into
  '$service_user/lib', and then adds that directory to
  'LD_LIBRARY_PATH' in '$service_user/.bash_profile'.

The real breakage is this: Since '$service_user/.bash_profile'
only gets sourced during login (or the equivalent 'sudo -i'), it
won't be used when supervisor is started via Fabric.

I did some work yesterday trying to move 'doctotext' (version
0.13) into the bulidout. It's build is odd enough (no
'configure' script, weird builds of 3rd party stuff) that I gave
up after some hacking. I unpacked the 0.13 tarball, made it a
Git repository, and pushed it to the 'karlproject' Github
organization:

 https://github.com/karlproject/doctotext

While trying to get it to build, I reported some bugs against the
package on AMD64:

- "Linker failures due to missing '-fPIC'" (Ubuntu 12.04 x86_64),
  http://silvercoders.com/mantis/view.php?id=1230

- "Make failures due to unexpanded braces" (Ubuntu 12.04 x86_64),
  http://silvercoders.com/mantis/view.php?id=1231

I tried building it in my homedir on one of the gocept Gentoo
boxes, and ran into other weird errors.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Get with gocept and find out the command-line extractor they support. Christian showed it to me at Pycon. It is something in openoffice that doesn't do the fragile server-mode PyUNO. thing. It starts up pretty fast.

Once you find out from them the package, compare the text extracted by doctotext vs. this new thing, compare startup time, and compare complexity. Give a recommendation on switching to something supported.

Changed in karl3:
assignee: Alexander Bittner (yb) → Chris Rossi (chris-archimedeanco)
importance: High → Medium
summary: - [Ariadne] mailin Error converting file
+ Investigate switching from doctotext to a supported extractor
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Theune says what he showed you was LibreOffice, using the CLI. I'll play with it and see how it does. Seems to be a very heavy weight tool, a big process needs to load before it can do anything. So we might be looking at long extract times that would require us to use a queue and a separate thread or process to do the text extraction offline, so we don't slow down user HTTP requests. I

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1081104] Investigate switching from doctotext to a supported extractor

I think we're not going to consider anything that requires a server process. I had the impression when watching it run that it started pretty fast.

--Paul

On Mar 27, 2013, at 3:34 PM, Chris Rossi <email address hidden> wrote:

> Theune says what he showed you was LibreOffice, using the CLI. I'll
> play with it and see how it does. Seems to be a very heavy weight tool,
> a big process needs to load before it can do anything. So we might be
> looking at long extract times that would require us to use a queue and a
> separate thread or process to do the text extraction offline, so we
> don't slow down user HTTP requests. I
>
> --
> You received this bug notification because you are subscribed to KARL3.
> https://bugs.launchpad.net/bugs/1081104
>
> Title:
> Investigate switching from doctotext to a supported extractor
>
> Status in KARL3:
> Confirmed
>
> Bug description:
> Hi!
>
> we see a lot of errors of the following form in our KARL error
> monitor:
>
> Tue Nov 20 05:32:47 2012 ERROR mailin Error converting file
> /tmp/tmp46IR1W
>
> Error converting file /tmp/tmp46IR1W
>
> Traceback (most recent call last):
> File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/content/models/adapters.py", line 116, in _extract_file_data
> mimetype=context.mimetype)
> File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/utilities/converters/doc.py", line 39, in convert
> return self.execute('doctotext "%s"' % filename), 'utf-8'
> File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/utilities/converters/baseconverter.py", line 54, in execute
> close_fds=True)
> File "/usr/lib/python2.6/subprocess.py", line 623, in __init__
> errread, errwrite)
> File "/usr/lib/python2.6/subprocess.py", line 1141, in _execute_child
> raise child_exception
> OSError: [Errno 2] No such file or directory
>
>
> We see this in all sub-projects (Ariadne, Oxfam, Privacy International) and it always affects files written to /tmp/somefile.
>
>
> Best regards,
> Alex
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/1081104/+subscriptions

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :
Download full text (3.3 KiB)

Use of libreoffice at the command line is fairly straightforward. I used a small .docx file sent to me by my daughter's preschool to test, as well as an NDA sent to me by a client. LibreOffice is noticeably slower than doctotext. Here are some sample times, first for doctotext:

crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.012s
user 0m0.004s
sys 0m0.004s
crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.014s
user 0m0.012s
sys 0m0.000s
crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.012s
user 0m0.004s
sys 0m0.004s
crossi@spirit:~/download$ time doctotext TLG\ Garden\ Rules.docx > test.txt
Using ODF/OOXML parser.

real 0m0.012s
user 0m0.000s
sys 0m0.008s

And now for LibreOffice:

crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.479s
user 0m0.376s
sys 0m0.072s
crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.464s
user 0m0.376s
sys 0m0.064s
crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.452s
user 0m0.388s
sys 0m0.040s
crossi@spirit:~/download$ time libreoffice --convert-to txt:Text --headless TLG\ Garden\ Rules.docx
convert /home/crossi/download/TLG Garden Rules.docx -> /home/crossi/download/TLG Garden Rules.txt using Text
Overwriting: /home/crossi/download/TLG Garden Rules.txt

real 0m0.476s
user 0m0.360s
sys 0m0.080s

As you can see, the doctotext performance is very good, usually no more than about 15 seconds of real time. LibreOffice performance, by comparison, is marginal, adding about half a second per document for any request that needs it. Given performance that we already put up with, it's not horribly out of bounds, but it doesn't give me a warm fuzzy feeling either. I have a fairly zippy laptop with an SSD, so probably performance on one of Gocept's VMs would be worse. We should probably have Gocept install LibreOffice on a staging or dev server so we can look at what the real world performance might be.

I also did use diff to look at quality of output. LibreOffice seems to do better than doctotext, but not in any way that impacts search. LibreOffice does a better job of preserving whitespace formatting and for some reason doctotext is discarding numbers from enumerated lists. So for example, if I have an enumerated list in my document, LibreOffice gives:

1. One thing
2. Another thing
3. And a third thing

Where doctotext yields:

One thing
Another thing
And a third t...

Read more...

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

15 *milliseconds* for doctotext. Pardon the typo. Launchpad has now way to edit comments. Boo.

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

FWIW, a thread for text extraction and indexing would not be hard to pull off.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Chris, can you run it on a gocept server and get times after it has run once?

Changed in karl3:
status: Confirmed → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Christian Theune has told me there will be an installation we can test with soon. I'm assigning this ticket to him so he can let us know when it's ready.

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Christian Theune (ct-gocept)
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Sounds good.

--Paul

On Mar 28, 2013, at 1:33 PM, Chris Rossi <email address hidden> wrote:

> Christian Theune has told me there will be an installation we can test
> with soon. I'm assigning this ticket to him so he can let us know when
> it's ready.
>
> ** Changed in: karl3
> Assignee: Chris Rossi (chris-archimedeanco) => Christian Theune (ct-gocept)
>
> --
> You received this bug notification because you are subscribed to KARL3.
> https://bugs.launchpad.net/bugs/1081104
>
> Title:
> Investigate switching from doctotext to a supported extractor
>
> Status in KARL3:
> In Progress
>
> Bug description:
> Hi!
>
> we see a lot of errors of the following form in our KARL error
> monitor:
>
> Tue Nov 20 05:32:47 2012 ERROR mailin Error converting file
> /tmp/tmp46IR1W
>
> Error converting file /tmp/tmp46IR1W
>
> Traceback (most recent call last):
> File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/content/models/adapters.py", line 116, in _extract_file_data
> mimetype=context.mimetype)
> File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/utilities/converters/doc.py", line 39, in convert
> return self.execute('doctotext "%s"' % filename), 'utf-8'
> File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/utilities/converters/baseconverter.py", line 54, in execute
> close_fds=True)
> File "/usr/lib/python2.6/subprocess.py", line 623, in __init__
> errread, errwrite)
> File "/usr/lib/python2.6/subprocess.py", line 1141, in _execute_child
> raise child_exception
> OSError: [Errno 2] No such file or directory
>
>
> We see this in all sub-projects (Ariadne, Oxfam, Privacy International) and it always affects files written to /tmp/somefile.
>
>
> Best regards,
> Alex
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/1081104/+subscriptions

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Just for the record, Tres re-mentioned LibreOffice's quickstart mode, where it leaves a little server process around. I'm allergic to that as I read (albeit a long time ago) that it was pretty fragile. He said, though, that if supervisor could be taught to just kill the thing periodically, perhaps it worth thinking about.

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

I'll look into quickstart mode.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Theune wrote:

"""
ok, we reviewed the libreoffice situation. The following command took 8 seconds
on a VM on a medium-powered host:

soffice --headless --convert-to txt:Text Windows\ Update\ Explained.docx
That was a 12 page document. If libreoffice has already been run recently this
time goes down to 1 second.

Chris Rossi also recommended that we just look into deploying the static
compiled package from the doctotext guys. Which in turn causes me to ask: why
are you guys compiling it for your setup and not just use the static compiled
package there?

I played around with compiling it and was very turned off by the fact that we
would have to do makefile-editing and that the makefile thing starts
downloading its dependencies from the internet "randomly". :(
Wanna chat?
"""

...and Tres replied:

"""
1 second should be fine. I think there is a '--quickstart' option to
start the process backgrounded (I don't know whether that would work with
supervisord, for instance).

The actual 'osideploy' code puts that static copy in
/src/<servicename>/bin. I played with trying to add it instead to the
buildout, but without success.

In the end, we just need to get that binary on the path somehow. I don't
much care how, as long as all the relevant services (karl, mailin) and
cronjobs (indexing?) have it available.
"""

Changed in karl3:
milestone: m125 → m126
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Theune suggested:

We found that doctotext provides a wrapper script that sets up the binary path.
Using this would make sense for us since this also allows for abstracting
platform-specific settings (e.g. for OS X) away.

We think we got a reasonable setup that can be used with setting PATH but
leaving LD_LIBRARY_PATH as of this:

https://bitbucket.org/gocept/multikarl.batou/src/e23948647defa4dc771f6adcb917ea951d1ca877/components/ariadne/component.py?at=default#cl-109

and

https://bitbucket.org/gocept/multikarl.batou/src/e23948647defa4dc771f6adcb917ea951d1ca877/components/ariadne/doctotext.sh?at=default

Changed in karl3:
assignee: Christian Theune (ct-gocept) → Tres Seaver (tseaver)
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Documenting a note from Theune:

Quick update from our side. We're going to start adding $HOME/bin and
$HOME/.local/bin to login shells automatically soon.

For cron environments we suggest adding BASH_ENV=/etc/bash/bashrc to get the
same behaviour - unfortunately there is no way (except starting to interfer
deeply with cron) to get this consistent.

Also, from our perspective, putting a wrapper script into ~/bin works nicely
and lets us keep everything else from the binary package in a single sandbox.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Tres, I verified that all production/staging app servers:

- Have ~/bin/doctotext

- The appropriate user account has ~/bin in the PATH as the first hop

- When logged in as the appropriate user, 'cd /tmp; which doctotext' finds the one in ~/bin

- ~/bin/doctotext --help executes

I do worry that, despite all this, there were errors in the error monitor for production yesterday about the path to 'doctotext'. There is a slim chance it was whining "No such file or directory" on the file being provided to doctotext, not doctotext itself. Perhaps we can make a change that better captures the error.

I'd like us to re-evaluate this after today's production update:

- When fixing the cron jobs to do digest, let's also do the step Theune suggested about BASH_ENV=/etc/bash/bashr

- We try uploading a .docx file and see if there's an error

- If not, we close this ticket and call it a day

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

No errors since last update, I'm declaring victory on this one.

Changed in karl3:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.