Beautiful Soup hangs when using the lxml parser under Apache mod_wsgi
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
I just found this problem with bs4 4.0.0b10 and the lxml parser (running in a virtualenv) under Python 2.7 and WSGI. Say you have the following WSGI application script:
#! /usr/bin/env python
import os, sys, site
# NOTE: the paths and settings module below are placeholders; the
# original values were truncated in this report.
site.addsitedir('/path/to/virtualenv/lib/python2.7/site-packages')
root_dir = os.path.dirname(__file__)
project_dir = os.path.join(root_dir, 'project')
sys.path.append(root_dir)
sys.path.append(project_dir)
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
#from django.core.handlers.wsgi import WSGIHandler
#application = WSGIHandler()
from bs4 import BeautifulSoup

def application(environ, start_response):
    status = '200 OK'
    html = '<html><head><title>hi</title></head><body></body></html>'
    soup = BeautifulSoup(html, 'lxml')  # hangs here under mod_wsgi
    output = "if you see this then it's ok"
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]
Then, on each web request that goes through this handler, BeautifulSoup never responds (or takes very long, i.e. > 1 min) to parse even the simplest of HTML strings. The exact part that hangs is this line in bs4/builder/_lxml.py:
def feed(self, markup):
which in my case corresponds to a call into an lxml.etree parser instance.
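Stripped of bs4's machinery, the failing path boils down to driving lxml's feed-parser interface. This is a sketch, not bs4's actual code; the HTML string mirrors the one used elsewhere in this report:

```python
from lxml import etree

html = '<html><head><title>hi</title></head><body></body></html>'

# bs4's lxml tree builder pushes the markup into an etree parser
# via the feed interface; under mod_wsgi this is where it stalls.
parser = etree.HTMLParser()
parser.feed(html)
root = parser.close()  # returns the root element of the parsed tree
print(root.findtext('.//title'))  # -> hi
```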
I thought about possible problems with other dependencies in my project, but that's not the case, since the handler script I just shared loads nothing but bs4 (beyond os, sys and site). I also tried to replicate the problem without calling bs4, with this WSGI handler:
#! /usr/bin/env python
import os, sys, site
# NOTE: paths and settings module are placeholders for the
# truncated originals.
site.addsitedir('/path/to/virtualenv/lib/python2.7/site-packages')
root_dir = os.path.dirname(__file__)
project_dir = os.path.join(root_dir, 'project')
sys.path.append(root_dir)
sys.path.append(project_dir)
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
#from django.core.handlers.wsgi import WSGIHandler
#application = WSGIHandler()
from lxml import etree
from bs4 import BeautifulSoup

def application(environ, start_response):
    status = '200 OK'
    html = '<html><head><title>hi</title></head><body></body></html>'
    parser = etree.HTMLParser()
    parser.feed(html)  # driven directly, this does not hang
    output = 'it works muahaha!'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]
And it works! I deliberately imported etree first and then bs4 as well, suspecting that bs4 somehow did some weird override of lxml's etree import, but no luck there: the direct etree call worked perfectly either way.
So the problem seems to lie with the lxml.etree parser instance that bs4 creates internally.
By the way, the error reported by Apache in the error log is the typical
[Tue Mar 06 18:11:25 2012] [error] [client 201.214.88.158] Premature end of script headers: script.wsgi
usually associated with EXPAT problems, as noted here: http://
The Apache version is:
Server version: Apache/2.2.17 (Ubuntu)
under Ubuntu:
Linux project 2.6.38-8-virtual #42-Ubuntu SMP Mon Apr 11 04:06:34 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
Outside WSGI there's no problem. I tried it in the Django shell, keeping Django around in case the problem was with it rather than with WSGI:
$ python manage.py shell
>>> from bs4 import BeautifulSoup
>>> html = '<html><head><title>hi</title></head><body></body></html>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.find_all('title')
[<title>hi</title>]
(The same happens when simply calling python, with no Django involved.)
It's the same HTML snippet I used to test under WSGI, and there it fails with any HTML document, long or short.
I'm able to reproduce this. Some more information:
* The presence of a virtual environment doesn't matter.
* The reference Python WSGI environment does not have the problem. Apache mod_wsgi does.
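To illustrate the second point, a handler shaped like the one in the report runs fine under the pure-Python reference WSGI stack (wsgiref). A minimal Python 3 sketch; the HTML string follows the report:

```python
from bs4 import BeautifulSoup

def application(environ, start_response):
    # Same shape as the report's handler; under a pure-Python WSGI
    # server BeautifulSoup(html, 'lxml') returns promptly.
    status = '200 OK'
    html = '<html><head><title>hi</title></head><body></body></html>'
    soup = BeautifulSoup(html, 'lxml')  # no hang outside mod_wsgi
    output = str(soup.title).encode('utf-8')
    start_response(status, [('Content-Type', 'text/html'),
                            ('Content-Length', str(len(output)))])
    return [output]

# To serve it with the reference server:
#   from wsgiref.simple_server import make_server
#   make_server('', 8000, application).serve_forever()
```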
I believe this is an instance of the interaction between mod_wsgi and Cython described here:
https://techknowhow.library.emory.edu/blogs/branker/2010/07/30/django-lxml-wsgi-and-python-sub-interpreter-magic
"Unfortunately, given that Cython-based libraries are incompatible with sub-interpreters, and given that mod_wsgi uses sub-interpreters, it follows logically that Cython-based libraries like lxml are incompatible with simple mod_wsgi configurations. In our case, this manifested as a single-thread self-deadlock in the Python Global Interpreter Lock whenever we tried to use our application at all."
The simple workaround given on that page (forcing a single WSGI application into the global application group) resolves the problem for me. I don't know anything about the mod_wsgi daemon processes mentioned as an alternative, so I didn't try them. But I think this is pretty strong evidence that the problem is a conflict between two non-Beautiful-Soup pieces of software. As such, I'm marking this bug WONTFIX.
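For completeness, that workaround expressed as mod_wsgi configuration (directive names are real mod_wsgi directives; the paths and process-group name here are illustrative):

```apache
# Illustrative mod_wsgi configuration; paths and names are placeholders.
WSGIDaemonProcess example processes=2 threads=15
WSGIProcessGroup example
# Run the application in the main Python interpreter rather than a
# sub-interpreter, avoiding the lxml/Cython sub-interpreter deadlock:
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias / /var/www/example/script.wsgi
```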