Beautiful Soup hangs when using the lxml parser under Apache mod_wsgi

Reported by RAM on 2012-03-06
This bug affects 2 people
Affects: Beautiful Soup | Importance: Undecided | Assigned to: Unassigned

Bug Description

I just found this problem with bs4 4.0.0b10 and the lxml parser (running in a virtualenv) under Python 2.7 and WSGI. Say you have the following WSGI application script:

#! /usr/bin/env python
import os, sys, site

site.addsitedir('/path/to/virtual/env/lib/python2.7/site-packages/')

root_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
project_dir = os.path.abspath(os.path.join(os.path.dirname(__file__)))
sys.path.append(root_dir)
sys.path.append(project_dir)

os.environ['DJANGO_SETTINGS_MODULE'] = 'project.settings'

#from django.core.handlers.wsgi import WSGIHandler
#application = WSGIHandler()

from bs4 import BeautifulSoup

def application(environ, start_response):
    status = '200 OK'
    html = '<html><head><title>hi</title></head><body>hi again</body></html>'
    soup = BeautifulSoup(html, 'lxml')
    output = "if you see this then it's ok"

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

Then, on each web request that goes through this handler, BeautifulSoup never responds (or takes very long, i.e. > 1 min) when parsing even the simplest of HTML strings. The exact part that hangs is this line in bs4/builder/_lxml.py:

    def feed(self, markup):
        self.parser.feed(markup) <-- this one
        self.parser.close()

which in my case corresponds to a lxml.etree.HTMLParser object.

I considered possible conflicts with other dependencies in my project, but that's not the case, since the handler script shared above loads nothing beyond os, sys, site and bs4. I also tried to replicate the problem without calling bs4, with this WSGI handler:

#! /usr/bin/env python
import os, sys, site

site.addsitedir('/path/to/virtual/env/lib/python2.7/site-packages/')

root_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))
project_dir = os.path.abspath(os.path.join(os.path.dirname(__file__)))
sys.path.append(root_dir)
sys.path.append(project_dir)

os.environ['DJANGO_SETTINGS_MODULE'] = 'project.settings'

#from django.core.handlers.wsgi import WSGIHandler
#application = WSGIHandler()

from lxml import etree
from bs4 import BeautifulSoup

def application(environ, start_response):
    status = '200 OK'
    html = '<html><head><title>hi</title></head><body>hi again</body></html>'
    ob = {}
    parser = etree.HTMLParser(target=ob, strip_cdata=False, recover=True)
    parser.feed(html)
    output = 'it works muahaha!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

And it works! I also imported etree first and then bs4, hoping to show that bs4 somehow overrides lxml's etree on import, but no luck there either: the direct etree parsing still worked perfectly.

So the problem seems to be with the lxml.etree.HTMLParser instance created by bs4. Other parsers like html.parser work perfectly (but they're not an option for my project).
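For reference, the pure-Python parser that the reporter says does work can be exercised without bs4 at all. A minimal sketch (using the Python 3 spelling of the stdlib import; the TitleGrabber class is illustrative, not part of bs4) that pulls the title out of the same test document:

```python
from html.parser import HTMLParser  # the same engine behind bs4's "html.parser" builder


class TitleGrabber(HTMLParser):
    """Collect the text content of <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)


html = '<html><head><title>hi</title></head><body>hi again</body></html>'
grabber = TitleGrabber()
grabber.feed(html)
grabber.close()
print(grabber.titles)  # ['hi']
```

Because this parser is plain Python with no C-extension state, it sidesteps the sub-interpreter issue entirely, at the cost of lxml's speed and leniency.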

BTW, the error reported by apache in the error log is the typical

[Tue Mar 06 18:11:25 2012] [error] [client 201.214.88.158] Premature end of script headers: script.wsgi

usually related to problems with the Expat library, as noted here: http://code.google.com/p/modwsgi/wiki/IssuesWithExpatLibrary. But I followed the steps suggested in that doc and found that there were no problems with pyexpat.

Apache version is

Server version: Apache/2.2.17 (Ubuntu)

under Ubuntu

Linux project 2.6.38-8-virtual #42-Ubuntu SMP Mon Apr 11 04:06:34 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Outside WSGI there's no problem. I tried it in the Django shell, keeping Django around in case the problem was with it rather than with WSGI:

$ python manage.py shell
>>> from bs4 import BeautifulSoup
>>> html = '<html><head><title>hi</title></head><body>hi again</body></html>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.find_all('title')
[<title>hi</title>]

(The same happens by simply calling python, with no Django involved.)

It's the same HTML snippet I used to test under WSGI, where it fails with any HTML document, big or small.

Leonard Richardson (leonardr) wrote :

I'm able to reproduce. Some more information:

* The presence of a virtual environment doesn't matter.
* The reference Python WSGI environment does not have the problem. Apache mod_wsgi does.

I believe this is an instance of the interaction between mod_wsgi and Cython described here:

https://techknowhow.library.emory.edu/blogs/branker/2010/07/30/django-lxml-wsgi-and-python-sub-interpreter-magic

"Unfortunately, given that Cython-based libraries are incompatible with sub-interpreters, and given that mod_wsgi uses sub-interpreters, it follows logically that Cython-based libraries like lxml are incompatible with simple mod_wsgi configurations. In our case, this manifested as a single-thread self-deadlock in the Python Global Interpreter Lock whenever we tried to use our application at all. "

The simple workaround given in that page (forcing a single WSGI application into the global application group) resolves the problem for me. I don't know anything about the mod_wsgi daemon processes mentioned as an alternative, so I didn't try them. But I think this is pretty strong evidence that the problem is a conflict between two non-Beautiful-Soup pieces of software. As such, I'm marking this bug WONTFIX.
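For readers hitting the same hang, the workaround from that page amounts to one extra directive in the vhost. A sketch (the process/group names are placeholders; only the directives themselves are mod_wsgi's) that forces the application into the main interpreter, the only one in which Cython-based extensions like lxml behave:

```apache
# Force the app out of a sub-interpreter and into the first (main) Python
# interpreter, where Cython-based extensions such as lxml are safe.
WSGIDaemonProcess example.site processes=2 threads=25
WSGIProcessGroup example.site
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias / /path/to/project/script.wsgi
```

WSGIApplicationGroup %{GLOBAL} is the key line; the daemon-process directives are shown only for context.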

summary: - bs4 4.0.0b10 with the lxml parser under python2.7 fails inside wsgi
- environments
+ Beautiful Soup hangs when using the lxml parser under Apache mod_wsgi
Changed in beautifulsoup:
status: New → Won't Fix
RAM (ralamosm) wrote :

Yes, this is very strong evidence that the problem I found yesterday is this one; however, I'm still wondering whether it's exactly the same problem.

You see, I'm running my app in a mod_wsgi daemon process (virtualhost conf at the end, in case it's useful for you to test as well) and the bug keeps happening, so the workaround they suggest doesn't apply. On the other hand, as you can see in the report, my problem only happens when using lxml through BeautifulSoup; if I call lxml.etree.HTMLParser directly from my WSGI handler it works perfectly (maybe precisely because I'm running inside a mod_wsgi daemon process).

Any idea why this might be happening? Maybe bs4 somehow triggers these sub-interpreter conflicts, even inside an isolated mod_wsgi daemon process...

Anyway, thanks a lot again! Going back to BeautifulSoup 3 solves the problem (obviously, since lxml is no longer involved), although I'd have loved to use lxml as the parser with bs4.

--- my vhost ---
<VirtualHost 127.0.0.1:80>
        ServerAdmin webmaster@project
        ServerName subdomain.project
        ServerAlias www.subdomain.project

        WSGIDaemonProcess subdomain.project processes=2 threads=25
        WSGIProcessGroup subdomain.project

        WSGIScriptAlias / /path/to/project/script.wsgi
        Alias /static/ /path/to/project/static/

        <Directory /path/to/project>
                Options Indexes FollowSymLinks MultiViews ExecCGI
                AddHandler cgi-script .cgi
                AllowOverride None

                AuthType Basic
                AuthName "Authorization Realm"
                AuthUserFile /path/to/pass
                Require valid-user

                Order allow,deny
                allow from all
        </Directory>

        ErrorLog ${APACHE_LOG_DIR}/error_subdomain.project.log

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel warn

        CustomLog ${APACHE_LOG_DIR}/access_subdomain.project.log combined
</VirtualHost>

Leonard Richardson (leonardr) wrote :

Beautiful Soup invokes the constructor HTMLParser(target=self, strip_cdata=False), with "self" being the LXMLTreeBuilder object. I have three hypotheses:

1. lxml runs a different code path if it's sending events into a custom target object than if it was creating a tree on its own, and the lock only happens under this other code path.

2. lxml always runs the same code path, but the default target object is written in Cython. When Beautiful Soup specifies a Python target object, execution switches rapidly back and forth between Cython and Python code, creating lots of opportunities for things to go wrong.

3. (unlikely) There's something really awful about strip_cdata=False, and if you remove that it will work.
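Hypotheses 1 and 2 both hinge on lxml's event-driven target interface: when HTMLParser is constructed with target=obj, lxml calls obj.start(), obj.data(), obj.end() and obj.close() instead of building a tree itself, so every parse event crosses the Cython/Python boundary. A stdlib-only sketch of that callback protocol (the EventLog class and the hand-driven events are illustrative, not lxml's or bs4's actual code):

```python
class EventLog(object):
    """Minimal parser target implementing the callback interface lxml
    expects, mirroring the role bs4's LXMLTreeBuilder plays as a target."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrs):
        self.events.append(('start', tag))

    def end(self, tag):
        self.events.append(('end', tag))

    def data(self, text):
        self.events.append(('data', text))

    def close(self):
        # lxml's parser.close() returns whatever the target's close() returns
        return self.events


# Simulate the event stream a parser would emit for '<title>hi</title>':
target = EventLog()
target.start('title', {})
target.data('hi')
target.end('title')
print(target.close())  # [('start', 'title'), ('data', 'hi'), ('end', 'title')]
```

With the default (Cython-implemented) tree-building target, none of these per-event transitions into Python code happen, which is what makes hypotheses 1 and 2 plausible.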

Jeewon Jang (jeewonjang) wrote :

We were having the same problem using BeautifulSoup 4 with mod_wsgi. But we didn't have the option of staying on BeautifulSoup 3, because of the functionality added in version 4 around how it handles empty comments in the HTML. But when I tried detecting mod_wsgi and, in that case, using BeautifulSoup with the html5lib parser, it worked!

Here's the diff:

(pynliner/__init__.py -- starting at line 107)

     def _get_soup(self):
         """Convert source string to BeautifulSoup object. Sets it to self.soup.
+
+        If using mod_wsgi, use html5 parsing to prevent BeautifulSoup incompatibility.
         """
-        self.soup = BeautifulSoup(self.source_string)
+        # Check if mod_wsgi is running - see http://code.google.com/p/modwsgi/wiki/TipsAndTricks
+        try:
+            from mod_wsgi import version
+            self.soup = BeautifulSoup(self.source_string, "html5lib")
+        except ImportError:
+            self.soup = BeautifulSoup(self.source_string)

     def _get_styles(self):
This way BeautifulSoup can still handle the special case when mod_wsgi is present, and avoid the problematic behavior with lxml.

Hope this helps!
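The detection trick in the diff above can be distilled into a small standalone helper. A sketch (pick_parser is a hypothetical name, not part of bs4 or pynliner; it relies on the fact that the mod_wsgi module is only importable inside mod_wsgi):

```python
def pick_parser():
    """Return the bs4 parser name to use: html5lib under mod_wsgi,
    lxml everywhere else. Hypothetical helper, not part of bs4."""
    try:
        # The mod_wsgi module only exists inside the mod_wsgi environment;
        # see http://code.google.com/p/modwsgi/wiki/TipsAndTricks
        import mod_wsgi  # noqa: F401
        return "html5lib"
    except ImportError:
        return "lxml"


print(pick_parser())
```

Callers would then write BeautifulSoup(markup, pick_parser()), keeping lxml's speed everywhere the hang cannot occur.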
