bs4 crashes on certain pages when using lxml as a parser

Bug #1558080 reported by b0r3d0m
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

The following code randomly crashes the Python interpreter (both 2.7.6 and 2.7.11) on Windows 8:

from bs4 import BeautifulSoup

with open('page.html', 'r') as f:
    content = f.read()
    for i in xrange(1000000000):
        print(i)
        soup = BeautifulSoup(content, 'lxml')  # the 'html.parser' and 'html5lib' parsers work perfectly

As I stated in the summary of this bug, the crash happens only on certain pages, so I have attached an example of such a file to this report.

==================================

Nothing additional is written to stdout or stderr, so the only information I have at the moment is the standard error info from the corresponding Windows dialog (note that the Fault Module Name is "etree.pyd"):

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: python.exe
  Application Version: 0.0.0.0
  Application Timestamp: 56634a05
  Fault Module Name: etree.pyd
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp: 56470805
  Exception Code: c0000005
  Exception Offset: 0011e3fa
  OS Version: 6.2.9200.2.0.0.768.100
  Locale ID: 1033
  Additional Information 1: 5861
  Additional Information 2: 5861822e1919d7c014bbb064c64908b2
  Additional Information 3: dac6
  Additional Information 4: dac6c2650fa14dd558bd9f448e23afd1

Read our privacy statement online:
  http://go.microsoft.com/fwlink/?linkid=190175

If the online privacy statement is not available, please read our privacy statement offline:
  C:\Windows\system32\en-US\erofflps.txt

==================================

Moreover, I noticed that the following code doesn't crash at all:

from lxml import etree

with open('page.html', 'r') as f:
    content = f.read()
    for i in xrange(1000000000):
        print(i)
        tree = etree.HTML(content)

==================================

lxml versions -- 3.4.4 and 3.5.0
BeautifulSoup version -- 4.4.1 (the latest one at the time of writing)

Tags: crash
b0r3d0m (nikita-trophimov) wrote :
description: updated
john (johnandersenpdx) wrote :

Could this be because you opened the file with 'r' instead of 'rb'? This is Python 2 and the content is Unicode, so there could be issues from that. Just a possibility.
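To make the 'r' versus 'rb' suggestion concrete, here is a minimal sketch of reading the page in binary mode and handing the raw bytes to Beautiful Soup. The helper names `read_markup_bytes` and `parse_with_lxml` are hypothetical and do not come from the report. Binary mode avoids any newline translation or decoding before the parser sees the data; whether that has any effect on the crash is not established in this thread.

```python
import io


def read_markup_bytes(path):
    """Return the file's raw bytes, with no newline translation or decoding."""
    with io.open(path, 'rb') as f:
        return f.read()


def parse_with_lxml(path):
    """Parse the file at `path` with Beautiful Soup's lxml backend (hypothetical helper)."""
    # Imported lazily so that read_markup_bytes stays usable without bs4 installed.
    from bs4 import BeautifulSoup
    # Passing bytes lets Beautiful Soup detect the encoding itself.
    return BeautifulSoup(read_markup_bytes(path), 'lxml')
```

With the attached file, the reproduction loop would then call `parse_with_lxml('page.html')`, or read the bytes once and construct `BeautifulSoup(content, 'lxml')` repeatedly as in the original script.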

Leonard Richardson (leonardr) wrote :

Unfortunately I don't have a Windows machine and I can't duplicate this on Linux. It looks like a problem with lxml. There are a number of open issues where Beautiful Soup seems to exercise buggy behavior in lxml but only on platforms other than Linux:

bug 1520000 - "Inconsistent Results During Nested Find Operation"
bug 1438111 - "FindAll method returns doubled results"
bug 1417011 - "lxml parser behaves incorrectly in Windows"
bug 1558080 - "bs4 crashes on certain pages when using lxml as a parser" (this bug)
bug 1471485 - "XML+Unicode: Certain input strings fail to fully parse"

Changed in beautifulsoup:
status: New → Incomplete
Isaac Muse (facelessuser) wrote :

I ran the above test on Windows for about 10,000 iterations and then gave up, as I don't have time to run it for a billion iterations. If this issue existed at one time, it most likely doesn't exist now, or it is not practical for Beautiful Soup to hunt down, as it is most likely not related to Beautiful Soup itself. I would argue it was a bug in either Python or the lxml feed API, since the lxml test above does not directly test chunking the way Beautiful Soup does.

As this test is years old, I would consider closing the bug and seeing whether it resurfaces in a form that is practical to test.
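The distinction between `etree.HTML(content)` and lxml's feed interface can be sketched as follows. This is only an illustration under stated assumptions: the helper names `chunks` and `feed_in_chunks` and the chunk size of 512 bytes are made up for the example, and the sketch does not reproduce Beautiful Soup's actual tree builder, which also installs its own parser target. It only shows how a test could drive `HTMLParser.feed()` and `close()` directly instead of parsing the whole document in one call.

```python
def chunks(data, size):
    """Yield consecutive slices of `data`, each at most `size` items long."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def feed_in_chunks(content, chunk_size=512):
    """Parse `content` through lxml's feed interface, one chunk at a time.

    chunk_size is an assumed value chosen for illustration only.
    """
    # Imported lazily so that chunks() stays usable without lxml installed.
    from lxml import etree

    parser = etree.HTMLParser()
    for piece in chunks(content, chunk_size):
        parser.feed(piece)
    # close() finishes parsing and returns the root element.
    return parser.close()
```

A reproduction attempt along these lines would read `page.html` once in binary mode and call `feed_in_chunks(content)` inside the same loop as the original script.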

Leonard Richardson (leonardr) wrote :

Thanks for looking into this, Isaac.

Changed in beautifulsoup:
status: Incomplete → Won't Fix