lxml crashes on certain pages

Bug #1558076 reported by b0r3d0m
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

The following code randomly crashes the Python interpreter (both 2.7.6 and 2.7.11) on Windows 8:

from bs4 import BeautifulSoup

with open('page.html', 'r') as f:
    content = f.read()
    for i in xrange(1000000000):
        print(i)
        soup = BeautifulSoup(content, 'lxml')  # the 'html.parser' and 'html5lib' parsers work fine

As stated in the summary of this bug, the crash happens only on certain pages, so I have attached an example of such a file to this report.
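
Because the crash is random, it may help to measure how often it happens by running the reproduction in fresh interpreter processes and tallying their exit codes. The following is only a sketch of that idea: `count_exit_codes` is a hypothetical helper that is not part of lxml or BeautifulSoup, and the snippet passed to it is assumed to run a bounded number of parsing iterations on the attached page.

```python
import subprocess
import sys


def count_exit_codes(snippet, runs):
    """Run `snippet` in `runs` fresh interpreters and tally the exit codes.

    Hypothetical helper for illustration only. A child that is killed by a
    native crash is expected to report a nonzero exit code, while a clean
    run reports 0.
    """
    tally = {}
    for _ in range(runs):
        # Each child is a separate process, so a crash in one run
        # does not take down the harness itself.
        code = subprocess.call([sys.executable, "-c", snippet])
        tally[code] = tally.get(code, 0) + 1
    return tally
```

For example, `count_exit_codes(repro_source, 20)`, where `repro_source` is a string holding a bounded version of the loop above, would return a dictionary mapping each observed exit code to the number of runs that ended with it.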

==================================

Nothing else is written to stdout or stderr, so the only information I have at the moment is the standard error report from the Windows crash dialog (note that the Fault Module Name is "etree.pyd"):

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: python.exe
  Application Version: 0.0.0.0
  Application Timestamp: 56634a05
  Fault Module Name: etree.pyd
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp: 56470805
  Exception Code: c0000005
  Exception Offset: 0011e3fa
  OS Version: 6.2.9200.2.0.0.768.100
  Locale ID: 1033
  Additional Information 1: 5861
  Additional Information 2: 5861822e1919d7c014bbb064c64908b2
  Additional Information 3: dac6
  Additional Information 4: dac6c2650fa14dd558bd9f448e23afd1
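
Exception code c0000005 is the Windows access-violation status, so the crash happens in native code and Python has no chance to print a traceback. One way to get more detail might be the `faulthandler` module, which is in the standard library since Python 3.3 and is available as a backport of the same name on PyPI for Python 2.7. The sketch below assumes that module is importable; `run_with_crash_traceback`, its parameters, and the log file name are hypothetical names chosen for illustration.

```python
import faulthandler


def run_with_crash_traceback(parse, content, iterations,
                             log_path="crash_traceback.log"):
    """Call parse(content) repeatedly while faulthandler writes to log_path.

    Hypothetical helper: if the interpreter dies on a fatal error,
    faulthandler may dump the Python-level traceback of the crashing
    call into the log file before the process exits.
    """
    with open(log_path, "w") as log:
        # faulthandler writes to this file's descriptor, so the file
        # has to stay open for as long as the handler is enabled.
        faulthandler.enable(file=log)
        try:
            for _ in range(iterations):
                parse(content)
        finally:
            faulthandler.disable()
    return log_path
```

With the code from this report, `parse` could be something like `lambda html: BeautifulSoup(html, 'lxml')`. If the log stays empty after a crash, the fault was not one that faulthandler could intercept on that platform.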


==================================

Moreover, I noticed that the following code, which calls lxml directly, doesn't crash at all:

from lxml import etree

with open('page.html', 'r') as f:
    content = f.read()
    for i in xrange(1000000000):
        print(i)
        tree = etree.HTML(content)

This suggests that the error lies in the BeautifulSoup library, but I think that even incorrect usage of lxml should not crash the interpreter.

==================================

lxml versions -- 3.4.4 and 3.5.0
BeautifulSoup version -- 4.4.1 (the latest one at the time of writing)

Tags: crash