lxml

Activity log for bug #1558076

Date	Who	What changed	Old value	New value	Message
2016-03-16 13:33:54	b0r3d0m	bug			added bug
2016-03-16 13:33:54	b0r3d0m	attachment added		The page that forces lxml crash https://bugs.launchpad.net/bugs/1558076/+attachment/4601120/+files/page.html
2016-03-16 13:42:43	b0r3d0m	description	The following code randomly crashes the Python interpreter (both 2.7.6 and 2.7.11 versions) on Windows 8:
    from bs4 import BeautifulSoup
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        soup = BeautifulSoup(content, 'lxml')  # 'html.parser' and 'html5lib' parsers work perfectly
As I stated in the summary of this bug, the crash happens only on certain pages, so I attached an example of such a file to this report.
==================================
There's no additional output in stdout / stderr, so the only information I have at the moment is the standard error info from the corresponding Windows dialog (note that the Fault Module Name is "lxml.etree.pyd"):
Problem signature:
    Problem Event Name: APPCRASH
    Application Name: emls_aggregator_helper.exe
    Application Version: 0.0.0.0
    Application Timestamp: 514e2c2e
    Fault Module Name: lxml.etree.pyd
    Fault Module Version: 0.0.0.0
    Fault Module Timestamp: 553ba758
    Exception Code: c0000005
    Exception Offset: 000ed4aa
    OS Version: 6.2.9200.2.0.0.768.100
    Locale ID: 1033
    Additional Information 1: 5861
    Additional Information 2: 5861822e1919d7c014bbb064c64908b2
    Additional Information 3: dac6
    Additional Information 4: dac6c2650fa14dd558bd9f448e23afd1
Read our privacy statement online: http://go.microsoft.com/fwlink/?linkid=190175
If the online privacy statement is not available, please read our privacy statement offline: C:\Windows\system32\en-US\erofflps.txt
==================================
Moreover, I noticed that the following code doesn't crash at all:
    from lxml import etree
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        tree = etree.HTML(content)
I know that there must be some error in the BeautifulSoup library then, but I think that incorrect usage of lxml should not crash the interpreter anyway.
==================================
lxml versions -- 3.4.4 and 3.5.0
BeautifulSoup version -- 4.4.1 (the latest one at the time of writing)	The following code randomly crashes the Python interpreter (both 2.7.6 and 2.7.11 versions) on Windows 8:
    from bs4 import BeautifulSoup
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        soup = BeautifulSoup(content, 'lxml')  # 'html.parser' and 'html5lib' parsers work perfectly
As I stated in the summary of this bug, the crash happens only on certain pages, so I attached an example of such a file to this report.
==================================
There's no additional output in stdout / stderr, so the only information I have at the moment is the standard error info from the corresponding Windows dialog (note that the Fault Module Name is "lxml.etree.pyd"):
Problem signature:
    Problem Event Name: APPCRASH
    Application Name: python.exe
    Application Version: 0.0.0.0
    Application Timestamp: 514e2c2e
    Fault Module Name: lxml.etree.pyd
    Fault Module Version: 0.0.0.0
    Fault Module Timestamp: 553ba758
    Exception Code: c0000005
    Exception Offset: 000ed4aa
    OS Version: 6.2.9200.2.0.0.768.100
    Locale ID: 1033
    Additional Information 1: 5861
    Additional Information 2: 5861822e1919d7c014bbb064c64908b2
    Additional Information 3: dac6
    Additional Information 4: dac6c2650fa14dd558bd9f448e23afd1
Read our privacy statement online: http://go.microsoft.com/fwlink/?linkid=190175
If the online privacy statement is not available, please read our privacy statement offline: C:\Windows\system32\en-US\erofflps.txt
==================================
Moreover, I noticed that the following code doesn't crash at all:
    from lxml import etree
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        tree = etree.HTML(content)
I know that there must be some error in the BeautifulSoup library then, but I think that incorrect usage of lxml should not crash the interpreter anyway.
==================================
lxml versions -- 3.4.4 and 3.5.0
BeautifulSoup version -- 4.4.1 (the latest one at the time of writing)
2016-03-16 13:43:46	b0r3d0m	description	The following code randomly crashes the Python interpreter (both 2.7.6 and 2.7.11 versions) on Windows 8:
    from bs4 import BeautifulSoup
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        soup = BeautifulSoup(content, 'lxml')  # 'html.parser' and 'html5lib' parsers work perfectly
As I stated in the summary of this bug, the crash happens only on certain pages, so I attached an example of such a file to this report.
==================================
There's no additional output in stdout / stderr, so the only information I have at the moment is the standard error info from the corresponding Windows dialog (note that the Fault Module Name is "lxml.etree.pyd"):
Problem signature:
    Problem Event Name: APPCRASH
    Application Name: python.exe
    Application Version: 0.0.0.0
    Application Timestamp: 514e2c2e
    Fault Module Name: lxml.etree.pyd
    Fault Module Version: 0.0.0.0
    Fault Module Timestamp: 553ba758
    Exception Code: c0000005
    Exception Offset: 000ed4aa
    OS Version: 6.2.9200.2.0.0.768.100
    Locale ID: 1033
    Additional Information 1: 5861
    Additional Information 2: 5861822e1919d7c014bbb064c64908b2
    Additional Information 3: dac6
    Additional Information 4: dac6c2650fa14dd558bd9f448e23afd1
Read our privacy statement online: http://go.microsoft.com/fwlink/?linkid=190175
If the online privacy statement is not available, please read our privacy statement offline: C:\Windows\system32\en-US\erofflps.txt
==================================
Moreover, I noticed that the following code doesn't crash at all:
    from lxml import etree
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        tree = etree.HTML(content)
I know that there must be some error in the BeautifulSoup library then, but I think that incorrect usage of lxml should not crash the interpreter anyway.
==================================
lxml versions -- 3.4.4 and 3.5.0
BeautifulSoup version -- 4.4.1 (the latest one at the time of writing)	The following code randomly crashes the Python interpreter (both 2.7.6 and 2.7.11 versions) on Windows 8:
    from bs4 import BeautifulSoup
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        soup = BeautifulSoup(content, 'lxml')  # 'html.parser' and 'html5lib' parsers work perfectly
As I stated in the summary of this bug, the crash happens only on certain pages, so I attached an example of such a file to this report.
==================================
There's no additional output in stdout / stderr, so the only information I have at the moment is the standard error info from the corresponding Windows dialog (note that the Fault Module Name is "etree.pyd"):
Problem signature:
    Problem Event Name: APPCRASH
    Application Name: python.exe
    Application Version: 0.0.0.0
    Application Timestamp: 56634a05
    Fault Module Name: etree.pyd
    Fault Module Version: 0.0.0.0
    Fault Module Timestamp: 56470805
    Exception Code: c0000005
    Exception Offset: 0011e3fa
    OS Version: 6.2.9200.2.0.0.768.100
    Locale ID: 1033
    Additional Information 1: 5861
    Additional Information 2: 5861822e1919d7c014bbb064c64908b2
    Additional Information 3: dac6
    Additional Information 4: dac6c2650fa14dd558bd9f448e23afd1
Read our privacy statement online: http://go.microsoft.com/fwlink/?linkid=190175
If the online privacy statement is not available, please read our privacy statement offline: C:\Windows\system32\en-US\erofflps.txt
==================================
Moreover, I noticed that the following code doesn't crash at all:
    from lxml import etree
    with open('page.html', 'r') as f:
        content = f.read()
    for i in xrange(1000000000):
        print(i)
        tree = etree.HTML(content)
I know that there must be some error in the BeautifulSoup library then, but I think that incorrect usage of lxml should not crash the interpreter anyway.
==================================
lxml versions -- 3.4.4 and 3.5.0
BeautifulSoup version -- 4.4.1 (the latest one at the time of writing)
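
The report notes that the crash leaves nothing on stdout / stderr and takes the whole interpreter down. One way to keep investigating such a crash is to run the parsing loop in a child process and inspect its exit status. The sketch below is not from the report: it assumes Python 3.7 or later (the report used 2.7, whose standard library has no faulthandler module), and the names run_isolated, describe_exit, and REPRO_SNIPPET are hypothetical. The only value taken from the report is the exception code c0000005, which is the Windows access violation status.

```python
import subprocess
import sys

# Windows NTSTATUS value shown as "Exception Code: c0000005" in the report.
STATUS_ACCESS_VIOLATION = 0xC0000005

# Hypothetical Python 3 port of the reporter's loop. It is only a string here;
# running it needs bs4 and lxml installed and the attached page.html present.
REPRO_SNIPPET = """
from bs4 import BeautifulSoup

with open('page.html', 'r') as f:
    content = f.read()

for i in range(100000):
    soup = BeautifulSoup(content, 'lxml')
"""


def run_isolated(code, timeout=None):
    """Run `code` in a fresh interpreter and return (returncode, stderr).

    The child is started with -X faulthandler so that a fatal fault prints
    a Python traceback to stderr before the process dies.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe_exit(returncode):
    """Turn a child's exit status into a short human-readable label."""
    if returncode == 0:
        return "clean exit"
    # Mask to 32 bits so signed and unsigned forms of the status both match.
    if returncode & 0xFFFFFFFF == STATUS_ACCESS_VIOLATION:
        return "access violation (0xC0000005)"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return "killed by signal %d" % -returncode
    return "exit code %d" % returncode
```

With bs4, lxml, and page.html in place, calling run_isolated(REPRO_SNIPPET) and passing the returned code to describe_exit would show whether the child died from an access violation, while the parent process survives and can print the child's stderr.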