Comment 0 for bug 2046398

Nick Young (nyou045) wrote : malformed HTML causes lxml to hang

Hi,

I was tracking down a bug in a larger Python project and have isolated it to lxml. The bug occurs with malformed HTML. I've created a simplified test script that contains just the relevant malformed HTML. When the line "child.insert(0, parent)" runs, the script hangs and one CPU core is pinned at 100%, which suggests an infinite loop. Here's the test script:

#!/usr/bin/env python3
from lxml import etree, html, cssselect

import sys

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

PARSER = etree.HTMLParser(recover=True)
select_parent = cssselect.CSSSelector("#parent")
select_child = cssselect.CSSSelector('#child')

doc = html.fromstring("""
<div id="parent">
    <div id="child">
        <div></div>
""", parser=PARSER)
print(doc)
parent = select_parent(doc)[0]
print(parent)
child = select_child(doc)[0]
print(child)
# THIS LINE HANGS
child.insert(0, parent)
print("DONE!")

Output:

Python              : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree          : (4, 9, 3, 0)
libxml used         : (2, 10, 3)
libxml compiled     : (2, 10, 3)
libxslt used        : (1, 1, 38)
libxslt compiled    : (1, 1, 38)
<Element div at 0x7ff24668f500>
<Element div at 0x7ff24668f500>
<Element div at 0x7ff24668f4c0>
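
One possible reading of the output above: `doc` and `parent` print the same address, so `parent` is the returned root and `child` is one of its descendants. Inserting `parent` into `child` would then make an element a child of its own descendant. Assuming the hang comes from that cycle (this report does not confirm it), a caller-side guard could look like the sketch below. `is_ancestor_or_self` and `safe_insert` are hypothetical helper names, not part of lxml; only `iterancestors()`, `xpath()`, `insert()` and `html.fromstring()` are lxml calls.

```python
from lxml import html


def is_ancestor_or_self(candidate, element):
    """Return True if candidate is element or one of element's ancestors."""
    if candidate is element:
        return True
    return any(ancestor is candidate for ancestor in element.iterancestors())


def safe_insert(target, index, new_child):
    """Hypothetical helper: refuse an insert that would create a cycle."""
    if is_ancestor_or_self(new_child, target):
        raise ValueError("cannot insert an element into its own descendant")
    target.insert(index, new_child)


# Same malformed fragment as in the test script, parsed with lxml.html's
# default parser instead of the explicit etree.HTMLParser(recover=True).
doc = html.fromstring("""
<div id="parent">
    <div id="child">
        <div></div>
""")
parent = doc.xpath('//*[@id="parent"]')[0]
child = doc.xpath('//*[@id="child"]')[0]

try:
    safe_insert(child, 0, parent)
    outcome = "inserted"
except ValueError as exc:
    outcome = "refused: %s" % exc
print(outcome)
```

With this guard the call is rejected before `insert()` is reached, so the script finishes instead of spinning. It only works around the problem in calling code and says nothing about where the loop sits inside lxml or libxml2.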

Cheers,
Nick