2023-12-14 06:39:58 |
Nick Young |
description |
Hi,
I was tracking down a bug in a larger Python project and have isolated it to lxml. The bug occurs with malformed HTML. I've created a simplified test script containing just the relevant malformed HTML. When the line "child.insert(0, parent)" runs, the script hangs and one CPU core is pinned at 100%, which suggests an infinite loop. Here's the test script:
#!/usr/bin/env python3
from lxml import etree, html, cssselect
import sys
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
PARSER = etree.HTMLParser(recover=True)
select_parent = cssselect.CSSSelector("#parent")
select_child = cssselect.CSSSelector('#child')
doc = html.fromstring("""
<div id="parent">
<div id="child">
<div></div>
""", parser=PARSER)
print(doc)
parent = select_parent(doc)[0]
print(parent)
child = select_child(doc)[0]
print(child)
# THIS LINE HANGS
child.insert(0, parent)
print("DONE!")
Output (the script hangs before printing "DONE!"):
Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 9, 3, 0)
libxml used : (2, 10, 3)
libxml compiled : (2, 10, 3)
libxslt used : (1, 1, 38)
libxslt compiled : (1, 1, 38)
<Element div at 0x7ff24668f500>
<Element div at 0x7ff24668f500>
<Element div at 0x7ff24668f4c0>
Cheers,
Nick |
Hi,
I was tracking down a bug in a larger Python project and have isolated it to lxml. The bug originally showed up with malformed HTML, but it also occurs after I fix the HTML. I've created a simplified test script containing just the relevant HTML. When the line "child.insert(0, parent)" runs, the script hangs and one CPU core is pinned at 100%, which suggests an infinite loop. Here's the test script:
#!/usr/bin/env python3
from lxml import etree, html, cssselect
import sys
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
PARSER = etree.HTMLParser(recover=True)
select_parent = cssselect.CSSSelector("#parent")
select_child = cssselect.CSSSelector('#child')
doc = html.fromstring("""
<div id="parent">
<div id="child">
<div></div>
</div>
</div>
""", parser=PARSER)
print(doc)
parent = select_parent(doc)[0]
print(parent)
child = select_child(doc)[0]
print(child)
# THIS LINE HANGS
child.insert(0, parent)
print("DONE!")
Output (the script hangs before printing "DONE!"):
Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 9, 3, 0)
libxml used : (2, 10, 3)
libxml compiled : (2, 10, 3)
libxslt used : (1, 1, 38)
libxslt compiled : (1, 1, 38)
<Element div at 0x7ff24668f500>
<Element div at 0x7ff24668f500>
<Element div at 0x7ff24668f4c0>
If the inner <div></div> is removed, lxml instead raises "ValueError: cannot append parent to itself".
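Until this is fixed, a caller-side guard might avoid the hang. The following is only a minimal sketch, not lxml's own check: the helper names is_self_or_ancestor and safe_insert are hypothetical, and it assumes the elements expose iterancestors() and insert() the way lxml elements do. It refuses to insert an element into itself or into any of its own descendants, which is what "child.insert(0, parent)" attempts in the script above.

```python
def is_self_or_ancestor(candidate, node):
    """Return True if `candidate` is `node` itself or one of its ancestors."""
    if candidate is node:
        return True
    # iterancestors() walks from the direct parent up to the root element.
    return any(ancestor is candidate for ancestor in node.iterancestors())


def safe_insert(target, index, new_child):
    """Insert `new_child` into `target`, refusing moves that would create a cycle.

    Hypothetical helper: it performs the ancestor check in Python before
    delegating to the element's own insert().
    """
    if is_self_or_ancestor(new_child, target):
        raise ValueError(
            "cannot insert an element into itself or one of its descendants"
        )
    target.insert(index, new_child)
```

In the test script, replacing "child.insert(0, parent)" with "safe_insert(child, 0, parent)" should raise ValueError instead of hanging, because "parent" is an ancestor of "child". I have only reasoned this through against the reproduction above, not against other tree shapes.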
Cheers,
Nick |
|