Activity log for bug #2046398

Date Who What changed Old value New value Message
2023-12-14 05:00:14 Nick Young bug added bug
2023-12-14 06:36:56 Nick Young summary malformed HTML causes lxml to hang inserting a parent into its child causes lxml to hang
2023-12-14 06:39:58 Nick Young description

Old value:

Hi,

I was tracking down a bug in a larger Python project, and have isolated it to lxml. The bug occurs with malformed HTML. I've created a simplified test script, with just the relevant malformed HTML.

When the line of code "child.insert(0, parent)" is run, the script hangs, and one CPU is pinned at 100%. This is probably caused by an infinite loop.

Here's the test script:

    #!/usr/bin/env python3
    from lxml import etree, html, cssselect
    import sys

    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

    PARSER = etree.HTMLParser(recover=True)
    select_parent = cssselect.CSSSelector("#parent")
    select_child = cssselect.CSSSelector('#child')

    doc = html.fromstring("""
    <div id="parent">
      <div id="child">
        <div></div>
    """, parser=PARSER)
    print(doc)

    parent = select_parent(doc)[0]
    print(parent)
    child = select_child(doc)[0]
    print(child)

    # THIS LINE HANGS
    child.insert(0, parent)
    print("DONE!")

output:

    Python              : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
    lxml.etree          : (4, 9, 3, 0)
    libxml used         : (2, 10, 3)
    libxml compiled     : (2, 10, 3)
    libxslt used        : (1, 1, 38)
    libxslt compiled    : (1, 1, 38)
    <Element div at 0x7ff24668f500>
    <Element div at 0x7ff24668f500>
    <Element div at 0x7ff24668f4c0>

Cheers,
Nick

New value:

Hi,

I was tracking down a bug in a larger Python project, and have isolated it to lxml. The bug originally occurred due to malformed HTML, but also occurs if I fix the HTML. I've created a simplified test script, with just the relevant HTML.

When the line of code "child.insert(0, parent)" is run, the script hangs, and one CPU is pinned at 100%. This is probably caused by an infinite loop.

Here's the test script:

    #!/usr/bin/env python3
    from lxml import etree, html, cssselect
    import sys

    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

    PARSER = etree.HTMLParser(recover=True)
    select_parent = cssselect.CSSSelector("#parent")
    select_child = cssselect.CSSSelector('#child')

    doc = html.fromstring("""
    <div id="parent">
      <div id="child">
        <div></div>
      </div>
    </div>
    """, parser=PARSER)
    print(doc)

    parent = select_parent(doc)[0]
    print(parent)
    child = select_child(doc)[0]
    print(child)

    # THIS LINE HANGS
    child.insert(0, parent)
    print("DONE!")

output:

    Python              : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
    lxml.etree          : (4, 9, 3, 0)
    libxml used         : (2, 10, 3)
    libxml compiled     : (2, 10, 3)
    libxslt used        : (1, 1, 38)
    libxslt compiled    : (1, 1, 38)
    <Element div at 0x7ff24668f500>
    <Element div at 0x7ff24668f500>
    <Element div at 0x7ff24668f4c0>

If the inner <div></div> is removed, lxml throws "ValueError: cannot append parent to itself"

Cheers,
Nick

2023-12-17 11:59:08 scoder lxml: importance Undecided Medium
2023-12-17 11:59:08 scoder lxml: status New Fix Committed
2023-12-17 11:59:08 scoder lxml: assignee scoder (scoder)
2023-12-17 12:00:26 scoder lxml: milestone 4.9.4
2023-12-21 10:28:07 scoder lxml: status Fix Committed Fix Released
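
The hang reported above is triggered by inserting an element into one of its own descendants. On an lxml version that predates the fix (milestone 4.9.4 per this log), calling code could check for that case before calling insert(). The following is a minimal sketch of such a guard, not the fix that was committed to lxml: the helper names is_ancestor_or_self and safe_insert are hypothetical, and the sketch relies only on the Element methods iterancestors() and insert().

```python
from lxml import etree


def is_ancestor_or_self(candidate, element):
    """Return True if `candidate` is `element` or one of its ancestors.

    Hypothetical helper: walks up the tree with iterancestors().
    """
    if candidate is element:
        return True
    return any(ancestor is candidate for ancestor in element.iterancestors())


def safe_insert(target, index, new_child):
    """Insert `new_child` into `target`, refusing to create a cycle.

    Hypothetical wrapper around Element.insert(): raises ValueError
    instead of letting an ancestor be inserted into its own descendant.
    """
    if is_ancestor_or_self(new_child, target):
        raise ValueError("cannot insert an element into its own descendant")
    target.insert(index, new_child)


# Same shape as the document in the report: parent > child > inner div.
parent = etree.fromstring(
    '<div id="parent"><div id="child"><div/></div></div>'
)
child = parent[0]

try:
    safe_insert(child, 0, parent)  # the operation that hung in the report
except ValueError as exc:
    print("refused:", exc)
```

Because the check runs before lxml's own insert(), the guard should behave the same way whether or not the installed lxml already contains the fix.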