tostring of single element also returns rest of whole document (when xhtml1 dtd is declared)
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
lxml | Confirmed | High | Unassigned | |
Bug Description
I've discovered a strange edge case bug starting in lxml-4.7.1 (in 4.6.5 it's fine): When a document declares an xhtml1 DTD (but not other DTDs), etree.tostring() of a single element returns not only that element, but everything that comes after as well (which then is not even well-formed XML). Here's a reproduction recipe:
import lxml.etree
html = """\
<!DOCTYPE anything PUBLIC "anything" "{dtd}">
<html>
<body>
<div id="one">one</div>
<div id="two">two</div>
</body>
</html>"""
div = '<div id="one">one</div>'
for dtd in [
'http://
'http://
'http://
'http://
doc = lxml.etree.
el = doc.getchildren
text = lxml.etree.
if text.strip() == div:
    print('PASS %s' % dtd)
else:
    print('FAIL %s:\n%s' % (dtd, text))
With 4.7.1 and 4.8.0 this is the resulting output (4.6.5 gives PASS for all four DTDs):
PASS http://
PASS http://
FAIL http://
<div id="one">one</div>
<div id="two">two</div>
</body>
</html>
FAIL http://
<div id="one">one</div>
<div id="two">two</div>
</body>
</html>
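The FAIL output above is not well-formed, so the corruption can be detected without comparing against a known-good string. Here is a minimal sketch of such a check using only the standard library; `is_single_element` is a hypothetical helper name, not part of lxml, and the `bad` string is copied from the FAIL output above.

```python
import xml.etree.ElementTree as ET


def is_single_element(serialized):
    """Return True if `serialized` parses as exactly one well-formed element."""
    try:
        ET.fromstring(serialized)
    except ET.ParseError:
        # Trailing siblings or stray closing tags make the parse fail.
        return False
    return True


good = '<div id="one">one</div>'
bad = '<div id="one">one</div>\n<div id="two">two</div>\n</body>\n</html>'

print(is_single_element(good))  # True
print(is_single_element(bad))   # False
```

The helper accepts both `str` and `bytes`, so the result of `lxml.etree.tostring()` could be passed to it directly.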
Here's the detailed version info:
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (4, 8, 0, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 9, 12)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 9, 12)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 34)
Changed in lxml: importance: Medium → High
I am hitting this same bug in archive.org's archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools). It did indeed start happening with 4.7.x (4.6.5 is fine), and it causes all kinds of corruption. I haven't been able to find a workaround so far.