tostring of single element also returns rest of whole document (when xhtml1 dtd is declared)

Bug #1970741 reported by Wolfgang Schnerring
This bug affects 6 people
Affects    Status     Importance  Assigned to  Milestone
lxml       Confirmed  High        Unassigned

Bug Description

I've discovered a strange edge case bug starting in lxml-4.7.1 (in 4.6.5 it's fine): When a document declares an xhtml1 DTD (but not other DTDs), etree.tostring() of a single element returns not only that element, but everything that comes after as well (which then is not even well-formed XML). Here's a reproduction recipe:

import lxml.etree
html = """\
<!DOCTYPE anything PUBLIC "anything" "{dtd}">
<html>
  <body>
    <div id="one">one</div>
    <div id="two">two</div>
  </body>
</html>"""
div = '<div id="one">one</div>'

for dtd in [
        'http://www.w3.org/TR/html401/strict.dtd',
        'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd',
        'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd',
        'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd']:
    doc = lxml.etree.fromstring(html.format(dtd=dtd))
    el = doc[0][0]  # first <div> inside <body>; indexing replaces the deprecated getchildren()
    text = lxml.etree.tostring(el, encoding=str)
    if text.strip() == div:
        print('PASS %s' % dtd)
    else:
        print('FAIL %s:\n%s' % (dtd, text))

With 4.7.1 and 4.8.0 this is the resulting output (4.6.5 gives 4xPASS):
PASS http://www.w3.org/TR/html401/strict.dtd
PASS http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
FAIL http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd:
<div id="one">one</div>
    <div id="two">two</div>
  </body>
</html>

FAIL http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd:
<div id="one">one</div>
    <div id="two">two</div>
  </body>
</html>
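
To make the "not even well-formed XML" point concrete, here is a minimal sketch that feeds the expected string and the FAIL output above to a strict XML parser. It uses only the standard library's xml.etree.ElementTree rather than lxml; the helper name is_well_formed is made up for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if `fragment` parses as exactly one XML element."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# What tostring() should return for the first <div>.
expected = '<div id="one">one</div>'

# What the FAIL cases above actually print: the element plus everything
# that follows it in the document, including closing tags never opened here.
observed = (
    '<div id="one">one</div>\n'
    '    <div id="two">two</div>\n'
    '  </body>\n'
    '</html>\n'
)

print(is_well_formed(expected))
print(is_well_formed(observed))
```

A check like this could serve as a guard around serialised fragments until the underlying behaviour is fixed, at the cost of one extra parse per fragment.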

Here's the detailed version info:
>>> import sys
>>> from lxml import etree
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python              : sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree          : (4, 8, 0, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used         : (2, 9, 12)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled     : (2, 9, 12)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used        : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled    : (1, 1, 34)

Merlijn Wajer (wizzup1) wrote :

I am hitting this same bug in archive.org's archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools). It indeed started happening with 4.7.x (4.6.5 is fine) and is causing all kinds of corruption. I haven't been able to find a workaround so far.
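
No workaround is confirmed anywhere in this thread. One untested idea, assuming the declared DTD really is the only trigger (as the reproduction recipe suggests), would be to drop the DOCTYPE declaration from the source text before handing it to the parser. The sketch below uses only the standard library; strip_doctype is a hypothetical helper written for this illustration, not an lxml function, and its regular expression only handles a simple DOCTYPE with an optional internal subset.

```python
import re

# Matches a leading DOCTYPE declaration, with an optional [...] internal
# subset. This is a simplification and will not cope with every legal form.
_DOCTYPE_RE = re.compile(
    r'<!DOCTYPE[^>\[]*(?:\[[^\]]*\])?\s*>\s*',
    re.IGNORECASE,
)


def strip_doctype(markup: str) -> str:
    """Return `markup` with its first DOCTYPE declaration removed."""
    return _DOCTYPE_RE.sub('', markup, count=1)


sample = (
    '<!DOCTYPE anything PUBLIC "anything" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html><body/></html>'
)
print(strip_doctype(sample))
```

The result would then be passed to lxml.etree.fromstring() as usual. Note the trade-off: without the DTD, any named entities that the DTD defines can no longer be resolved, so this is only an option for documents that do not rely on them.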

scoder (scoder) wrote :

Might be related to
https://bugs.launchpad.net/lxml/+bug/1928795
although 4.8.0 (or a local build with a suitable libxml2) should be ok.

Probably not an issue with lxml but with libxml2.

Let's see what surprises libxml2 2.9.14 has for us. :)

Merlijn Wajer (wizzup1) wrote :

I just built lxml 2.9.14 locally, and then built lxml 4.8.0 against it, and the same bug is still present; see the example files and script. (test-2.html should be OK; test.html will fail because of the DTD.)

Merlijn Wajer (wizzup1) wrote :

Can't figure out how to edit my comment but obviously I built _libxml2_ 2.9.14 locally, and then lxml 4.8.0.

Merlijn Wajer (wizzup1) wrote :

So I think I can confirm that libxml 2.9.14 doesn't solve the problem for me.

Merlijn Wajer (wizzup1) wrote :

Any update on this? This is getting particularly critical since lxml 4.6.5 doesn't work anymore on the latest versions of Python.

Jim (wisnij) wrote :

This is still happening for me with lxml 4.9.2 and libxml2 2.9.14: https://<email address hidden>/thread/CELGW3BXCYIZFGCLIBNBBTT5HY5EYCRC/

scoder (scoder) wrote :

I can still reproduce this with libxml2 2.11.5.

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed
scoder (scoder)
Changed in lxml:
importance: Medium → High