Serializing without specified encoding corrupts document

Bug #2008639 reported by Bob Kline
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none shown)

Bug Description

I recently ran a global transformation job on 1,042 XML documents using lxml, and on inspection I noticed that three of the documents had been corrupted, each losing a segment of a text node in a part of the document completely unrelated to the transformation. Upgrading to the latest lxml using pip did not eliminate the corruption. I have created a repro case using one of the three documents.

The environment (on Windows Server 2016 Standard):

Python : sys.version_info(major=3, minor=10, micro=1, releaselevel='final', serial=0)
lxml.etree : (4, 9, 2, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

The repro script:

#!/usr/bin/env python3

from sys import version_info
from lxml import etree

ENVIRONMENT = (
    ("Python", version_info),
    ("lxml.etree", etree.LXML_VERSION),
    ("libxml used", etree.LIBXML_VERSION),
    ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
    ("libxslt used", etree.LIBXSLT_VERSION),
    ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
)

for label, value in ENVIRONMENT:
    print(f"{label:<20s}: {value}")

root = etree.parse("original.xml").getroot()
for node in root.findall("AltTitle"):
    if node.get("TitleType") == "Browser":
        title = "".join(node.itertext("*")).strip()
        title += " (PDQ\xae)"
        node.text = title
with open("corrupted.xml", "wb") as fp:
    fp.write(etree.tostring(root))

### END OF REPRO SCRIPT ###

I will attach the script, with the input XML document and the corrupted result. After running the script, examining the serialization of the transformed file will show that

... en los niños y adolescentes sobrevivientes de cáncer, consultar el sumario del PDQ ...

on line 64 of the original file has become

... en los ni&#241;os y adolescentes sobrevivientes de c&#225; ...

on that same line of the corrupted output document, chopping from the middle of the word "cáncer" through "PDQ" and dropping that fragment on the floor.

The corruption is avoided if encoding="utf-8" is specified in the invocation of tostring(), which writes non-ASCII characters as UTF-8 bytes instead of as numeric character references in ASCII output. Swapping in ElementTree from the Python standard library ("from xml.etree import ElementTree as etree") also eliminates the corruption (at the cost of slower processing).
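
A minimal sketch of the workaround described above, combined with a round-trip check that would catch this kind of silent truncation. The helper name serialize_checked and the sample element names (Summary, Para) are hypothetical, not part of the report's attachment. The standard-library fallback is there only so the sketch runs where lxml is not installed.

```python
# Prefer lxml; fall back to the standard library's ElementTree,
# which the report describes as unaffected.
try:
    from lxml import etree
except ImportError:
    from xml.etree import ElementTree as etree


def serialize_checked(root):
    """Serialize root as UTF-8 and verify no text was lost (hypothetical helper)."""
    # The explicit encoding is the workaround from the report.
    data = etree.tostring(root, encoding="utf-8")
    # Re-parse the output and compare the concatenated text content.
    expected = "".join(root.itertext())
    actual = "".join(etree.fromstring(data).itertext())
    if expected != actual:
        raise ValueError("text content changed during serialization")
    return data


# Illustrative input modeled on the affected sentence; element names are made up.
sample = etree.fromstring(
    "<Summary><Para>sobrevivientes de c\xe1ncer, "
    "consultar el sumario del PDQ</Para></Summary>"
)
output = serialize_checked(sample)
```

The check compares text content only, so it would not notice lost attributes or comments; it is meant as a cheap guard for bulk jobs, not as a full document comparison.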

Bob Kline (bob.kline) wrote :

Here is the promised attachment.

scoder (scoder) wrote :

This is most likely due to your use of libxml2 2.9.12. The official wheels of lxml 4.9.2 should actually include libxml2 2.9.14 instead. Could you try those?

Changed in lxml:
status: New → Triaged

Bob Kline (bob.kline) wrote :

So, pip is installing something which is not the official wheel? How can that happen?

scoder (scoder) wrote :

Ah, sorry, I didn't see that you're using Windows. Sadly, the official Windows wheels aren't up to date w.r.t. the libxml2 version.
So, yes, that's a bug, until we get the libraries updated for Windows.

See https://github.com/lxml/libxml2-win-binaries/pull/3
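
Until updated Windows binaries are available, a job could check the libxml2 version at startup. The sketch below assumes 2.9.14 as the known-good threshold, based only on the version named in the comments above; the thread does not establish which release first contains the fix. The function name libxml_is_suspect is hypothetical.

```python
# Assumption: 2.9.14 is the version the maintainer expects in current wheels;
# every older version is treated as suspect, which is deliberately conservative.
ASSUMED_GOOD_LIBXML = (2, 9, 14)


def libxml_is_suspect(version, known_good=ASSUMED_GOOD_LIBXML):
    """Return True if the libxml2 version tuple is older than known_good."""
    return tuple(version)[:3] < known_good


try:
    from lxml import etree
except ImportError:
    print("lxml is not installed; nothing to check")
else:
    # etree.LIBXML_VERSION is the version in use at runtime, as in the repro script.
    if libxml_is_suspect(etree.LIBXML_VERSION):
        print(f"libxml2 {etree.LIBXML_VERSION}: pass encoding='utf-8' to tostring()")
    else:
        print(f"libxml2 {etree.LIBXML_VERSION}: at or above the assumed threshold")
```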

Changed in lxml:
importance: Undecided → Medium
status: Triaged → Confirmed

Bob Kline (bob.kline) wrote :

Thanks. I'm a little surprised at the setting of importance to "Medium" as I think of bugs which corrupt data as only exceeded in seriousness by bugs which set the computer on fire. 🔥 I can well imagine that most users would prefer a bug which prevented software from running at all to a bug which silently corrupts their data.

scoder (scoder) wrote :

> I'm a little surprised at the setting of importance to "Medium" as I think of bugs which corrupt data as only exceeded in seriousness by bugs which set the computer on fire.

I mostly agree. However, it's only an issue on Windows, which is not an important platform for most users. Linux is still widely dominant for processing platforms.
