Serializing without specified encoding corrupts document
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Confirmed | Medium | Unassigned | | |
Bug Description
I recently ran a global transformation job on 1,042 XML documents using lxml. On inspection, I noticed that three of the documents had been corrupted, each losing a segment of a text node in a part of the document completely unrelated to the transformation. Upgrading to the latest lxml with pip did not eliminate the corruption. I have created a repro case using one of the three documents.
The environment (on Windows Server 2016 Standard):
Python : sys.version_
lxml.etree : (4, 9, 2, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
The repro script:
#!/usr/bin/env python3
from sys import version_info
from lxml import etree
ENVIRONMENT = (
    ("Python", version_info),
    ("lxml.etree", etree.LXML_
    ("libxml used", etree.LIBXML_
    ("libxml compiled", etree.LIBXML_
    ("libxslt used", etree.LIBXSLT_
    ("libxslt compiled", etree.LIBXSLT_
)
for label, value in ENVIRONMENT:
    print(
root = etree.parse(
for node in root.findall(
    if node.get(
        title = "".join(
        title += " (PDQ\xae)"
        node.text = title
with open("corrupted
    fp.
### END OF REPRO SCRIPT ###
I will attach the script, with the input XML document and the corrupted result. After running the script, examining the serialization of the transformed file will show that
... en los niños y adolescentes sobrevivientes de cáncer, consultar el sumario del PDQ ...
on line 64 of the original file has become
... en los niños y adolescentes sobrevivientes de cá ...
on that same line of the corrupted output document, chopping from the middle of the word "cáncer" through "PDQ" and dropping that fragment on the floor.
The corruption is avoided if encoding="utf-8" is specified in the invocation of tostring(), so that non-ASCII characters are written as UTF-8 bytes rather than as character entities in ASCII output. Swapping in ElementTree from the Python standard library ("from xml.etree import ElementTree as etree") also eliminates the corruption (at the cost of slower processing).
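For anyone who wants to check their own documents, here is a minimal sketch of the two serialization modes described above, with a round-trip test that flags lost text. The sample XML and its element names (Summary, Title, Para) are hypothetical stand-ins, not the attached document, and the helper text_of() is my own. A document this small is not expected to trigger the corruption; the sketch only shows how the check and the workaround fit together.

```python
# Prefer lxml, as in the report; fall back to the standard library so the
# sketch still runs where lxml is not installed.
try:
    from lxml import etree
except ImportError:
    from xml.etree import ElementTree as etree

# Hypothetical stand-in for the real input (element names are assumptions).
SOURCE = (
    "<Summary><Title>Example</Title>"
    "<Para>sobrevivientes de c\xe1ncer, consultar el sumario del PDQ</Para>"
    "</Summary>"
)


def text_of(element):
    """Concatenate every text node under element, in document order."""
    return "".join(element.itertext())


root = etree.fromstring(SOURCE)
root.find("Title").text += " (PDQ\xae)"  # same kind of edit as the repro script
expected = text_of(root)

# Default call: ASCII bytes, non-ASCII characters become character references.
ascii_bytes = etree.tostring(root)
# Workaround from the report: ask for UTF-8 output explicitly.
utf8_bytes = etree.tostring(root, encoding="utf-8")

# Re-parse each serialization and compare its text with the in-memory tree.
for label, data in (("default", ascii_bytes), ("utf-8", utf8_bytes)):
    intact = text_of(etree.fromstring(data)) == expected
    print(f"{label:8} {len(data):4d} bytes  text intact: {intact}")
```

Running the same comparison over each output of a batch job would catch a truncated text node like the one on line 64 without having to diff the files by eye.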
Here is the promised attachment.