lxml parser should just pass through junk characters for robustness

Bug #1911356 reported by Chris Wolf
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

In my real-world case, I have XML files with "windows-1252" encoding. Sometimes the text body of an XML element contains a few incorrectly encoded characters, and the lxml parser converts these into XML numeric character references. The parser should pass such text-body strings through unchanged, or have an option to allow that.

Converting junk characters into numeric character references may be good for some situations, but not mine. Note that "xmllint --format sample.xml" passes the text through without converting it to numeric character references.

Specifically:
<?xml version="1.0" encoding="windows-1252"?>
<ROOT>
  <SNM>SÍ¨ne</SNM><!-- string is hex: 53 c3 8d c2 a8 6e 65 -->
</ROOT>

Should NOT result in:
<?xml version="1.0" encoding="ASCII"?>
<ROOT>
  <SNM>S&#205;&#168;ne</SNM>
</ROOT>

Script to reproduce is attached.
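
For readers without the attachment, here is a minimal sketch of one plausible cause. It assumes the numeric references come from serializing with the default output encoding, which is ASCII for `etree.tostring()`, rather than from the parsing step itself; the attached script may do something different. The sample bytes 0xCD 0xA8 are chosen to match the `&#205;&#168;` output shown above. The import falls back to the standard library's ElementTree, which accepts the same calls and shares the ASCII default, so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Stand-in with the same fromstring/tostring calls and the same ASCII default.
    import xml.etree.ElementTree as etree

# The sample from the report as windows-1252 bytes (0xCD is Í, 0xA8 is ¨).
SAMPLE = (
    b'<?xml version="1.0" encoding="windows-1252"?>\n'
    b"<ROOT>\n  <SNM>S\xcd\xa8ne</SNM>\n</ROOT>\n"
)

root = etree.fromstring(SAMPLE)

# Default serialization targets ASCII, so characters outside ASCII
# are written as numeric character references.
escaped = etree.tostring(root)

# Naming an output encoding that can represent the characters lets the
# serializer write them directly instead of escaping them.
raw = etree.tostring(root, encoding="windows-1252")

print(escaped)
print(raw)
```

If this assumption holds, passing the source file's encoding to `tostring()` (or to `ElementTree.write()`) would keep the text body as it was read, without any change to the parser.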

Chris Wolf (wolfch) wrote :
description: updated