Activity log for bug #1911356

Date Who What changed Old value New value Message
2021-01-13 00:03:47 Chris Wolf bug added bug
2021-01-13 00:03:47 Chris Wolf attachment added Python test script https://bugs.launchpad.net/bugs/1911356/+attachment/5452422/+files/testcase.py
2021-01-13 00:04:57 Chris Wolf description

Old value:
In my real-world case, I have XML files with "windows-1252" encoding. Sometimes the text body of XML elements contains a few incorrectly encoded characters and the parser just gives up and throws an error. The parser should just pass such text-body strings through, or have an option to allow that. However, the lxml parser converts the junk characters into numeric XML entities, which may be good for some situations, but not mine. Note that "xmllint --format sample.xml" passes them through without converting to numeric entities.

Specifically:

<?xml version="1.0" encoding="windows-1252"?>
<ROOT>
<SNM>SÍ¨ne</SNM><!-- string is hex: 53 c3 8d c2 a8 6e 65 -->
</ROOT>

Should NOT result in:

<?xml version="1.0" encoding="ASCII"?>
<ROOT>
<SNM>S&#205;&#168;ne</SNM>
</ROOT>

Script to reproduce is attached.

New value:
In my real-world case, I have XML files with "windows-1252" encoding. Sometimes the text body of XML elements contains a few incorrectly encoded characters and the parser converts these to XML numeric entities. The parser should just pass such text-body strings through, or have an option to allow that. However, the lxml parser converts the junk characters into numeric XML entities, which may be good for some situations, but not mine. Note that "xmllint --format sample.xml" passes them through without converting to numeric entities.

Specifically:

<?xml version="1.0" encoding="windows-1252"?>
<ROOT>
  <SNM>SÍ¨ne</SNM><!-- string is hex: 53 c3 8d c2 a8 6e 65 -->
</ROOT>

Should NOT result in:

<?xml version="1.0" encoding="ASCII"?>
<ROOT>
  <SNM>S&#205;&#168;ne</SNM>
</ROOT>

Script to reproduce is attached.
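
The numeric entities described in the report can be reproduced with a minimal sketch like the one below. It is not the attached testcase.py, whose contents are not shown in this log. The sample document is a hypothetical stand-in that stores the two characters as valid windows-1252 bytes, and the variable names are illustrative. The sketch assumes the entities come from serializing without an explicit encoding, which makes tostring() emit ASCII; whether that matches the reporter's exact situation is not confirmed here.

```python
# Prefer lxml when it is installed; the standard-library ElementTree offers
# the same fromstring()/tostring() calls used in this sketch.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample: the report's document with the two characters
# U+00CD and U+00A8 stored as valid windows-1252 bytes (0xCD, 0xA8).
source = (
    '<?xml version="1.0" encoding="windows-1252"?>'
    "<ROOT><SNM>S\u00cd\u00a8ne</SNM></ROOT>"
).encode("cp1252")

root = etree.fromstring(source)

# With no encoding argument, tostring() serializes as ASCII, so non-ASCII
# characters are written as numeric character references.
default_out = etree.tostring(root)

# Naming the target encoding writes the characters as raw bytes instead.
cp1252_out = etree.tostring(root, encoding="windows-1252")

print(default_out)
print(cp1252_out)
```

Passing the encoding explicitly is one possible way to keep the serialized output in the source encoding. It does not address bytes that are invalid in windows-1252 to begin with.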