lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
When trying to parse one of these files [1] with lxml (e.g., dragon.vtu), I get:
```
Original exception was:
Traceback (most recent call last):
File "d.py", line 4, in <module>
tree = ET.parse(
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "dragon.vtu", line 28
lxml.etree.
Bytes: 0xC9 0x0C 0x00 0x00, line 28, column 6
```
The critical part is indeed not proper UTF-8, but the encoding is correctly indicated as raw.
```
<AppendedData encoding="raw">
```
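The four bytes reported in the traceback can be checked without lxml. Below is a minimal sketch using only the Python standard library; `is_valid_utf8` is a hypothetical helper name chosen for illustration, not part of lxml.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes taken from the error message: "Bytes: 0xC9 0x0C 0x00 0x00"
reported = bytes([0xC9, 0x0C, 0x00, 0x00])

# 0xC9 opens a two-byte UTF-8 sequence, but 0x0C is not a continuation byte.
print(is_valid_utf8(reported))  # False
```

Note that the `encoding="raw"` attribute is a VTK-level attribute on the `AppendedData` element. As far as I can tell, an XML parser does not interpret it; the document encoding it uses comes from the XML declaration or a byte order mark. Separately, 0x0C and 0x00 are not permitted characters in XML 1.0, so the payload would be rejected even if it happened to decode.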
Env:
```
Python : sys.version_
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)
```
Changed in lxml: status: New → Invalid
It appears that this is not a bug in lxml but rather that the file violates the XML specification. The VTU spec says:
> There is one case in which the file is not a valid XML document. When the AppendedData section is not encoded as base64, raw binary data is present that may violate the XML specification. This is not default behavior, and must be explicitly enabled by the user.
This issue can be closed.
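
For anyone who still needs to read such a file, one possible workaround is to cut the raw payload out before handing the rest to an XML parser. The following is only a sketch under stated assumptions: that the file has a single `AppendedData` element, that the payload starts right after the `_` marker, and that the last `</AppendedData>` in the file is the real closing tag. `split_raw_appended` is a hypothetical helper, not an lxml or VTK API, and the standard library parser is used here so the example has no dependencies.

```python
import xml.etree.ElementTree as ET


def split_raw_appended(data: bytes):
    """Split VTK XML bytes into (parseable XML, raw appended payload).

    Hypothetical helper; assumes one AppendedData element whose payload
    begins directly after the '_' marker.
    """
    tag_start = data.find(b"<AppendedData")
    if tag_start == -1:
        return data, b""  # nothing appended, parse as-is

    tag_end = data.index(b">", tag_start) + 1   # end of the opening tag
    marker = data.index(b"_", tag_end)          # '_' precedes the payload
    close = data.rindex(b"</AppendedData>")     # last closing tag in the file

    # The payload may carry trailing whitespace before the closing tag; the
    # true block sizes should be taken from the offsets in the file header.
    payload = data[marker + 1:close]
    xml_bytes = data[:tag_end] + data[close:]
    return xml_bytes, payload


sample = (
    b'<?xml version="1.0"?>\n<VTKFile type="UnstructuredGrid">\n'
    b'<AppendedData encoding="raw">\n_\xc9\x0c\x00\x00</AppendedData>\n'
    b"</VTKFile>\n"
)
xml_bytes, payload = split_raw_appended(sample)
root = ET.fromstring(xml_bytes)  # parses, since the binary block is gone
```

The `sample` bytes are made up for illustration and are not taken from dragon.vtu. The same `xml_bytes` could presumably be passed to `lxml.etree.fromstring` instead, though that is untested here.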