lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !

Bug #1849810 reported by Nico Schlömer
| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| lxml | Invalid | Undecided | Unassigned | |

Bug Description

When trying to parse one of these files [1] with lxml (e.g., dragon.vtu), I get
```
Original exception was:
Traceback (most recent call last):
  File "d.py", line 4, in <module>
    tree = ET.parse("dragon.vtu", parser)
  File "src/lxml/etree.pyx", line 3467, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "dragon.vtu", line 28
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC9 0x0C 0x00 0x00, line 28, column 6
```
The offending bytes are indeed not proper UTF-8, but the element declares its encoding as raw:
```
  <AppendedData encoding="raw">
```
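As a quick sanity check, the four bytes quoted in the error message can be tested on their own. This is only a sketch; the helper name `is_valid_utf8` is made up for illustration. It shows that the sequence cannot be decoded as UTF-8: 0xC9 starts a two-byte sequence, and 0x0C is not a valid continuation byte.

```python
# Bytes copied from the error message: 0xC9 0x0C 0x00 0x00
reported = bytes([0xC9, 0x0C, 0x00, 0x00])


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(reported))
```

Note that the `encoding="raw"` attribute on `AppendedData` is a VTK-level hint. It is not an XML encoding declaration, so the XML parser still expects the element's content to be text in the document's encoding.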

Env:
```
Python : sys.version_info(major=3, minor=8, micro=0, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)
```

[1] https://github.com/topology-tool-kit/ttk-data

Nico Schlömer (nschloe) wrote :

It appears that this is not a bug in lxml but rather that the file violates the XML specification. The VTU spec says:

> There is one case in which the file is not a valid XML document. When the AppendedData section is not encoded as base64, raw binary data is present that may violate the XML specification. This is not default behavior, and must be explicitly enabled by the user.

This issue can be closed.
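For readers who still need the XML part of such a file, one possible workaround is to cut the raw payload out before handing the rest to an XML parser. The following is a minimal sketch, not something proposed in this report. It assumes the usual VTK layout, in which the payload starts right after an underscore that follows the `<AppendedData ...>` opening tag and runs up to the last `</AppendedData>`. The function name `split_vtu` is hypothetical, and the standard library's `xml.etree.ElementTree` is used here in place of lxml so the snippet has no third-party dependency.

```python
import xml.etree.ElementTree as ET


def split_vtu(data: bytes):
    """Split VTU bytes into (xml_root, raw_payload).

    Hypothetical helper; assumes the payload begins after the first "_"
    that follows the <AppendedData ...> opening tag.
    """
    tag_start = data.find(b"<AppendedData")
    if tag_start == -1:
        # No appended section: the whole file is ordinary XML.
        return ET.fromstring(data), b""
    tag_end = data.index(b">", tag_start) + 1
    marker = data.index(b"_", tag_end)
    # Use the last closing tag, since the binary payload could contain
    # the same byte sequence by chance.
    close = data.rindex(b"</AppendedData>")
    # Any whitespace the writer placed before the closing tag stays in
    # the payload; callers should rely on the sizes stored in the data.
    payload = data[marker + 1:close]
    # Reassemble a well-formed document with the payload removed.
    xml_only = data[:tag_end] + data[close:]
    return ET.fromstring(xml_only), payload


sample = (
    b'<VTKFile type="UnstructuredGrid">'
    b'<AppendedData encoding="raw">_\xc9\x0c\x00\x00</AppendedData>'
    b"</VTKFile>"
)
root, payload = split_vtu(sample)
print(root.tag, len(payload))
```

With a real file, the bytes would come from `open("dragon.vtu", "rb").read()`, and the returned payload would then be interpreted using the offsets and data types given in the XML header.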

scoder (scoder)
Changed in lxml:
status: New → Invalid