lxml

Bug #1274118
Comment #2

Charlie_X (charlie) wrote on 2014-01-29:

Yes, we use incremental parsing because some of the files can be quite big.

You get a clearer error when using "fromstring", which is why I used it. It looks like the BOM is for UTF-16 despite the declared encoding of UTF-8.
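For illustration, here is a minimal sketch of how such a mismatch between the BOM and the declared encoding could be checked by hand. It uses only the standard library; the helper names (`sniff_bom`, `declared_encoding`, `encoding_mismatch`) are made up for this example and are not part of lxml or openpyxl.

```python
import codecs
import re

# Longer UTF-32 marks come first because they begin with the UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")


def sniff_bom(data):
    """Return (encoding name, BOM length) for the leading BOM, or (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def declared_encoding(data):
    """Return the lower-cased encoding named in the XML declaration, if any."""
    bom_enc, skip = sniff_bom(data)
    head = data[skip:skip + 400]
    # Try a single-byte reading first, then the encoding the BOM suggests.
    candidates = ["latin-1"]
    if bom_enc:
        candidates.append(bom_enc)
    for enc in candidates:
        match = _DECL.search(head.decode(enc, errors="replace"))
        if match:
            return match.group(1).lower()
    return None


def encoding_mismatch(data):
    """Return a message if BOM and declaration disagree, else None."""
    bom_enc, _ = sniff_bom(data)
    decl = declared_encoding(data)
    if bom_enc is None or decl is None:
        return None

    def squash(name):
        return name.lower().replace("-", "").replace("_", "")

    if squash(bom_enc).startswith(squash(decl)):
        return None
    return "BOM indicates %s but the XML declaration says %s" % (bom_enc, decl)
```

Calling `encoding_mismatch()` on the first few hundred bytes of the worksheet would return a message in the situation described above and `None` for a consistent file.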

The code and error with iterparse:

it = iterparse("Issues/bug260/xl/worksheets/sheet1.xml")
<lxml.etree.iterparse object at 0x10d865b90>

for e, t in it: print e
Traceback (most recent call last):
  File "/Applications/WingIDE.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/Users/charlieclark/Projects/openpyxl/lib/python2.7/site-packages/lxml/etree.so", line 179, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:124400)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

I'll see if I can come up with a workaround for openpyxl. It's a bit tricky because we work with files inside a zip archive. But maybe lxml could raise a more helpful error here, closer to the one you get when fromstring is used?
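A rough sketch of what such a workaround might look like, not what openpyxl actually does. It assumes the bytes after the stray UTF-16 BOM really are UTF-8 as declared, which has not been verified for the file in this report. The function name `iter_member` is hypothetical, and the standard library's `xml.etree.ElementTree.iterparse` stands in for `lxml.etree.iterparse`, which also accepts a file-like object.

```python
import codecs
import zipfile
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def iter_member(archive, name):
    """Incrementally parse one member of an open ZipFile.

    A leading UTF-16 BOM is skipped before parsing. This is only safe
    under the assumption that the rest of the member is UTF-8.
    """
    with archive.open(name) as member:
        # peek() looks at the buffered bytes without consuming them.
        if member.peek(2)[:2] in UTF16_BOMS:
            member.read(2)  # consume the stray BOM
        for event, element in ET.iterparse(member):
            yield event, element
```

Because the member is still read as a stream, this keeps the incremental parsing that large worksheets need, e.g. `for event, element in iter_member(zipfile.ZipFile(path), "xl/worksheets/sheet1.xml"): ...` with `path` pointing at the workbook.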