lxml

How to handle a source file with a BOM

Bug #1274118 reported by Charlie_X on 2014-01-29

This bug affects 2 people

Affects  Status        Importance  Assigned to  Milestone
lxml     Fix Released  Medium      scoder

Bug Description

This may not really be a bug, but my library has been hit by it: any file that starts with a BOM (in this case <U+FEFF>) will not work with lxml. I couldn't find anything in the FAQ. Maybe there should be an entry?

See https://bitbucket.org/ericgazoni/openpyxl/issue/260/182-crashes-when-trying-to-find-dimension for the downstream problem.

To reproduce the error, unzip the file attached to the bug and then run:

from lxml.etree import fromstring
p = fromstring("Issues/bug260/xl/worksheets/sheet1.xml")

Is there a way in lxml to fix this? Or should I just pass in the source from the first "<"?

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 3, 0, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote on 2014-01-29:

It should generally work for parsing files, but not for iterparse(), i.e. incremental parsing, which openpyxl uses AFAICT.

The code snippet you showed makes no sense, but I guess iterparse() is the problem here.

The fix would have to be implemented in lxml. This isn't easy to work around; I guess you'd have to implement your own file wrapper and make it skip over the BOM if you find one...
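
For illustration, the file-wrapper idea could look roughly like the sketch below. It is not the fix that went into lxml; the class name BOMSkippingReader is hypothetical, and the sketch assumes a binary stream and strips only a UTF-8 BOM, since a UTF-16 BOM is what lets a parser detect that encoding and should probably be left alone.

```python
import codecs
import io


class BOMSkippingReader(object):
    """Hypothetical wrapper that hides a leading UTF-8 BOM from read() callers."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        # Look at the first three bytes once, up front.
        head = fileobj.read(len(codecs.BOM_UTF8))
        if head == codecs.BOM_UTF8:
            head = b""  # drop the BOM
        # Anything that was not a BOM must still be handed to the caller.
        self._pending = head

    def read(self, size=-1):
        if self._pending:
            if size is None or size < 0:
                data = self._pending + self._fileobj.read()
                self._pending = b""
            else:
                # A short read is allowed; the next call continues from here.
                data = self._pending[:size]
                self._pending = self._pending[size:]
            return data
        return self._fileobj.read(size)


with_bom = BOMSkippingReader(io.BytesIO(codecs.BOM_UTF8 + b"<root/>"))
without_bom = BOMSkippingReader(io.BytesIO(b"<root/>"))
```

Because the wrapper only exposes read(), it should be usable wherever a parser accepts a file-like object, for example as the source argument of iterparse().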

Charlie_X (charlie) wrote on 2014-01-29:

Yes, we use incremental parsing because some of the files can be quite big.

You get a clearer error when using "fromstring", which is why I used it. It looks like the BOM is for UTF-16, despite the declared encoding of UTF-8.
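
One way to check which encoding a BOM belongs to is to compare the leading bytes against the constants in Python's codecs module. U+FEFF is the same code point in every Unicode encoding and only its byte serialization differs (EF BB BF in UTF-8, FF FE or FE FF in UTF-16). The helper name detect_bom below is hypothetical, and this is only a sketch of the check, not something lxml provides.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOM_TABLE = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOM_TABLE:
        if data.startswith(bom):
            return name
    return None
```

Calling detect_bom() on the first few bytes of the file, read in binary mode, shows whether the BOM agrees with the encoding named in the XML declaration.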

The code and error with iterparse:

it = iterparse("Issues/bug260/xl/worksheets/sheet1.xml")
<lxml.etree.iterparse object at 0x10d865b90>

for e, t in it: print e
Traceback (most recent call last):
  File "/Applications/WingIDE.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/Users/charlieclark/Projects/openpyxl/lib/python2.7/site-packages/lxml/etree.so", line 179, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:124400)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

I'll see if I can come up with a workaround for openpyxl. It's a bit tricky because we work with files inside a zip archive. But maybe lxml could raise a nicer error, closer to the one you get when fromstring is used?
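
A possible shape for such a workaround, sketched with the standard zipfile module only: open the archive member as a stream, peek at the first three bytes, and consume them if they are a UTF-8 BOM. The function name open_member_skipping_bom and the in-memory archive are assumptions made for this example; this is not what openpyxl actually does.

```python
import codecs
import io
import zipfile


def open_member_skipping_bom(archive, member_name):
    """Open a zip member for reading, positioned just past any UTF-8 BOM."""
    stream = archive.open(member_name)
    # peek() does not advance the position, so a BOM-free member is untouched.
    if stream.peek(3)[:3] == codecs.BOM_UTF8:
        stream.read(3)
    return stream


# Build a small in-memory archive whose member starts with a BOM.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", codecs.BOM_UTF8 + b"<worksheet/>")

with zipfile.ZipFile(buf) as zf:
    payload = open_member_skipping_bom(zf, "xl/worksheets/sheet1.xml").read()
```

The returned stream stays lazy, so a large worksheet is not read into memory before an incremental parser starts consuming it.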

scoder (scoder) wrote on 2014-01-29:

Implemented here:

https://github.com/lxml/lxml/commit/d15fa099057968bbe7a407c7cee6d0b17245dcec