How to handle a source file with a BOM

Bug #1274118 reported by Charlie_X
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: none listed

Bug Description

This may not really be a bug, but my library has been hit by it: any file that starts with a BOM, in this case <U+FEFF>, will not work with lxml. I couldn't find anything in the FAQ. Maybe there should be an entry for it?

See https://bitbucket.org/ericgazoni/openpyxl/issue/260/182-crashes-when-trying-to-find-dimension for the downstream problem.

To reproduce the error, just unzip the file attached to the bug and then run:

from lxml.etree import fromstring
p = fromstring("Issues/bug260/xl/worksheets/sheet1.xml")

Is there a way in lxml to fix this? Or should I just pass in the source from the first "<"?

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 3, 0, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

It should generally work for parsing files, but not for iterparse(), i.e. incremental parsing, which openpyxl uses AFAICT.

The code snippet you showed makes no sense, but I guess iterparse() is the problem here.

The fix would have to be implemented in lxml. This isn't easy to work around; I guess you'd have to implement your own file wrapper and make it skip over the BOM if you find one...
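
For illustration, the file wrapper suggested above might look roughly like the following. This is only a minimal sketch, not what lxml itself does: the class name BOMSkippingReader is hypothetical, it only handles the UTF-8 BOM, and it assumes the wrapped object is opened in binary mode and that its read() returns the requested number of bytes unless the end of the file is reached.

```python
import codecs
import io


class BOMSkippingReader(object):
    """Hypothetical wrapper that hides a leading UTF-8 BOM from the parser."""

    def __init__(self, fileobj):
        self._fileobj = fileobj   # binary file-like object, e.g. from a zip archive
        self._checked = False     # has the start of the stream been inspected yet?
        self._pending = b""       # non-BOM bytes consumed while inspecting

    def read(self, size=-1):
        if not self._checked:
            self._checked = True
            head = self._fileobj.read(len(codecs.BOM_UTF8))
            if head != codecs.BOM_UTF8:
                # Not a BOM: hand these bytes back to the caller later.
                self._pending = head
        if self._pending:
            if size is None or size < 0:
                data = self._pending + self._fileobj.read()
                self._pending = b""
                return data
            data, self._pending = self._pending[:size], self._pending[size:]
            if len(data) < size:
                data += self._fileobj.read(size - len(data))
            return data
        return self._fileobj.read(size)


# Example with an in-memory stream standing in for a real file.
source = io.BytesIO(codecs.BOM_UTF8 + b"<root><child/></root>")
cleaned = BOMSkippingReader(source).read()
```

Since the wrapper exposes read(), it could presumably be handed to iterparse() in place of the file name, as iterparse() also accepts file-like objects. Per the last comment in this report, the wrapper should not be needed from lxml 3.3.1 on.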

Charlie_X (charlie) wrote :

Yes, we use incremental parsing because some of the files can be quite big.

You get a clearer error when using "fromstring", which is why I used it. It looks like the BOM is for UTF-16, despite the declared encoding of UTF-8.
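
One way to check which BOM a file really starts with is to compare its first bytes against the BOM constants in Python's codecs module. The sketch below is just that, a sketch: the helper name detect_bom and the sample bytes are made up for illustration and are not taken from the attached file.

```python
import codecs

# Longer BOMs come first, because the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data):
    """Return (encoding_name, bom_length) for the BOM at the start of data.

    Returns (None, 0) if data does not start with a known BOM.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Hypothetical sample: a UTF-8 BOM followed by an XML declaration.
sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><a/>'
encoding, length = detect_bom(sample)
```

Comparing the reported encoding with the one in the XML declaration would show whether the two actually disagree.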

The code and error with iterparse:

it = iterparse("Issues/bug260/xl/worksheets/sheet1.xml")
<lxml.etree.iterparse object at 0x10d865b90>

for e, t in it: print e
Traceback (most recent call last):
  File "/Applications/WingIDE.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/Users/charlieclark/Projects/openpyxl/lib/python2.7/site-packages/lxml/etree.so", line 179, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:124400)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

I'll see if I can come up with a workaround for openpyxl. It's a bit tricky because we interface with files inside a zip archive. But maybe lxml could raise a nicer error, closer to the one you get when fromstring is used?

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
status: New → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 3.3.1.

Changed in lxml:
status: Fix Committed → Fix Released