How to handle a source file with a BOM
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Fix Released | Medium | scoder | |
Bug Description
This may not really be a bug, but my library has been hit by it: any file that starts with a BOM, in this case <U+FEFF>, will not work with lxml. I couldn't find anything about this in the FAQ. Maybe there should be an entry?
See https:/
To reproduce the error just unzip the file attached to the bug and then
from lxml.etree import fromstring
p = fromstring("Issues/bug260/xl/worksheets/
Is there a way in lxml to fix this? Or should I just pass in the source from the first "<"?
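The "pass in the source from the first `<`" idea can be sketched as follows for input that is already in memory as bytes. This is only a minimal sketch: the helper name `strip_utf8_bom` is hypothetical, and it assumes the BOM is the UTF-8 encoded form of U+FEFF (bytes EF BB BF).

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper, not part of lxml. Only the UTF-8 form of
    U+FEFF is handled; UTF-16/UTF-32 BOMs are left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

The result could then be handed to the parser instead of the raw bytes. Checking for the BOM explicitly is a little safer than searching for the first "<", since it leaves any other unexpected leading content in place to be reported as an error.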
Python : sys.version_
lxml.etree : (3, 3, 0, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
It should generally work for parsing files, but not for iterparse(), i.e. incremental parsing, which openpyxl uses AFAICT.
The code snippet you showed makes no sense, but I guess iterparse() is the problem here.
The fix would have to be implemented in lxml. This isn't easy to work around; I guess you'd have to implement your own file wrapper and make it skip over the BOM if it finds one...
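A file wrapper along those lines might look like the sketch below. It is not lxml code: the class name `BOMSkippingReader` is hypothetical, it only handles the UTF-8 BOM, and it assumes the wrapped object is a binary file whose first `read()` call returns at least three bytes when that many are available (true for `io.BytesIO` and buffered files).

```python
import codecs
import io


class BOMSkippingReader:
    """Hypothetical wrapper around a binary file object that drops a
    leading UTF-8 BOM and otherwise passes reads through unchanged."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        # Bytes read ahead during the BOM check; None means "not checked yet".
        self._pending = None

    def read(self, size=-1):
        if self._pending is None:
            head = self._fileobj.read(len(codecs.BOM_UTF8))
            # Discard the BOM; keep anything else to return to the caller.
            self._pending = b"" if head == codecs.BOM_UTF8 else head
        if size is None or size < 0:
            data = self._pending + self._fileobj.read()
            self._pending = b""
            return data
        # Serve buffered bytes first, then top up from the underlying file.
        data, self._pending = self._pending[:size], self._pending[size:]
        if len(data) < size:
            data += self._fileobj.read(size - len(data))
        return data
```

Since iterparse() accepts file-like objects, an instance wrapping the opened worksheet file could be passed in place of the file itself. Whether that fully avoids the problem on affected lxml versions is untested here.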