UTF-8 BOM causes parsing problems with fromstring
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
>>> from lxml.html import fromstring
>>> t = u"""\xef\
>>> tree = fromstring(t)
>>> print(tree)
<Element div at 0x7fdd6c9de940>
>>> tree.head
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/
return self.xpath(
IndexError: list index out of range
>>>
According to Wikipedia, `EF BB BF` is the byte order mark (BOM) for UTF-8.
Python : sys.version_
lxml.etree : (4, 2, 4, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)
I think the problem here is that I'm passing u"\xef\xbb\xbf" when it should be "\xef\xbb\xbf" (note that the first is a unicode string and the second a byte string, which is written b"\xef\xbb\xbf" on Python 3). The former implies an encoding, while the latter has none. The latter is properly detected as UTF-8 and works correctly.
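To make the difference concrete, here is a minimal sketch using only the standard library and assuming Python 3. `strip_utf8_bom` is a hypothetical helper of mine, not part of lxml, and the sample markup is made up because the original string in the report is cut off. It shows that the unicode form holds three ordinary characters rather than a BOM, and how the input could be normalized before handing it to `fromstring`.

```python
import codecs

# The three BOM bytes read as Latin-1 text: the characters U+00EF, U+00BB, U+00BF.
MISDECODED_BOM = codecs.BOM_UTF8.decode("latin-1")


def strip_utf8_bom(markup):
    """Hypothetical helper: return markup as text with no leading UTF-8 BOM."""
    if isinstance(markup, bytes):
        # "utf-8-sig" decodes UTF-8 and drops a leading EF BB BF if present.
        return markup.decode("utf-8-sig")
    if markup.startswith(MISDECODED_BOM):
        # Text whose BOM bytes were turned into three separate characters.
        return markup[len(MISDECODED_BOM):]
    if markup.startswith("\ufeff"):
        # Text that was decoded correctly but still carries the BOM character.
        return markup[1:]
    return markup


# Made-up sample markup, not the string from the report.
as_bytes = b"\xef\xbb\xbf<html><head></head><body><div>x</div></body></html>"
as_text = u"\xef\xbb\xbf<html><head></head><body><div>x</div></body></html>"

# Only the byte string really starts with the UTF-8 BOM.
print(as_bytes.startswith(codecs.BOM_UTF8))
# Both forms normalize to the same markup.
print(strip_utf8_bom(as_bytes) == strip_utf8_bom(as_text))
```

Passing the bytes straight to `fromstring`, as described above, is probably still the simplest fix. The helper only matters when the markup has already been turned into text somewhere upstream.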