utf-8 BOM creates issues with parsing with fromstring

Bug #1789041 reported by Tim Tisdall on 2018-08-25
This bug affects 1 person
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

>>> from lxml.html import fromstring
>>> t = u"""\xef\xbb\xbf<!DOCTYPE html><html><head><title>test</title></head><body><h1>test</h1></body></html>"""
>>> tree = fromstring(t)
>>> print(tree)
<Element div at 0x7fdd6c9de940>
>>> tree.head
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 298, in head
    return self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
IndexError: list index out of range
>>>

According to Wikipedia, `EF BB BF` is the byte order mark (BOM) for UTF-8.

Python : sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 2, 4, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

Tim Tisdall (tisdall) on 2018-08-25
description: updated
Tim Tisdall (tisdall) wrote :

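I think the problem here is that I'm passing u"\xef\xbb\xbf" when it should be "\xef\xbb\xbf" (note that the first is a unicode string). The former is already-decoded text, so the three escapes are read as the characters U+00EF, U+00BB and U+00BF rather than as a BOM, while the latter is a raw byte string with no encoding applied yet. The latter is properly detected as UTF-8 and works correctly.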
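
The difference between the two literals can be sketched with the standard library alone. This is an illustrative example, not lxml code: it is written for Python 3, where a plain "..." literal corresponds to the u"..." literal from the report and b"..." corresponds to the Python 2 byte string. The helper name `decode_html_bytes` is hypothetical.

```python
import codecs

# Raw bytes, e.g. as read from a file: these three bytes are the real UTF-8 BOM.
raw = b"\xef\xbb\xbf<!DOCTYPE html><html><head><title>test</title></head></html>"

# Text string (Python 3 str, like u"..." in Python 2): the same escapes are now
# three separate characters, U+00EF U+00BB U+00BF, not a BOM.
text = "\xef\xbb\xbf<!DOCTYPE html><html><head><title>test</title></head></html>"

starts_with_bom = raw.startswith(codecs.BOM_UTF8)

# Decoding the bytes as UTF-8 turns the BOM into the single character U+FEFF.
decoded = raw.decode("utf-8")

# The text variant has no BOM to detect; re-encoding its first three
# characters yields six bytes instead of the three BOM bytes.
reencoded_prefix = text[:3].encode("utf-8")


def decode_html_bytes(data):
    """Hypothetical helper: decode UTF-8 bytes and drop a leading BOM if present."""
    return data.decode("utf-8-sig")


clean = decode_html_bytes(raw)
```

Under these assumptions, the reliable options are to hand the parser the undecoded bytes, as in the comment above, or to decode with the "utf-8-sig" codec first so the BOM never reaches it.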

Changed in lxml:
status: New → Invalid