utf-8 BOM creates issues with parsing with fromstring

Bug #1789041 reported by Tim Tisdall
This bug affects 2 people
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

>>> from lxml.html import fromstring
>>> t = u"""\xef\xbb\xbf<!DOCTYPE html><html><head><title>test</title></head><body><h1>test</h1></body></html>"""
>>> tree = fromstring(t)
>>> print(tree)
<Element div at 0x7fdd6c9de940>
>>> tree.head
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 298, in head
    return self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
IndexError: list index out of range
>>>

According to Wikipedia, `EF BB BF` is the byte order mark (BOM) for UTF-8.

Python : sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 2, 4, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

Tim Tisdall (tisdall) wrote:

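I think the problem here is that I'm passing u"\xef\xbb\xbf" when it should be "\xef\xbb\xbf" (note that the first is a unicode string). The former is already-decoded text, so the three escapes are ordinary characters rather than a BOM, while the latter is a raw byte string with no encoding applied yet. The latter is properly detected as UTF-8 and works correctly.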
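
To illustrate the distinction, here is a minimal sketch using only the standard library. The helper name strip_utf8_bom is hypothetical and not part of lxml; it assumes the input really is UTF-8 encoded bytes.

```python
import codecs

# In a unicode literal the escapes are three separate characters
# (U+00EF, U+00BB, U+00BF), not a byte order mark.
text_literal = u"\xef\xbb\xbf<!DOCTYPE html><html></html>"

# In a byte string the same escapes are the actual UTF-8 BOM.
byte_literal = b"\xef\xbb\xbf<!DOCTYPE html><html></html>"


def strip_utf8_bom(data):
    """Hypothetical helper: decode UTF-8 bytes, dropping a leading BOM if present."""
    return data.decode("utf-8-sig")


# The unicode literal still starts with the three stray characters.
print(text_literal.startswith(u"\xef\xbb\xbf"))

# The byte string starts with the BOM constant from the codecs module.
print(byte_literal.startswith(codecs.BOM_UTF8))

# Decoding with "utf-8-sig" yields clean text with no BOM.
print(strip_utf8_bom(byte_literal))
```

Passing either the raw bytes or the text returned by the helper to fromstring should avoid the stray leading characters. That step is left out here because it depends on lxml being installed.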

Changed in lxml:
status: New → Invalid