lxml

utf-8 BOM creates issues with parsing with fromstring

Bug #1789041 reported by Tim Tisdall on 2018-08-25

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
lxml	Invalid	Undecided	Unassigned

Bug Description

>>> from lxml.html import fromstring
>>> t = u"""\xef\xbb\xbf<!DOCTYPE html><html><head><title>test</title></head><body><h1>test</h1></body></html>"""
>>> tree = fromstring(t)
>>> print(tree)
<Element div at 0x7fdd6c9de940>
>>> tree.head
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 298, in head
    return self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
IndexError: list index out of range
>>>

According to Wikipedia, `EF BB BF` is the byte order mark (BOM) for UTF-8.

Python : sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 2, 4, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)


Tim Tisdall (tisdall) on 2018-08-25: description updated


Tim Tisdall (tisdall) wrote on 2018-08-27:

I think the problem here is that I'm passing u"\xef\xbb\xbf" when it should be "\xef\xbb\xbf" (note that the first is a unicode string). The former implies an encoding, while the latter has none. The latter is properly detected as UTF-8 and works correctly.

Changed in lxml:
status:	New → Invalid
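
The distinction described in the comment above can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3 (where text and bytes are separate types) rather than the Python 2.7 used in the report. The names `FAKE_BOM` and `repair_misdecoded_bom` are hypothetical helpers introduced here and are not part of lxml.

```python
import codecs

html = ("<!DOCTYPE html><html><head><title>test</title></head>"
        "<body><h1>test</h1></body></html>")

# Raw bytes: the first three bytes really are the UTF-8 BOM (EF BB BF).
as_bytes = codecs.BOM_UTF8 + html.encode("utf-8")

# Text built as in the report: three ordinary code points
# (U+00EF, U+00BB, U+00BF), which only look like the BOM bytes.
FAKE_BOM = u"\xef\xbb\xbf"  # hypothetical name, for illustration only
as_text = FAKE_BOM + html


def repair_misdecoded_bom(text):
    """Return UTF-8 bytes, swapping a leading fake BOM for a real one.

    Hypothetical helper: it assumes the rest of the text is intact and
    only the three leading characters came from mis-decoded BOM bytes.
    """
    if text.startswith(FAKE_BOM):
        return codecs.BOM_UTF8 + text[len(FAKE_BOM):].encode("utf-8")
    return text.encode("utf-8")


# Decoding the bytes with "utf-8-sig" drops the real BOM.
print(as_bytes.decode("utf-8-sig") == html)

# Encoding the text form does not produce a BOM: U+00EF becomes C3 AF.
print(as_text.encode("utf-8")[:3] == codecs.BOM_UTF8)

# The repaired bytes match the byte-string form.
print(repair_misdecoded_bom(as_text) == as_bytes)
```

Going by the comment above, the byte-string form (`as_bytes` here) is the one to pass to `fromstring`, since the parser can then detect the BOM itself.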
