lxml

lxml.html.html5parser crashes when given non-unicode input in python3

Bug #1673355 reported by Ondergetekende on 2017-03-16

Affects   Status         Importance   Assigned to      Milestone
lxml      Fix Released   Medium       Ondergetekende

Bug Description

The first thing html5parser.fromstring does is check whether the document starts with '<html'. However, it does that by comparing against str literals, not bytes literals. This is fine when the input is already decoded (although that triggers #1654544), but when the input is bytes, the check raises a TypeError (see the reproduction case).

Between this bug and #1654544, there is no way to use html5parser.fromstring in Python 3.
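
To illustrate the mismatch described above, here is a minimal sketch in plain Python 3 that does not need lxml. The helper name `looks_like_full_document` is hypothetical and is not part of lxml; it only shows one way the prefix check could accept both str and bytes, and is not necessarily how the released fix works.

```python
# In Python 3, bytes.startswith() rejects a str argument with a TypeError,
# which is the failure hit by the check in html5parser.fromstring.
try:
    b"<html></html>".startswith('<html')
except TypeError as exc:
    print("TypeError:", exc)


def looks_like_full_document(html):
    """Hypothetical helper (not lxml API): return True if the input
    starts with <html or a doctype, for either str or bytes input."""
    start = html[:50].lstrip().lower()
    # Pick prefixes of the same type as the input so startswith() works.
    if isinstance(start, bytes):
        prefixes = (b'<html', b'<!doctype')
    else:
        prefixes = ('<html', '<!doctype')
    return start.startswith(prefixes)


print(looks_like_full_document(b"<html></html>"))
print(looks_like_full_document("<!DOCTYPE html><html></html>"))
print(looks_like_full_document(b"<p>just a fragment</p>"))
```

Choosing the prefix type from the input avoids decoding the bytes, so no assumption about the document's encoding is needed for this check.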

## Version info

Python : sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0)
lxml.etree : (3, 7, 3, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

## Reproduction:

lxml.html.html5parser.fromstring(b"<html></html>")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-4992379850ea> in <module>()
----> 1 lxml.html.html5parser.fromstring(b"<html></html>")

/home/koert/.virtualenvs/aquatic_turd/lib/python3.5/site-packages/lxml/html/html5parser.py in fromstring(html, guess_charset, parser)
    149     # document starts with doctype or <html>, full document!
    150     start = html[:50].lstrip().lower()
--> 151     if start.startswith('<html') or start.startswith('<!doctype'):
    152         return doc
    153