lxml.html.html5parser crashes when given non-unicode input in python3

Bug #1673355 reported by Ondergetekende
Status: Fix Released

Bug Description

The first thing html5parser.fromstring does is check whether the document starts with '<html' or '<!doctype'. However, it does that using str literals, not bytes literals. This is fine when the input is already decoded (although that triggers #1654544), but when the input is bytes, the check raises a TypeError (see the reproduction case).

Between this bug and #1654544, there is no way to use html5parser.fromstring in Python 3.
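
The type mismatch can be shown without lxml at all. The sketch below is only an illustration: `prefix_check` is a hypothetical stand-in that copies the shape of the check quoted in the reproduction, and it assumes Python 3.

```python
# Stand-alone illustration; it does not import lxml.
def prefix_check(html):
    # Same shape as the check in html5parser.fromstring: str literals only.
    start = html[:50].lstrip().lower()
    return start.startswith('<html') or start.startswith('<!doctype')


# str input: slicing, lstrip() and lower() all return str, so the check works.
print(prefix_check("<html></html>"))

# bytes input: `start` is bytes, and bytes.startswith() rejects a str prefix.
try:
    prefix_check(b"<html></html>")
except TypeError as exc:
    print("TypeError:", exc)
```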

## Version info

Python : sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0)
lxml.etree : (3, 7, 3, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

## Reproduction:

TypeError Traceback (most recent call last)
<ipython-input-5-4992379850ea> in <module>()
----> 1 lxml.html.html5parser.fromstring(b"<html></html>")

/home/koert/.virtualenvs/aquatic_turd/lib/python3.5/site-packages/lxml/html/html5parser.py in fromstring(html, guess_charset, parser)
    149 # document starts with doctype or <html>, full document!
    150 start = html[:50].lstrip().lower()
--> 151 if start.startswith('<html') or start.startswith('<!doctype'):
    152 return doc

TypeError: startswith first arg must be bytes or a tuple of bytes, not str
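
One way to make the prefix check accept both input types is to pick the literals based on the type of the slice. This is a minimal sketch under that assumption, not necessarily the patch that was merged; `looks_like_full_document` is a hypothetical name and is not part of lxml.

```python
def looks_like_full_document(html):
    """Hypothetical helper: True if html starts with <html or a doctype.

    Accepts either str or bytes; the prefixes are chosen to match the
    type of the input so that startswith() never mixes the two.
    """
    start = html[:50].lstrip().lower()
    if isinstance(start, bytes):
        prefixes = (b'<html', b'<!doctype')
    else:
        prefixes = ('<html', '<!doctype')
    # startswith() accepts a tuple and returns True if any prefix matches.
    return start.startswith(prefixes)


print(looks_like_full_document(b"<html></html>"))
print(looks_like_full_document("  <!DOCTYPE html><html></html>"))
print(looks_like_full_document(b"<p>fragment</p>"))
```

Decoding bytes input before the check would also avoid the TypeError, but it requires knowing the encoding up front, which is why this sketch compares bytes against bytes instead.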

Ondergetekende (kvdveer) wrote :

Patched & added unit tests

scoder (scoder)
Changed in lxml:
assignee: nobody → Ondergetekende (kvdveer)
importance: Undecided → Medium
status: New → Fix Committed
scoder (scoder) wrote :

Released in lxml 3.8.0.

Changed in lxml:
status: Fix Committed → Fix Released
