lxml.html.html5parser crashes when given non-unicode input in python3

Bug #1673355 reported by Ondergetekende
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: Ondergetekende (kvdveer)

Bug Description

The first thing html5parser.fromstring does is check whether the document starts with '<html'. However, it makes that check with str literals, not bytes literals. This is fine when the input is already decoded (although that triggers #1654544), but when the input is bytes, the call crashes with a TypeError (see the reproduction case).

Between this bug and #1654544, there is no way to use html5parser.fromstring in Python 3: bytes input hits this bug, and str input hits #1654544.
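
For readers who want to see the underlying type mismatch without lxml installed, here is a minimal sketch in plain Python 3. The variable names (`data`, `mixed_ok`, `bytes_ok`) are illustrative only and do not come from lxml.

```python
# In Python 3, bytes and str are distinct types, and startswith() refuses
# to compare one against the other.
data = b"<html></html>"
start = data[:50].lstrip().lower()  # still bytes after slicing/stripping

try:
    start.startswith('<html')  # str prefix against a bytes value
    mixed_ok = True
except TypeError:
    mixed_ok = False  # this is the failure reported in this bug

bytes_ok = start.startswith(b'<html')  # bytes prefix against a bytes value
```

The same check succeeds once the prefix has the same type as the input, which suggests the comparison needs to be type-aware.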

## Version info

Python : sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0)
lxml.etree : (3, 7, 3, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

## Reproduction:

TypeError Traceback (most recent call last)
<ipython-input-5-4992379850ea> in <module>()
----> 1 lxml.html.html5parser.fromstring(b"<html></html>")

/home/koert/.virtualenvs/aquatic_turd/lib/python3.5/site-packages/lxml/html/html5parser.py in fromstring(html, guess_charset, parser)
    149 # document starts with doctype or <html>, full document!
    150 start = html[:50].lstrip().lower()
--> 151 if start.startswith('<html') or start.startswith('<!doctype'):
    152 return doc

TypeError: startswith first arg must be bytes or a tuple of bytes, not str
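
One way the check at lines 150-151 of the traceback could be made type-aware is sketched below. This is an assumption about a possible approach, not necessarily what the committed patch does; `looks_like_full_document` is a hypothetical helper name that does not exist in lxml.

```python
def looks_like_full_document(html):
    """Hypothetical helper: report whether html starts with <html or a doctype.

    Accepts either bytes or str and picks prefixes of the matching type.
    """
    start = html[:50].lstrip().lower()
    if isinstance(start, bytes):
        # bytes input: compare against bytes prefixes
        prefixes = (b'<html', b'<!doctype')
    else:
        # str input: compare against str prefixes
        prefixes = ('<html', '<!doctype')
    # startswith() accepts a tuple and returns True if any prefix matches
    return start.startswith(prefixes)
```

With this sketch, `looks_like_full_document(b"<html></html>")` and `looks_like_full_document("<!DOCTYPE html>")` both return a result instead of raising.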

Ondergetekende (kvdveer) wrote :

Patched & added unit tests

scoder (scoder)
Changed in lxml:
assignee: nobody → Ondergetekende (kvdveer)
importance: Undecided → Medium
status: New → Fix Committed
scoder (scoder) wrote :

Released in lxml 3.8.0.

Changed in lxml:
status: Fix Committed → Fix Released