lxml.html.html5parser crashes with html5lib when given unicode input

Bug #1654544 reported by Elias Dorneles da Silveira Junior
This bug affects 4 people
Affects  Status        Importance  Assigned to  Milestone
lxml     Fix Released  Medium      Unassigned

Bug Description

Using the latest version of both lxml and html5lib:

>>> import html5lib
>>> html5lib.__version__
u'0.999999999'
>>> import lxml.etree
>>> lxml.etree.LXML_VERSION
(3, 7, 1, 0)

Calling html5parser.fromstring with unicode text input fails with a TypeError about an unexpected keyword argument:

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import html5parser
>>> html5parser.fromstring(u'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/lxml/html/html5parser.py", line 147, in fromstring
    guess_charset=guess_charset)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/lxml/html/html5parser.py", line 64, in document_fromstring
    return parser.parse(html, useChardet=guess_charset).getroot()
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'useChardet'

Details about installed packages:

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 7, 1, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

I also get the same problem using Python 3:

$ python
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import html5parser
>>> html5parser.fromstring('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/lxml/html/html5parser.py", line 147, in fromstring
    guess_charset=guess_charset)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/lxml/html/html5parser.py", line 64, in document_fromstring
    return parser.parse(html, useChardet=guess_charset).getroot()
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'useChardet'
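
The tracebacks above suggest a pattern where keyword arguments are forwarded unchanged to one of two stream classes, only one of which accepts `useChardet`. The following is a minimal sketch of that pattern, not html5lib's actual code: `UnicodeStream`, `BinaryStream`, `open_stream` and `try_open` are hypothetical stand-ins written for illustration.

```python
class UnicodeStream:
    """Hypothetical stand-in for a text stream: takes no charset options."""

    def __init__(self, source):
        self.source = source


class BinaryStream:
    """Hypothetical stand-in for a byte stream: charset guessing applies."""

    def __init__(self, source, useChardet=True):
        self.source = source
        self.useChardet = useChardet


def open_stream(source, **kwargs):
    # Dispatch on the input type and forward every keyword argument as-is.
    if isinstance(source, str):
        return UnicodeStream(source, **kwargs)
    return BinaryStream(source, **kwargs)


def try_open(source, **kwargs):
    # Report either the class that was built or the TypeError that was raised.
    try:
        return type(open_stream(source, **kwargs)).__name__
    except TypeError as exc:
        return "TypeError: %s" % exc


print(try_open(b"", useChardet=True))  # byte input accepts the option
print(try_open("", useChardet=True))   # text input rejects the option
print(try_open(""))                    # text input without the option works
```

In this sketch the caller cannot safely pass `useChardet` without knowing which class the dispatcher will pick, which matches the failure reported here for unicode input.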

description: updated
Ondergetekende (kvdveer) wrote :

CI did not pick this up because the html5lib tests were skipped: html5lib was missing from the CI environment.

To trigger CI, I also created a GitHub pull request: https://github.com/lxml/lxml/pull/232
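
As an illustration of how a missing optional dependency can hide a failure, here is a minimal sketch using the standard `unittest` skip mechanism. It is not lxml's test setup: `has_module`, `make_suite` and `run` are hypothetical helpers, and the module name in the usage lines is made up.

```python
import importlib.util
import unittest


def has_module(name):
    # True if a top-level module with this name can be imported.
    return importlib.util.find_spec(name) is not None


def make_suite(dependency):
    # The whole test class is skipped when the dependency is not installed.
    @unittest.skipUnless(has_module(dependency), dependency + " is not installed")
    class ParserTest(unittest.TestCase):
        def test_parse(self):
            self.fail("this failure is only seen if the test actually runs")

    return unittest.defaultTestLoader.loadTestsFromTestCase(ParserTest)


def run(dependency):
    # Run the suite and return the collected result object.
    result = unittest.TestResult()
    make_suite(dependency).run(result)
    return result


missing = run("module_that_does_not_exist_xyz")
print(missing.wasSuccessful(), len(missing.skipped))

present = run("unittest")
print(present.wasSuccessful(), len(present.failures))
```

A skipped test counts as a successful run, so the build stays green until the dependency is actually installed in the CI environment.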

Ondergetekende (kvdveer) wrote :

And the CI patch

David Allouche (ddaa) wrote :

Bug still present in lxml==3.8.0.

scoder (scoder) wrote :
Changed in lxml:
milestone: none → 3.9.0
importance: Undecided → Medium
status: New → Fix Committed
scoder (scoder) wrote :

Note that I have backed out PR #232. Instead, I changed the option setup to pass "useChardet" only if it appears safe, and leave it to the user to request it explicitly if needed. In the case of file(-like) object input, for example, users would know best if they opened the file in bytes mode or with an encoding.
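
One way to read "pass useChardet only if it appears safe" is sketched below. This is an assumption-laden illustration, not the actual lxml change: the helper name `build_parse_options` is hypothetical, and so is the rule that charset guessing defaults to on for bytes input and is left out entirely for text input.

```python
def build_parse_options(html, guess_charset=None):
    """Return the keyword arguments to forward to the parser (hypothetical)."""
    options = {}
    if guess_charset is None and isinstance(html, bytes):
        # Assumption: guessing the charset only makes sense for raw bytes.
        guess_charset = True
    if guess_charset is not None:
        # Forward the option only when it was inferred or explicitly requested.
        options["useChardet"] = guess_charset
    return options


print(build_parse_options(""))                       # text: option omitted
print(build_parse_options(b""))                      # bytes: guessing enabled
print(build_parse_options("", guess_charset=False))  # explicit request is kept
```

With a rule like this, text input no longer triggers the TypeError by default, while a caller who passes the option explicitly still takes responsibility for it, as the comment above describes for file(-like) input.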

scoder (scoder)
Changed in lxml:
status: Fix Committed → Fix Released