lxml.html.html5parser crashes with html5lib when given unicode input

Bug #1654544 reported by Elias Dorneles da Silveira Junior on 2017-01-06
This bug affects 4 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: 3.9.0

Bug Description

Using the latest version of both lxml and html5lib:

>>> import html5lib
>>> html5lib.__version__
u'0.999999999'
>>> import lxml.etree
>>> lxml.etree.LXML_VERSION
(3, 7, 1, 0)

Calling html5parser.fromstring with unicode text input fails with a TypeError about an unexpected keyword argument:

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import html5parser
>>> html5parser.fromstring(u'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/lxml/html/html5parser.py", line 147, in fromstring
    guess_charset=guess_charset)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/lxml/html/html5parser.py", line 64, in document_fromstring
    return parser.parse(html, useChardet=guess_charset).getroot()
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/home/elias/.virtualenvs/tmp-6aaa3c35e219018b/local/lib/python2.7/site-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'useChardet'

Details about installed packages:

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 7, 1, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

I also get the same problem using Python 3:

$ python
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import html5parser
>>> html5parser.fromstring('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/lxml/html/html5parser.py", line 147, in fromstring
    guess_charset=guess_charset)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/lxml/html/html5parser.py", line 64, in document_fromstring
    return parser.parse(html, useChardet=guess_charset).getroot()
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/home/elias/.virtualenvs/tmp-200d3a9b52ebdd89/lib/python3.4/site-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'useChardet'
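
Both tracebacks end the same way: the keyword is forwarded through `**kwargs` until it reaches a constructor that does not accept it. The snippet below is a minimal sketch of that pattern using stand-in classes. `UnicodeStream`, `BinaryStream` and `open_stream` are hypothetical names invented for illustration, not html5lib's real classes, and the sketch assumes (as the traceback suggests) that only the byte-oriented stream takes `useChardet`.

```python
class UnicodeStream:
    """Stand-in for a stream over already-decoded text (hypothetical)."""

    def __init__(self, source):
        # No encoding detection is needed, so no useChardet parameter.
        self.source = source


class BinaryStream:
    """Stand-in for a stream over raw bytes (hypothetical)."""

    def __init__(self, source, useChardet=True):
        self.source = source
        self.useChardet = useChardet


def open_stream(source, **kwargs):
    # Dispatch on the input type and forward all keyword arguments
    # unchanged, as the factory in the traceback appears to do.
    if isinstance(source, str):
        return UnicodeStream(source, **kwargs)
    return BinaryStream(source, **kwargs)


# Bytes input accepts the keyword.
open_stream(b"", useChardet=True)

# Text input does not: the keyword reaches UnicodeStream.__init__
# and raises TypeError, matching the shape of the failure above.
try:
    open_stream("", useChardet=True)
except TypeError as exc:
    print("TypeError:", exc)
```

Under that assumption, any caller that passes `useChardet` unconditionally will fail for text input, whatever value it passes.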

Ondergetekende (kvdveer) wrote :

The reason this wasn't picked up by CI is that the html5lib tests were skipped because html5lib was missing from the CI environment.

To trigger CI, I also created a GitHub pull request: https://github.com/lxml/lxml/pull/232

Ondergetekende (kvdveer) wrote :

And the CI patch

David Allouche (ddaa) wrote :

Bug still present in lxml==3.8.0.

scoder (scoder) wrote :
Changed in lxml:
milestone: none → 3.9.0
importance: Undecided → Medium
status: New → Fix Committed
scoder (scoder) wrote :

Note that I have backed out PR #232. Instead, I changed the option setup to pass "useChardet" only if it appears safe, and leave it to the user to request it explicitly if needed. In the case of file(-like) object input, for example, users would know best if they opened the file in bytes mode or with an encoding.
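
The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed patch: `build_parse_options` is a hypothetical helper name, and "appears safe" is assumed here to mean that the input is a bytes object.

```python
def build_parse_options(html, guess_charset=None):
    """Return keyword arguments for the parser's parse() call.

    Hypothetical helper: "useChardet" is included only when the caller
    asked for it explicitly, or when the input is bytes (assumed here
    to be the case where passing it is safe).
    """
    options = {}
    if guess_charset is None and isinstance(html, bytes):
        # Byte input has an unknown encoding, so default to guessing.
        guess_charset = True
    if guess_charset is not None:
        # An explicit request is always forwarded; that choice is
        # left to the user.
        options["useChardet"] = guess_charset
    return options


print(build_parse_options(""))                      # text input
print(build_parse_options(b""))                     # bytes input
print(build_parse_options("", guess_charset=False)) # explicit request
```

The resulting dictionary would then be unpacked into the call shown in the traceback, along the lines of `parser.parse(html, **options)`. For file(-like) objects the sketch adds nothing by default, so the caller has to pass `guess_charset` explicitly.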

scoder (scoder) on 2017-09-19
Changed in lxml:
status: Fix Committed → Fix Released