lxml.html.html5parser crashes with html5lib when given unicode input
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
lxml | Fix Released | Medium | Unassigned | |
Bug Description
Using the latest version of both lxml and html5lib:
>>> import html5lib
>>> html5lib.
u'0.999999999'
>>> import lxml.etree
>>> lxml.etree.
(3, 7, 1, 0)
Trying to use html5parser.
$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import html5parser
>>> html5parser.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/
guess_
File "/home/
return parser.parse(html, useChardet=
File "/home/
self.
File "/home/
self.tokenizer = _tokenizer.
File "/home/
self.stream = HTMLInputStream
File "/home/
return HTMLUnicodeInpu
TypeError: __init__() got an unexpected keyword argument 'useChardet'
Details about installed packages:
Python : sys.version_
lxml.etree : (3, 7, 1, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
I also get the same problem using Python 3:
$ python
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import html5parser
>>> html5parser.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/
guess_
File "/home/
return parser.parse(html, useChardet=
File "/home/
self.
File "/home/
self.tokenizer = _tokenizer.
File "/home/
self.stream = HTMLInputStream
File "/home/
return HTMLUnicodeInpu
TypeError: __init__() got an unexpected keyword argument 'useChardet'
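The tracebacks above suggest that the `useChardet` keyword is forwarded all the way down to a unicode input stream whose constructor does not accept it. The following is a minimal stdlib-only sketch of that failure mode and of one possible guard. The names `UnicodeStream`, `ByteStream`, `open_stream`, `parse_unconditional` and `parse_guarded` are hypothetical stand-ins for illustration, not the real html5lib or lxml code, and the guard is an assumption about how such a bug could be avoided rather than the fix that was actually released.

```python
class UnicodeStream:
    """Hypothetical stand-in: a text stream that takes no encoding hints."""

    def __init__(self, source):
        self.source = source


class ByteStream:
    """Hypothetical stand-in: a byte stream that accepts an encoding hint."""

    def __init__(self, source, useChardet=True):
        self.source = source
        self.use_chardet = useChardet


def open_stream(source, **kwargs):
    # Dispatch on the input type and forward all keywords unchanged,
    # mirroring the call chain visible in the traceback.
    if isinstance(source, str):
        return UnicodeStream(source, **kwargs)
    return ByteStream(source, **kwargs)


def parse_unconditional(html, guess_charset=True):
    # Always passes useChardet, so text input ends in a TypeError.
    return open_stream(html, useChardet=guess_charset)


def parse_guarded(html, guess_charset=True):
    # Only pass the charset-detection hint when the input is bytes;
    # already-decoded text has no encoding left to guess.
    options = {}
    if isinstance(html, bytes):
        options["useChardet"] = guess_charset
    return open_stream(html, **options)


if __name__ == "__main__":
    try:
        parse_unconditional("<p>hello</p>")
    except TypeError as exc:
        print("unconditional call failed:", exc)
    print("guarded call returned:", type(parse_guarded("<p>hello</p>")).__name__)
```

With this sketch, text input only fails on the unconditional path, while byte input works on both paths.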
description: updated
Changed in lxml: status: Fix Committed → Fix Released
This wasn't picked up by CI because the html5lib tests were skipped: html5lib was missing from the CI environment.
To trigger CI, I also created a GitHub pull request: https://github.com/lxml/lxml/pull/232
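Tests that are silently skipped when an optional dependency is missing can hide regressions like this one. As a rough sketch of one way to make that visible, the helper below skips locally but fails loudly when a CI job opts in. The function `optional_dependency_available` and the environment variable `REQUIRE_OPTIONAL_DEPS` are hypothetical names chosen for this example and are not part of lxml's test suite.

```python
import importlib.util
import os
import unittest


def optional_dependency_available(name, require_env="REQUIRE_OPTIONAL_DEPS"):
    """Report whether the top-level module `name` can be imported.

    If the (hypothetical) environment variable named by `require_env` is
    set to "1", a missing module raises RuntimeError instead of returning
    False, so a CI job cannot skip the tests unnoticed.
    """
    found = importlib.util.find_spec(name) is not None
    if not found and os.environ.get(require_env) == "1":
        raise RuntimeError("optional dependency %r is required here" % name)
    return found


@unittest.skipUnless(optional_dependency_available("html5lib"),
                     "html5lib is not installed")
class Html5libSmokeTest(unittest.TestCase):
    def test_import(self):
        # Only runs when html5lib is importable.
        import html5lib
        self.assertTrue(hasattr(html5lib, "parse"))
```

A CI configuration could then export `REQUIRE_OPTIONAL_DEPS=1` so that a missing html5lib aborts the run instead of shrinking it.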