After a certain point in a Unicode document, HTMLParser stops sending tag events
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Triaged | Undecided | Unassigned |
Bug Description
This bug was originally reported by one of my users, here: https:/
"LXML parser incorrectly parses strings longer than 2^14 characters (but correctly parses the same strings when encoded to bytes). The string after the 16384'th character is treated as individual characters, rather than words and tags."
I reproduced this problem using only lxml code (attached), so I'm filing the issue here.
This looks very similar to https:/
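For readers without access to the attachment, a reproduction along these lines can be sketched as follows. This is not the attached script: `TagCounter`, `build_document` and `count_tags` are hypothetical names, and the generated document is only an assumed stand-in for any HTML string well over 2^14 characters. It feeds the same markup to `etree.HTMLParser` once as bytes and once as a `str`, and counts the start-tag events each run delivers to a parser target.

```python
from lxml import etree


class TagCounter:
    """Parser target that counts start-tag events (hypothetical helper)."""

    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        self.count += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # The value returned here is what parser.close() hands back.
        return self.count


def build_document(paragraphs=5000):
    """Build an HTML string far longer than 2**14 characters."""
    body = "<p>some words here</p>" * paragraphs
    return "<html><body>" + body + "</body></html>"


def count_tags(markup):
    """Feed bytes or str to a fresh HTMLParser and return the tag count."""
    parser = etree.HTMLParser(target=TagCounter())
    parser.feed(markup)
    return parser.close()


if __name__ == "__main__":
    doc = build_document()
    print(count_tags(doc.encode("utf-8")), "tags when fed bytes")
    print(count_tags(doc), "tags when fed unicode")
```

If the two printed counts differ, the installed lxml/libxml2 combination shows the behaviour described in this report.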
My version information:
Python : sys.version_
lxml.etree : (4, 2, 3, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)
This was originally reported on Python 3.6.5 and lxml 4.2.1.
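To compare against another installation, the same figures can be gathered with a short script such as the sketch below. The `version_report` function name is hypothetical, and reading the Python version from `sys.version_info` is an assumption, since that entry is cut off in the listing above. The `etree` constants are the ones lxml exposes for its own version and for the libxml2/libxslt versions it was compiled against and is running with.

```python
import sys

from lxml import etree


def version_report():
    """Collect the version tuples relevant to this bug (hypothetical helper)."""
    return {
        "Python": tuple(sys.version_info),  # assumed source of the Python entry
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


if __name__ == "__main__":
    for name, version in version_report().items():
        print("%-18s: %s" % (name, version))
```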
Changed in lxml: status: New → Triaged
Thanks for the report. I get this output for libxml2 2.9.8:
11080 tags when fed bytes
557 tags when fed unicode
and this for libxml2 2.9.7 (and 2.9.3):
11080 tags when fed bytes
11080 tags when fed unicode
This suggests that there is a ... difference in behaviour ... in libxml2 2.9.8.
I don't currently have the time to check if it's a known bug or if there is a workaround.
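Until that is checked, the numbers above hint at one possible stopgap: pass bytes instead of a `str`. The following is a minimal sketch under that assumption only, not a confirmed fix; `parse_html_text` is a hypothetical helper name. It encodes the text as UTF-8 and tells `HTMLParser` which encoding to expect, so non-ASCII characters are still decoded correctly.

```python
from lxml import etree


def parse_html_text(text):
    """Parse a str by handing the parser UTF-8 bytes (hypothetical helper)."""
    data = text.encode("utf-8")
    # Declare the encoding explicitly so the parser does not have to guess.
    parser = etree.HTMLParser(encoding="utf-8")
    return etree.fromstring(data, parser)


if __name__ == "__main__":
    root = parse_html_text("<html><body><p>caf\u00e9</p></body></html>")
    print(root.findtext(".//p"))
```

Whether this sidesteps the problem in every case has not been verified here; it relies only on the observation that bytes input produced the full tag count.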