After a certain point in a Unicode document, HTMLParser stops sending tag events

Bug #1781797 reported by Leonard Richardson on 2018-07-15
This bug affects 1 person
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

This bug was originally reported by one of my users, here: https://bugs.launchpad.net/beautifulsoup/+bug/1762514

"LXML parser incorrectly parses strings longer than 2^14 characters (but correctly parses the same strings when encoded to bytes). The string after the 16384'th character is treated as individual characters, rather than words and tags."

I reproduced this problem using only lxml code (attached), so I'm filing the issue here.

This looks very similar to https://bugs.launchpad.net/beautifulsoup/+bug/963880, a bug I filed against lxml six years ago. In that case the problem happened almost immediately; here it happens on a larger document.

My version information:

Python : sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
lxml.etree : (4, 2, 3, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

This was originally reported on Python 3.6.5 and lxml 4.2.1.
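
The attached script is not reproduced on this page. As a rough sketch of the kind of comparison it describes, the following feeds the same document to lxml's HTMLParser once as bytes and once as unicode, counting start-tag events with a parser target. The names `TagCounter`, `build_document`, `count_tags` and `compare`, and the generated test document, are assumptions made for illustration; they are not taken from the attachment.

```python
class TagCounter(object):
    """Parser target that counts start-tag events (hypothetical helper)."""

    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        self.count += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # lxml returns this value from parser.close().
        return self.count


def build_document(paragraphs=2000):
    """Build a unicode HTML document well over 2**14 characters long."""
    body = u"".join(u"<p>paragraph %d</p>" % i for i in range(paragraphs))
    return u"<html><body>" + body + u"</body></html>"


def count_tags(markup):
    """Feed markup (bytes or unicode) to HTMLParser; return the tag count."""
    from lxml import etree  # third-party; imported lazily on purpose

    parser = etree.HTMLParser(target=TagCounter())
    parser.feed(markup)
    return parser.close()


def compare():
    """Print tag counts for the same document as bytes and as unicode."""
    doc = build_document()
    print("%d tags when fed bytes" % count_tags(doc.encode("utf-8")))
    print("%d tags when fed unicode" % count_tags(doc))
```

Calling `compare()` on an unaffected installation should print the same count twice; a lower count for the unicode input would match the behaviour described above.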

Leonard Richardson (leonardr) wrote :
scoder (scoder) wrote :

Thanks for the report. I get this output for libxml2 2.9.8:

11080 tags when fed bytes
557 tags when fed unicode

and this for libxml2 2.9.7 (and 2.9.3):

11080 tags when fed bytes
11080 tags when fed unicode

This suggests that there is a ... difference in behaviour ... in libxml2 2.9.8.
I don't currently have the time to check if it's a known bug or if there is a workaround.
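
Since the byte input parses correctly in the outputs above, one possible stopgap is to encode unicode markup before handing it to the parser. This is only a sketch based on that observation, not a confirmed workaround: the helpers `as_parser_input` and `parse_html` are hypothetical names, and byte input is assumed to already be in the given encoding.

```python
def as_parser_input(markup, encoding="utf-8"):
    """Return markup as bytes, encoding unicode input first."""
    if isinstance(markup, bytes):
        # Assumption: bytes passed in are already in `encoding`.
        return markup
    return markup.encode(encoding)


def parse_html(markup, encoding="utf-8"):
    """Parse HTML from bytes so the unicode code path is never used."""
    from lxml import etree  # third-party; imported lazily on purpose

    # Tell the parser which encoding the bytes are in, since the
    # document itself may not declare one.
    parser = etree.HTMLParser(encoding=encoding)
    return etree.fromstring(as_parser_input(markup, encoding), parser)
```

`etree.LIBXML_VERSION` reports the libxml2 version in use, so a caller could limit this to installations where it equals `(2, 9, 8)`.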
