After a certain point in a Unicode document, HTMLParser stops sending tag events

Bug #1781797 reported by Leonard Richardson
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Triaged  Undecided   Unassigned

Bug Description

This bug was originally reported by one of my users, here: https://bugs.launchpad.net/beautifulsoup/+bug/1762514

"LXML parser incorrectly parses strings longer than 2^14 characters (but correctly parses the same strings when encoded to bytes). The string after the 16384'th character is treated as individual characters, rather than words and tags."

I reproduced this problem using only lxml code (attached), so I'm filing the issue here.
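
The attached script is not reproduced on this page. A minimal sketch of this kind of reproduction might look like the following. It relies on lxml's documented parser target interface (`HTMLParser(target=...)`, `feed()`, `close()`); the `TagCounter`, `build_document` and `count_tags` names and the generated document are illustrative assumptions, not the original attachment, so the tag counts it prints will differ from the ones quoted later in this report.

```python
try:
    from lxml import etree  # third-party; the library this report is about
except ImportError:  # keeps the helpers importable when lxml is missing
    etree = None


class TagCounter(object):
    """Parser target that counts start-tag events (hypothetical helper)."""

    def __init__(self):
        self.tags = 0

    def start(self, tag, attrib):
        self.tags += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # lxml returns this value from parser.close()
        return self.tags


def build_document(paragraphs=5000):
    """Return a unicode HTML document well past 2**14 characters."""
    body = u"".join(u"<p>paragraph %d</p>" % i for i in range(paragraphs))
    return u"<html><body>%s</body></html>" % body


def count_tags(markup):
    """Feed markup (bytes or unicode) to HTMLParser; requires lxml."""
    parser = etree.HTMLParser(target=TagCounter())
    parser.feed(markup)
    return parser.close()


if __name__ == "__main__" and etree is not None:
    doc = build_document()
    print("%d tags when fed bytes" % count_tags(doc.encode("utf8")))
    print("%d tags when fed unicode" % count_tags(doc))
```

If the two printed counts differ, the installed lxml/libxml2 combination is likely affected; the user's report suggests that encoding the string to bytes before feeding it sidesteps the problem.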

This looks very similar to https://bugs.launchpad.net/beautifulsoup/+bug/963880, a bug I filed against lxml six years ago. In that case the problem happened almost immediately; here it happens on a larger document.

My version information:

Python : sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
lxml.etree : (4, 2, 3, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)
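
For anyone retesting against another lxml or libxml2 release, a listing like the one above can be produced with a few lines of Python. This is only a sketch: `version_report` is a hypothetical helper name, while the `LXML_VERSION`, `LIBXML_VERSION`, `LIBXML_COMPILED_VERSION`, `LIBXSLT_VERSION` and `LIBXSLT_COMPILED_VERSION` attributes are the ones `lxml.etree` exposes.

```python
import sys


def version_report(etree):
    """Format the version attributes of an lxml.etree-like module."""
    rows = [
        ("Python", sys.version_info),
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return ["%-17s: %s" % row for row in rows]


if __name__ == "__main__":
    try:
        from lxml import etree  # third-party dependency
    except ImportError:
        print("lxml is not installed")
    else:
        print("\n".join(version_report(etree)))
```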

This was originally reported on Python 3.6.5 and lxml 4.2.1.

Leonard Richardson (leonardr) wrote :
scoder (scoder) wrote :

Thanks for the report. I get this output for libxml2 2.9.8:

11080 tags when fed bytes
557 tags when fed unicode

and this for libxml2 2.9.7 (and 2.9.3):

11080 tags when fed bytes
11080 tags when fed unicode

This suggests that there is a ... difference in behaviour ... in libxml2 2.9.8.
I don't currently have the time to check if it's a known bug or if there is a workaround.

scoder (scoder) wrote :

Probably worth testing again, now that libxml2 2.9.9 is released.

scoder (scoder)
Changed in lxml:
status: New → Triaged

Leonard Richardson (leonardr) wrote :

I can verify that this bug was fixed sometime between lxml 4.2.5 and lxml 4.4.1.
