lxml

After a certain point in a Unicode document, HTMLParser stops sending tag events

Bug #1781797 reported by Leonard Richardson on 2018-07-15

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Triaged   Undecided    Unassigned

Bug Description

This bug was originally reported by one of my users, here: https://bugs.launchpad.net/beautifulsoup/+bug/1762514

"LXML parser incorrectly parses strings longer than 2^14 characters (but correctly parses the same strings when encoded to bytes). Everything after the 16384th character is treated as individual characters, rather than words and tags."

I reproduced this problem using only lxml code (attached), so I'm filing the issue here.
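
For readers without access to the attachment, here is a minimal sketch of the kind of comparison described above. It is not the attached script. The names TagCounter, count_tags and N_PARAGRAPHS and the generated document are assumptions made for illustration, and this particular input is not guaranteed to trigger the bug. It feeds the same markup to lxml's HTMLParser once as bytes and once as a unicode string, and counts the start-tag events each time.

```python
from lxml import etree


class TagCounter(object):
    """Parser target that counts start-tag events (hypothetical helper)."""

    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        self.count += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return self.count


def count_tags(markup):
    """Feed markup (bytes or unicode) to HTMLParser and count start tags."""
    target = TagCounter()
    parser = etree.HTMLParser(target=target)
    parser.feed(markup)
    parser.close()
    return target.count


# Assumed test document: well over 2^14 characters long.
N_PARAGRAPHS = 5000
doc = u"<html><body>" + u"<p>some text</p>" * N_PARAGRAPHS + u"</body></html>"

bytes_count = count_tags(doc.encode("utf8"))
unicode_count = count_tags(doc)

print("%d tags when fed bytes" % bytes_count)
print("%d tags when fed unicode" % unicode_count)
```

If the two counts differ, the installed lxml and libxml2 combination is probably affected.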

This looks very similar to https://bugs.launchpad.net/beautifulsoup/+bug/963880, a bug I filed against lxml six years ago. In that case the problem happened almost immediately; here it happens on a larger document.

My version information:

Python : sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
lxml.etree : (4, 2, 3, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

This was originally reported on Python 3.6.5 and lxml 4.2.1.
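
For anyone comparing against their own setup, version information in this form can be collected with a short script like the one below. This is a sketch using the version attributes that lxml.etree exposes; the helper name version_report is an assumption for illustration.

```python
import sys

from lxml import etree


def version_report():
    """Return version lines for Python, lxml, libxml2 and libxslt."""
    entries = [
        ("Python", sys.version_info),
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return ["%-20s: %s" % (name, value) for name, value in entries]


for line in version_report():
    print(line)
```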

Leonard Richardson (leonardr) wrote on 2018-07-15:

Attachment: test_17625144.py (856 bytes, text/x-python)

scoder (scoder) wrote on 2018-07-16:

Thanks for the report. I get this output for libxml2 2.9.8:

11080 tags when fed bytes
557 tags when fed unicode

and this for libxml2 2.9.7 (and 2.9.3):

11080 tags when fed bytes
11080 tags when fed unicode

This suggests that there is a ... difference in behaviour ... in libxml2 2.9.8.
I don't currently have the time to check if it's a known bug or if there is a workaround.
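
Since both the original report and the numbers above show byte input parsing correctly, one possible stopgap is to encode text to bytes before handing it to the parser. The following is only a sketch under that assumption, not a confirmed fix. The helper name parse_html_as_bytes is hypothetical, and the choice of UTF-8 is an assumption.

```python
from lxml import etree


def parse_html_as_bytes(markup):
    """Parse HTML, encoding unicode input to UTF-8 bytes first."""
    if not isinstance(markup, bytes):
        markup = markup.encode("utf-8")
    # Tell the parser which encoding the bytes are in.
    parser = etree.HTMLParser(encoding="utf-8")
    return etree.fromstring(markup, parser)


root = parse_html_as_bytes(u"<html><body><p>hello</p></body></html>")
print(root.tag)
```

Byte strings are passed through unchanged and are assumed to be UTF-8 already.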

scoder (scoder) wrote on 2019-01-29:

Probably worth testing again, now that libxml2 2.9.9 is released.

scoder (scoder) on 2019-08-11