lxml

XMLParser.feed() ignores Unicode data longer than about 512 characters

Bug #963936 reported by Leonard Richardson on 2012-03-24

This bug affects 2 people

Affects  Status        Importance  Assigned to  Milestone
lxml     Fix Released  Medium      scoder       lxml 3.0

Bug Description

I may be using the feed() interface incorrectly, but this is so random it looks like a bug.

If XMLParser.feed() receives a Unicode string longer than a certain length (551 characters for me; one of my users reports 1092 characters), XMLParser does not call any of the target object's hook methods. If the same string is split into chunks of 512 characters and the chunks are passed into feed() one at a time, the hook methods are called.

The problem occurs in Python 2 and Python 3. The problem does not occur with bytestrings or when using HTMLParser.feed().

The attached script demonstrates the problem by parsing bytestring and Unicode documents of varying lengths using HTMLParser and XMLParser. In each case, the target object considers the test a success if it was notified of the start of the <root> tag. Only failures are printed.
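For readers without the attachment, here is a minimal sketch of the chunked workaround described above. It is not the attached script. The names RootDetector and feed_in_chunks are made up for illustration, and the fallback to the standard library's xml.etree.ElementTree (which offers the same feed()/target interface) is only an assumption to keep the sketch runnable where lxml is not installed.

```python
try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Assumption: fall back to the stdlib parser, which exposes the same
    # XMLParser(target=...), feed() and close() interface.
    import xml.etree.ElementTree as etree


class RootDetector(object):
    """Hypothetical parser target: records whether <root> was started."""

    def __init__(self):
        self.saw_root = False

    def start(self, tag, attrib):
        if tag == "root":
            self.saw_root = True

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # The parser's close() returns whatever the target's close() returns.
        return self.saw_root


def feed_in_chunks(document, chunk_size=512):
    """Feed a Unicode document to XMLParser in pieces of chunk_size characters."""
    parser = etree.XMLParser(target=RootDetector())
    for offset in range(0, len(document), chunk_size):
        parser.feed(document[offset:offset + chunk_size])
    return parser.close()


# A Unicode document well past the lengths mentioned in this report.
document = u"<root>" + u"a" * 2048 + u"</root>"
saw_root = feed_in_chunks(document)
```

The chunk size of 512 is taken from the description above; it is not a documented limit, and smaller values should work as well.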

Here are the results of running the test on Python 2.7.1:

01024 u XMLParser: Exception: internal error, line 1, column 46
02048 u XMLParser: Exception: internal error, line 1, column 46
04096 u XMLParser: Exception: internal error, line 1, column 46
08192 u XMLParser: Exception: Document is empty, line 1, column 1
16384 u XMLParser: Exception: internal error, line 1, column 46

Here are the results on Python 3.2.0:

01024 u XMLParser: Exception: internal error, line 1, column 46
04096 u XMLParser: Exception: internal error, line 1, column 46
16384 u XMLParser: Exception: internal error, line 1, column 46

Note that Python 3 is able to handle large Unicode strings of length 2048 and 8192, the two lengths missing from the failure list above. I don't know why.

The script also tests one more odd behavior I discovered, which might help isolate the problem. If I pass a large Unicode string into feed(), and then call feed() again with a very small bytestring, the large Unicode string becomes "unstuck" and the hook methods are called on the target object after all.

Python 2 version info:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Python 3 version info:
Python : sys.version_info(major=3, minor=2, micro=0, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)