XMLParser.feed() ignores Unicode data longer than about 512 characters
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Fix Released | Medium | scoder | | |
Bug Description
I may be using the feed() interface incorrectly, but this is so random it looks like a bug.
If XMLParser.feed() receives a Unicode string longer than a certain length (551 characters for me; one of my users reports 1092 characters), XMLParser does not call any of the target object's hook methods. If the same string is split into chunks of 512 characters and the chunks are passed to feed() one at a time, the hook methods are called.
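The chunking workaround described above can be sketched roughly as follows. This is not the attached script; `iter_chunks` and `feed_in_chunks` are hypothetical helper names, and the 512-character default is simply the chunk size mentioned in this report. The parser is assumed to be an `lxml.etree.XMLParser` created with a `target=` object.

```python
def iter_chunks(text, chunk_size=512):
    """Yield consecutive slices of `text`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def feed_in_chunks(parser, text, chunk_size=512):
    """Feed `text` to `parser` piece by piece, then close the parser.

    `parser` is assumed to be an lxml.etree.XMLParser built with target=...,
    but any object that provides feed() and close() will do.
    """
    for chunk in iter_chunks(text, chunk_size):
        parser.feed(chunk)
    # For a target parser, close() returns whatever the target's close() returns.
    return parser.close()
```

With lxml installed, a call would look like `feed_in_chunks(etree.XMLParser(target=my_target), document)`, where `my_target` stands for whatever target object is in use.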
The problem occurs in Python 2 and Python 3. The problem does not occur with bytestrings or when using HTMLParser.feed().
The attached script demonstrates the problem by parsing bytestring and Unicode documents of varying lengths using HTMLParser and XMLParser. In each case, the target object considers the test a success if it was notified of the start of the <root> tag. Only failures are printed.
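For readers without the attachment, the shape of such a test might look like the sketch below. It is a reconstruction under assumptions, not the attached script: `RootTarget`, `make_doc`, and `run_case` are hypothetical names, and the padded `<root>` document is only one way to reach a given length.

```python
class RootTarget(object):
    """Parser target that records whether the <root> start tag was reported."""

    def __init__(self):
        self.saw_root = False

    def start(self, tag, attrib):
        if tag == "root":
            self.saw_root = True

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return self.saw_root


def make_doc(length, as_unicode=True):
    """Build a <root> document of exactly `length` characters (minimum 13)."""
    open_tag, close_tag = u"<root>", u"</root>"
    padding = u"a" * (length - len(open_tag) - len(close_tag))
    doc = open_tag + padding + close_tag
    return doc if as_unicode else doc.encode("ascii")


def run_case(parser, doc):
    """Feed `doc` in one call; return None on success or a failure message.

    `parser` is assumed to have been created with target=RootTarget().
    """
    try:
        parser.feed(doc)
        saw_root = parser.close()
    except Exception as exc:
        return "Exception: %s" % exc
    return None if saw_root else "start of <root> was never reported"
```

Each case would then be something like `run_case(etree.XMLParser(target=RootTarget()), make_doc(1024))`, repeated for HTMLParser, for bytestrings, and for each length.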
Here are the results of running the test on Python 2.7.1:
01024 u XMLParser: Exception: internal error, line 1, column 46
02048 u XMLParser: Exception: internal error, line 1, column 46
04096 u XMLParser: Exception: internal error, line 1, column 46
08192 u XMLParser: Exception: Document is empty, line 1, column 1
16384 u XMLParser: Exception: internal error, line 1, column 46
Here are the results on Python 3.2.0:
01024 u XMLParser: Exception: internal error, line 1, column 46
04096 u XMLParser: Exception: internal error, line 1, column 46
16384 u XMLParser: Exception: internal error, line 1, column 46
Note that Python 3 is able to handle large Unicode strings of length 2048 and 8192; I don't know why.
The script also tests one more odd behavior I discovered, which might help isolate the problem. If I pass a large Unicode string into feed(), and then call feed() again on a very small bytestring, the large Unicode string becomes "unstuck" and hook methods are called on the target object after all.
Python 2 version info:
Python : sys.version_
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Python 3 version info:
Python : sys.version_
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Changed in lxml:
milestone: none → 3.0
Bug 972466 may be related.