XMLParser.feed() ignores Unicode data longer than about 512 characters

Bug #963936 reported by Leonard Richardson
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 3.0

Bug Description

I may be using the feed() interface incorrectly, but this is so random it looks like a bug.

If XMLParser.feed() receives a Unicode string longer than a certain length (for me it's 551 characters, one of my users reports 1092 characters), XMLParser does not call any of the target object's hook methods. If the same string is split into chunks of 512 characters, and the chunks passed into feed() one at a time, the hook methods are called.
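
A minimal sketch of the chunked feeding described above. The helper names iter_chunks and feed_in_chunks are hypothetical (they are not part of lxml), and the only assumption about the parser is that it offers feed() and close(), as lxml.etree.XMLParser does.

```python
def iter_chunks(text, size=512):
    """Yield successive slices of `text`, each at most `size` items long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def feed_in_chunks(parser, text, size=512):
    """Feed `text` to `parser` one chunk at a time and return close()'s result.

    `parser` is any object with feed() and close(), for example an
    lxml.etree.XMLParser created with a target object (assumption: the
    caller constructs it; this helper does not import lxml itself).
    """
    for chunk in iter_chunks(text, size):
        parser.feed(chunk)
    return parser.close()
```

With lxml this would be called as feed_in_chunks(etree.XMLParser(target=my_target), document), where my_target is whatever target object receives the hook calls. The default of 512 mirrors the chunk size mentioned in the report; it is not a documented limit.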

The problem occurs in Python 2 and Python 3. The problem does not occur with bytestrings or when using HTMLParser.feed().

The attached script demonstrates the problem by parsing bytestring and Unicode documents of varying lengths using HTMLParser and XMLParser. In each case, the target object considers the test a success if it was notified of the start of the <root> tag. Only failures are printed.
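
The attached script is not reproduced in this report, so the following is only a hedged reconstruction of the check it describes. RootSeenTarget, make_document and run_case are hypothetical names, and the layout of the test document (a <root> element padded with filler text) is an assumption. The target methods start(), end(), data() and close() follow lxml's parser target interface.

```python
class RootSeenTarget:
    """Parser target that records whether the start of <root> was reported."""

    def __init__(self):
        self.saw_root = False

    def start(self, tag, attrib):
        if tag == "root":
            self.saw_root = True

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return self.saw_root


def make_document(length, as_unicode=True):
    """Build a <root> document of `length` characters (assumed layout)."""
    filler = "a" * max(0, length - len("<root></root>"))
    text = "<root>" + filler + "</root>"
    return text if as_unicode else text.encode("utf-8")


def run_case(parser_factory, document):
    """Parse `document`; return None on success or a failure description.

    `parser_factory` takes a target and returns an object with feed() and
    close(), e.g. lambda target: etree.XMLParser(target=target).
    """
    target = RootSeenTarget()
    parser = parser_factory(target)
    try:
        parser.feed(document)
        parser.close()
    except Exception as exc:  # the test only cares that something failed
        return "Exception: {}".format(exc)
    return None if target.saw_root else "start of <root> not reported"
```

Looping run_case over lengths such as 1024, 2048, 4096, 8192 and 16384, for both Unicode and bytestring documents and for both parser classes, would give a table of failures like the ones below.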

Here are the results of running the test on Python 2.7.1:

01024 u XMLParser: Exception: internal error, line 1, column 46
02048 u XMLParser: Exception: internal error, line 1, column 46
04096 u XMLParser: Exception: internal error, line 1, column 46
08192 u XMLParser: Exception: Document is empty, line 1, column 1
16384 u XMLParser: Exception: internal error, line 1, column 46

Here are the results on Python 3.2.0:

01024 u XMLParser: Exception: internal error, line 1, column 46
04096 u XMLParser: Exception: internal error, line 1, column 46
16384 u XMLParser: Exception: internal error, line 1, column 46

Note that Python 3 is able to handle the large Unicode strings of length 2048 and 8192 that fail on Python 2; I don't know why.

The script also tests one more odd behavior I discovered, which might help isolate the problem. If I pass a large Unicode string into feed(), and then call feed() again on a very small bytestring, the large Unicode string becomes "unstuck" and hook methods are called on the target object after all.
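
The call sequence for this second test can be sketched as follows. The function name feed_then_nudge is hypothetical, and the content of the small bytestring is not stated in this report, so the single space used as the default here is an assumption.

```python
def feed_then_nudge(parser, text, nudge=b" "):
    """Feed a large Unicode document, then a tiny bytestring, then close.

    `parser` is any object with feed() and close(), such as an
    lxml.etree.XMLParser with a target. The default `nudge` value is an
    assumption; the report only says "a very small bytestring".
    """
    parser.feed(text)   # on affected versions no hook methods fire here
    parser.feed(nudge)  # reported to "unstick" the buffered Unicode data
    return parser.close()
```

This only reproduces the order of calls described above; it says nothing about how fixed lxml releases treat a mix of Unicode and bytestring input.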

Python 2 version info:
Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Python 3 version info:
Python : sys.version_info(major=3, minor=2, micro=0, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Leonard Richardson (leonardr) wrote :

Bug 972466 may be related.

scoder (scoder) wrote :

Thanks for the report, I can reproduce this.

Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → Medium
status: New → Confirmed
Leonard Richardson (leonardr) wrote :

I mentioned earlier that bug 972466 may be related to this one. One of my users just filed bug 1034883, which is very similar to bug 972466, except that the problem is triggered by the 'encoding' attribute of an XML declaration instead of an encoding declared in an HTML <meta> tag.

I don't want to file a separate bug for the encoding problem because I don't have any way to factor out the effects of this bug. I suspect they are symptoms of the same underlying problem. Anyway, I thought you'd like to know; I hope this helps you find the problem.

scoder (scoder) wrote :

Thanks for the reminder. I looked into it and it turns out that this was due to some unexpected behaviour of libxml2. The function that resets the parser context status allows passing in initial data to start parsing right away. However, if you pass in an encoding at the same time, it will parse the data and only set up the encoding afterwards, which seems to have some impact on the buffer management. *Very* unexpected behaviour.

This specifically hits Unicode parsing because lxml parses Unicode strings right from the internal buffer by passing in the underlying platform encoding.

Here is the straightforward workaround:

https://github.com/lxml/lxml/commit/095468bb154c5b76eac19f05c799bcb5d7a7de40

I also think this is worth fixing in 2.3.x at some point.
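
For callers on an affected release, a user-side alternative to the committed fix is to avoid the Unicode path altogether, since the report notes that bytestrings parse correctly. This is only a sketch: feed_as_bytes is a hypothetical helper, not an lxml function, and it assumes the document either has no XML declaration or declares the same encoding that is passed in.

```python
def feed_as_bytes(parser, text, encoding="utf-8"):
    """Feed `text` to `parser` as an encoded bytestring and close the parser.

    `parser` is any object with feed() and close(), such as an
    lxml.etree.XMLParser. Bytestrings are passed through unchanged.
    Assumption: any encoding named in the document's XML declaration
    matches `encoding`; otherwise the parser may misread the bytes.
    """
    if isinstance(text, bytes):
        data = text
    else:
        data = text.encode(encoding)
    parser.feed(data)
    return parser.close()
```

UTF-8 is used as the default because it is the encoding an XML parser assumes when a document carries no declaration and no byte order mark.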

Changed in lxml:
status: Confirmed → Fix Committed
Leonard Richardson (leonardr) wrote :

Great, thanks for looking into this.

scoder (scoder) wrote :

Fixed in lxml 3.0 alpha 2 and 2.3.6.

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder)
Changed in lxml:
milestone: none → 3.0