XMLParser declines to parse Unicode string that begins with BYTE ORDER MARK

Bug #1948551 reported by Leonard Richardson
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

If I feed an XMLParser a Unicode string that begins with the BYTE ORDER MARK character, XMLParser does not call any of the target object's hook methods. With an equivalent bytestring, or a Unicode string that doesn't begin with BYTE ORDER MARK, I get the behavior I'd expect.
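A minimal sketch of how the four input variants and a hook-recording target could be set up. This is not the attached test script: `RecordingTarget`, `build_inputs`, `try_parse`, and the sample markup are hypothetical names chosen for illustration. It assumes lxml's feed interface (`feed()` then `close()`) with a `target=` object, and it imports lxml lazily so the helpers themselves need only the standard library.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, BYTE ORDER MARK
MARKUP = "<root>text</root>"  # hypothetical sample document


class RecordingTarget:
    """Parser target that records which hook methods were called."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        return self.events


def build_inputs(markup=MARKUP):
    """Return bytes and str variants of the markup, with and without a BOM."""
    encoded = markup.encode("utf-8")
    return {
        "Bytes without BOM": encoded,
        "Bytes with BOM": codecs.BOM_UTF8 + encoded,
        "Unicode without BOM": markup,
        "Unicode with BOM": BOM + markup,
    }


def try_parse(parser_class, markup):
    """Feed markup to a target parser and report whether any hooks fired."""
    target = RecordingTarget()
    parser = parser_class(target=target)
    try:
        parser.feed(markup)
        events = parser.close()  # assumed to return target.close()
    except Exception as exc:
        return f"Exception: {exc}"
    return "Success" if events else "No hook methods called"


def main():
    try:
        from lxml import etree  # third-party dependency
    except ImportError:
        print("lxml is not installed")
        return
    for label, markup in build_inputs().items():
        for parser_class in (etree.HTMLParser, etree.XMLParser):
            print(f"{label}, parser={parser_class}: {try_parse(parser_class, markup)}")


if __name__ == "__main__":
    main()
```

Running `main()` prints one line per input and parser combination; the result for each combination depends on the installed lxml and libxml2 versions.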

This ticket comes from bug #1947768, which was filed against my Beautiful Soup project. lxml's behavior here is similar to bug #963936 (hook methods are called for a bytestring but not for a Unicode string), but not exactly the same. It is also similar to bug #1463610, but I'm pretty sure it's not the same as that bug.

Output of the attached test script:

Bytes without BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Bytes with BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Bytes without BOM, parser=<class 'lxml.etree.XMLParser'>: Success
Bytes with BOM, parser=<class 'lxml.etree.XMLParser'>: Success
Unicode without BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Unicode with BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Unicode without BOM, parser=<class 'lxml.etree.XMLParser'>: Success
Unicode with BOM, parser=<class 'lxml.etree.XMLParser'>: Exception: Document is empty, line 1, column 1 (<string>, line 1)

Python : sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Revision history for this message
Leonard Richardson (leonardr) wrote :
summary: - XMLParser declines to parse Unicode string that begins with BOM
+ XMLParser declines to parse Unicode string that begins with BYTE ORDER MARK