XMLParser declines to parse Unicode string that begins with BYTE ORDER MARK
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
If I feed an XMLParser a Unicode string that begins with the BYTE ORDER MARK character, XMLParser does not call any of the target object's hook methods. With an equivalent bytestring, or a Unicode string that doesn't begin with BYTE ORDER MARK, I get the behavior I'd expect.
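The attached test script is not reproduced on this page, so the following is only a minimal sketch of how the four cases described above could be set up, not the original script. The names `RecordingTarget`, `make_inputs`, and `recorded_events` are hypothetical, and the sample markup is an assumption. It uses lxml's parser target interface (`XMLParser(target=...)` with `feed()` and `close()`).

```python
BOM = "\ufeff"  # U+FEFF BYTE ORDER MARK


class RecordingTarget:
    """Hypothetical parser target that records which hook methods are called."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        self.events.append(("data", text))

    def close(self):
        return self.events


def make_inputs(markup="<root>text</root>"):
    """Build the four variants: bytes and str, each with and without a BOM."""
    return {
        "Bytes without BOM": markup.encode("utf-8"),
        "Bytes with BOM": (BOM + markup).encode("utf-8"),
        "Unicode without BOM": markup,
        "Unicode with BOM": BOM + markup,
    }


def recorded_events(data):
    """Feed data to an lxml XMLParser and return the hook calls that were made."""
    from lxml import etree  # third-party dependency, imported lazily

    target = RecordingTarget()
    parser = etree.XMLParser(target=target)
    try:
        parser.feed(data)
        parser.close()
    except etree.XMLSyntaxError:
        # Assumption: a parse error may be raised in some cases; keep
        # whatever events were recorded before it.
        pass
    return target.events
```

Calling `recorded_events(data)` on each value returned by `make_inputs()` and comparing the recorded events should show whether the "Unicode with BOM" case differs from the other three. Since the report says the equivalent bytestring behaves as expected, passing `data.encode("utf-8")` instead of the Unicode string is one way to check that side of the comparison.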
This ticket comes from bug #1947768, which was filed against my Beautiful Soup project. lxml's behavior here is similar to bug #963936 (hook methods are called for a bytestring but not a Unicode string), but not exactly the same. It is also similar to bug #1463610, but I'm pretty sure it's not the same bug.
Output of the attached test script:
Bytes without BOM, parser=<class 'lxml.etree.
Bytes with BOM, parser=<class 'lxml.etree.
Bytes without BOM, parser=<class 'lxml.etree.
Bytes with BOM, parser=<class 'lxml.etree.
Unicode without BOM, parser=<class 'lxml.etree.
Unicode with BOM, parser=<class 'lxml.etree.
Unicode without BOM, parser=<class 'lxml.etree.
Unicode with BOM, parser=<class 'lxml.etree.
Python : sys.version_
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)