lxml

XMLParser declines to parse Unicode string that begins with BYTE ORDER MARK

Bug #1948551 reported by Leonard Richardson on 2021-10-23

Affects   Status   Importance   Assigned to   Milestone
lxml      New      Undecided    Unassigned

Bug Description

If I feed an XMLParser a Unicode string that begins with the BYTE ORDER MARK character (U+FEFF), XMLParser does not call any of the target object's hook methods; the attached test script reports a "Document is empty" exception instead. With an equivalent bytestring, or a Unicode string that doesn't begin with BYTE ORDER MARK, I get the behavior I'd expect.
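
For readers without the attachment, here is a minimal sketch of this kind of test. It is not the attached script itself: the sample document, the `RecordingTarget` class, and the `build_inputs` and `run_matrix` helpers are hypothetical names chosen for illustration. It assumes lxml is installed and uses the parser's `target=` argument together with `feed()` and `close()`.

```python
import codecs

BOM = "\ufeff"                # U+FEFF BYTE ORDER MARK
SAMPLE = "<root>text</root>"  # hypothetical sample document, not from the report


class RecordingTarget:
    """Parser target that records every hook call it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        # Whatever the target returns here is returned by parser.close().
        return self.events


def build_inputs(sample=SAMPLE):
    """Return the four inputs: bytes/Unicode, each with and without a BOM."""
    encoded = sample.encode("utf-8")
    return {
        "Bytes without BOM": encoded,
        "Bytes with BOM": codecs.BOM_UTF8 + encoded,
        "Unicode without BOM": sample,
        "Unicode with BOM": BOM + sample,
    }


def run_matrix():
    """Feed every input to HTMLParser and XMLParser and print the outcome."""
    from lxml import etree  # imported here so the helpers above need only the stdlib

    for label, data in build_inputs().items():
        for parser_cls in (etree.HTMLParser, etree.XMLParser):
            parser = parser_cls(target=RecordingTarget())
            try:
                parser.feed(data)
                events = parser.close()
                outcome = "Success" if events else "No hook methods called"
            except etree.LxmlError as exc:
                outcome = f"Exception: {exc}"
            print(f"{label}, parser={parser_cls}: {outcome}")
```

Calling `run_matrix()` prints one line per input and parser combination. Going by the output in this report, the "Unicode with BOM" and XMLParser combination is the one expected to fail on lxml 4.6.3; other versions may behave differently.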

This ticket comes from bug #1947768, which was filed against my Beautiful Soup project. lxml's behavior here is similar to bug #963936 (hook methods are called for a bytestring but not a Unicode string), but not exactly the same. This is also similar to bug #1463610, but I'm pretty sure it's not the same as that bug.

Output of the attached test script:

Bytes without BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Bytes with BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Bytes without BOM, parser=<class 'lxml.etree.XMLParser'>: Success
Bytes with BOM, parser=<class 'lxml.etree.XMLParser'>: Success
Unicode without BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Unicode with BOM, parser=<class 'lxml.etree.HTMLParser'>: Success
Unicode without BOM, parser=<class 'lxml.etree.XMLParser'>: Success
Unicode with BOM, parser=<class 'lxml.etree.XMLParser'>: Exception: Document is empty, line 1, column 1 (<string>, line 1)

Python : sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)