Failure to parse XML file with reversed UTF-8 BOM

Bug #1819138 reported by Anil Prasad
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

Hi Team,

Below is my input file:

<?xml version="1.0" encoding="UTF-8"?>
<book>
<title>This is test file</title>
</book>

When I try to parse this file, I get the error 'lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1'.

The file starts with a BOM. It is not shown when I open the file in Notepad or another editor.

I can work around the issue by removing the BOM from the file; after that, lxml parses the file and I can continue with the rest of my processing.

The problem is that removing the BOM and saving the file changes the file's modification date.

I want to know how to parse this file without removing the BOM.

Please handle this kind of situation and resolve the issue as soon as possible.

Python : sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (4, 2, 5, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

Regards,
Anil Prasad

scoder (scoder) wrote :

Could it be that this is an issue with your input file? Try opening it with a hex editor to see if the first bytes are really the UTF-8 BOM: 0xEF, 0xBB, 0xBF. If so, please attach the file instead of copying it into the text.
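
If no hex editor is at hand, the same check can be done from Python. The following is only a minimal sketch: the helper name `describe_leading_bytes` is made up for illustration, and it relies on nothing beyond the standard `codecs.BOM_UTF8` constant.

```python
import codecs


def describe_leading_bytes(data: bytes) -> str:
    """Classify how the raw bytes of an XML document begin."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return "utf-8 BOM"
    if data.startswith(b"<"):
        return "no BOM"
    # Anything else in front of the first "<" is suspicious; show it in hex.
    return "unexpected bytes: " + " ".join(f"{b:02x}" for b in data[:4])


# Example inputs: a correct UTF-8 BOM, and a file that starts with other bytes.
print(describe_leading_bytes(b"\xef\xbb\xbf<book/>"))
print(describe_leading_bytes(b"\xbb\xef<\xbf?xml"))
```

For a file on disk, pass in the result of `open(path, "rb").read(4)`, where `path` stands for your own file. Reading in binary mode does not modify the file.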

scoder (scoder) wrote :
Changed in lxml:
status: New → Invalid
Anil Prasad (anilkumarg4) wrote :

I have attached the xml file for your reference.

scoder (scoder) wrote :

The BOM in that file has the wrong byte order. Instead of 0xEF, 0xBB, 0xBF followed by the "<", it starts with 0xBB, 0xEF, "<", 0xBF. This is invalid. See FAQ no. 5 in https://unicode.org/faq/utf_bom.html#bom5
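
For anyone who needs to read such a file without touching it on disk, one possible workaround is to repair the bytes in memory before parsing. This is only a sketch, not something lxml does or endorses: the helper `repair_scrambled_bom` is hypothetical, it handles only the exact byte sequence described above, and the sample document is assumed from the bug description. The standard library parser is used so the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Byte sequence reported in this bug: the BOM bytes scrambled around "<".
SCRAMBLED_PREFIX = b"\xbb\xef<\xbf"


def repair_scrambled_bom(data: bytes) -> bytes:
    """Return data with the scrambled prefix replaced by a plain "<".

    Input that does not start with that exact prefix is returned unchanged.
    """
    if data.startswith(SCRAMBLED_PREFIX):
        return b"<" + data[len(SCRAMBLED_PREFIX):]
    return data


# Assumed file content: the scrambled prefix followed by the rest of the
# document quoted in the bug description.
raw = SCRAMBLED_PREFIX + (
    b'?xml version="1.0" encoding="UTF-8"?>\n'
    b"<book>\n<title>This is test file</title>\n</book>\n"
)

root = ET.fromstring(repair_scrambled_bom(raw))
print(root.find("title").text)
```

The repaired bytes can be passed to `lxml.etree.fromstring` in the same way. Because the file is only read (in binary mode) and never written, its modification date stays as it was.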

summary: - BOM Issue
+ Failure to parse XML file with reversed UTF-8 BOM