lxml

Failure to parse XML file with reversed UTF-8 BOM

Bug #1819138 reported by Anil Prasad on 2019-03-08

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

Hi Team,

Below is my input file:

ï»¿<?xml version="1.0" encoding="UTF-8"?>
<book>
<title>This is test file</title>
</book>

When I try to parse this file, I get the error 'lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1'.

The text at the start of the file is a BOM. It is not shown when I open the file in Notepad or another editor.

I can work around the problem by removing the BOM from the file, then parsing it with lxml and doing the rest of my processing.

The problem is that replacing this text and saving the file changes the file's modification date.

I want to know how to parse this file without removing the BOM.

Please handle this kind of situation and fix the issue as soon as possible.

Python : sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (4, 2, 5, 0)
libxml used : (2, 9, 8)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

Regards,
Anil Prasad

scoder (scoder) wrote on 2019-03-08:

Could it be that this is an issue with your input file? Try opening it with a hex editor to see if the first bytes are really the UTF-8 BOM: 0xEF, 0xBB, 0xBF. If so, please attach the file instead of copying it into the text.
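
For readers without a hex editor at hand, the same check can be sketched in a few lines of Python. The helper names below (starts_with_utf8_bom, leading_bytes_hex) are made up for this illustration and are not part of lxml; the sample bytes are constructed in memory rather than taken from the reporter's file.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 BOM (0xEF, 0xBB, 0xBF)."""
    return data.startswith(codecs.BOM_UTF8)


def leading_bytes_hex(data: bytes, count: int = 4) -> str:
    """Format the first `count` bytes the way a hex editor would show them."""
    return " ".join(f"0x{b:02X}" for b in data[:count])


# Hypothetical sample: a well-formed UTF-8 BOM followed by an XML declaration.
good = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><book/>'

print(leading_bytes_hex(good))     # 0xEF 0xBB 0xBF 0x3C
print(starts_with_utf8_bom(good))  # True
```

For a file on disk, the bytes would come from open(path, "rb").read(), which only reads the file and leaves its modification date alone.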

scoder (scoder) wrote on 2019-03-09:

Test works, cannot reproduce.
https://github.com/lxml/lxml/blob/fd81ebb9269e5955eca8d4e9668b1a1daf9e00c0/src/lxml/tests/test_elementtree.py#L3255-L3262

Changed in lxml:
status:	New → Invalid

Anil Prasad (anilkumarg4) wrote on 2019-03-10:

test.xml (92 bytes, application/xml)

I have attached the xml file for your reference.

scoder (scoder) wrote on 2019-03-12:

The BOM in that file has the wrong byte order. Instead of 0xEF, 0xBB, 0xBF followed by "<", the file starts with 0xBB, 0xEF, "<", 0xBF. This is invalid. See FAQ no. 5 at https://unicode.org/faq/utf_bom.html#bom5
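
As for the original question of parsing without touching the file on disk, one possible workaround is to read the bytes, repair them in memory, and hand the result to the parser. The sketch below is not an lxml feature and rests on an assumption this thread does not confirm: that only the four leading bytes are scrambled and the rest of the file is intact. The names SCRAMBLED_PREFIX and repair are hypothetical, and the sample is rebuilt from the description above, not from the attached test.xml.

```python
try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Fall back to the standard library so the sketch still runs without lxml.
    import xml.etree.ElementTree as etree

# The leading bytes described above: 0xBB, 0xEF, "<", 0xBF.
SCRAMBLED_PREFIX = b"\xbb\xef<\xbf"


def repair(data: bytes) -> bytes:
    """Replace the scrambled prefix with a plain '<'; return other input unchanged.

    Assumption: everything after the first four bytes is valid XML.
    """
    if data.startswith(SCRAMBLED_PREFIX):
        return b"<" + data[len(SCRAMBLED_PREFIX):]
    return data


# Hypothetical in-memory stand-in for the broken file.
sample = SCRAMBLED_PREFIX + (
    b'?xml version="1.0" encoding="UTF-8"?>\n'
    b"<book>\n<title>This is test file</title>\n</book>\n"
)

root = etree.fromstring(repair(sample))
print(root.tag)                # book
print(root.findtext("title"))  # This is test file
```

With a real file, the input would be read with open(path, "rb"), so nothing is written back and the modification date stays as it was. If more than the first four bytes turn out to be damaged, this repair is not sufficient and the file needs to be regenerated by whatever produced it.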

summary:

- BOM Issue
+ Failure to parse XML file with reversed UTF-8 BOM
