Beautiful Soup

Erroneous tree parsed

Bug #1522241 reported by Jason R. Coombs on 2015-12-03

This bug affects 1 person

Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Using 4.4.1 on Python 3.5.0 on Linux with a late build of lxml, I run this script:

"""
This script outputs:
[<roo/>]
roo

[<root/>]
root

[]

"""

import io
import bs4

doc = """<?xml version="1.0" encoding="us-ascii"?>
<root />
"""

def report(soup):
    elems = soup.findAll()
    print(elems)
    if elems:
        print(elems[0].name)
    print()

report(bs4.BeautifulSoup(io.StringIO(doc), 'xml'))
report(bs4.BeautifulSoup(io.StringIO(doc.replace('us-ascii', 'utf-8')), 'xml'))

doc2 = """<?xml version="1.0" encoding="us-ascii"?>
<?OFX OFXHEADER="200" VERSION="200"?>
<root />
"""
report(bs4.BeautifulSoup(io.StringIO(doc2), 'xml'))

---

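As you can see, depending on the 'encoding' declaration in the header, the sole root tag may be corrupted: with 'us-ascii' its name is truncated from "root" to "roo", while with 'utf-8' it parses correctly. Furthermore, adding a second prolog line (the OFX processing instruction in doc2) causes the document to parse as empty.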

Parsing these same documents with lxml directly works just fine.
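
For readers who want a quick cross-check without Beautiful Soup, here is a minimal sketch that feeds the same three documents to an independent parser. It uses the standard library's xml.etree.ElementTree as a stand-in, not lxml as in the original report, so it only shows what a conforming XML parser should return; it does not reproduce the reporter's lxml run. The helper name root_tag is hypothetical. The documents are encoded to bytes before parsing so the parser itself honours the declared encoding.

```python
import xml.etree.ElementTree as ET

# The two documents from the report, unchanged.
DOC = """<?xml version="1.0" encoding="us-ascii"?>
<root />
"""

DOC2 = """<?xml version="1.0" encoding="us-ascii"?>
<?OFX OFXHEADER="200" VERSION="200"?>
<root />
"""


def root_tag(text, encoding="us-ascii"):
    """Hypothetical helper: swap the declared encoding, parse, return the root tag name."""
    declared = text.replace("us-ascii", encoding)
    # Parse bytes rather than str so the XML declaration is what selects the decoding.
    return ET.fromstring(declared.encode(encoding)).tag


results = [
    root_tag(DOC),            # declared as us-ascii
    root_tag(DOC, "utf-8"),   # declared as utf-8
    root_tag(DOC2),           # us-ascii plus the OFX processing instruction
]
print(results)
```

All three calls should yield the tag name "root", which is the result the script in this report was expected to print for each document.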

Jason R. Coombs (jaraco) wrote on 2016-03-26:

With Python 3.5.1 on OS X, I no longer see the error. I suspect the issue was corrected in Python 3.5.1.

Leonard Richardson (leonardr) wrote on 2016-07-17:

Wow, that's really weird. Since you're reporting that it's been fixed I'm going to close this bug, but I'm not sure about the fix happening in Python. This has symptoms like a lot of lxml bugs I've seen.

Changed in beautifulsoup:
status:	New → Invalid
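
Since the thread leaves open whether the behaviour changed because of Python or because of lxml, recording the exact versions in both environments would help narrow it down. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata) and assuming the packages are installed under the distribution names "beautifulsoup4" and "lxml"; the function name component_versions is hypothetical.

```python
import platform
import sys
from importlib import metadata


def component_versions(packages=("beautifulsoup4", "lxml")):
    """Collect interpreter and package versions to paste into a bug report."""
    versions = {
        "python": platform.python_version(),
        "platform": sys.platform,
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing, so the output stays comparable.
            versions[name] = "not installed"
    return versions


for name, version in component_versions().items():
    print(f"{name}: {version}")
```

Running this in the failing and the working environment and comparing the output line by line shows which component actually differs between them.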
