Erroneous tree parsed

Bug #1522241 reported by Jason R. Coombs
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

Using 4.4.1 on Python 3.5.0 on Linux with a late build of lxml, I run this script:

"""
This script outputs:
[<roo/>]
roo

[<root/>]
root

[]

"""

import io
import bs4

doc = """<?xml version="1.0" encoding="us-ascii"?>
<root />
"""

def report(soup):
    # Print every element found, then the name of the first one (if any).
    elems = soup.find_all()
    print(elems)
    if elems:
        print(elems[0].name)
    print()

report(bs4.BeautifulSoup(io.StringIO(doc), 'xml'))
report(bs4.BeautifulSoup(io.StringIO(doc.replace('us-ascii', 'utf-8')), 'xml'))

doc2 = """<?xml version="1.0" encoding="us-ascii"?>
<?OFX OFXHEADER="200" VERSION="200"?>
<root />
"""
report(bs4.BeautifulSoup(io.StringIO(doc2), 'xml'))

---

As you can see, depending on the 'encoding' declaration in the header, the name of the sole root tag can be corrupted (roo instead of root). Furthermore, adding a processing instruction after the XML declaration causes the document to parse as empty.

Parsing these same documents with lxml directly works just fine.
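For comparison, here is a minimal sketch of what a correct parse of the same documents looks like. It is not the check described above: it assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (lxml.etree offers a largely compatible API), and the names DOC, DOC2 and element_names are made up for illustration. The documents are passed as bytes so that the parser reads the encoding declaration itself.

```python
import io
import xml.etree.ElementTree as ET

# The documents from the report, as bytes (hypothetical names DOC and DOC2).
DOC = b'<?xml version="1.0" encoding="us-ascii"?>\n<root />\n'
DOC2 = (
    b'<?xml version="1.0" encoding="us-ascii"?>\n'
    b'<?OFX OFXHEADER="200" VERSION="200"?>\n'
    b'<root />\n'
)


def element_names(data):
    """Return the tag name of every element, in document order."""
    tree = ET.parse(io.BytesIO(data))
    return [elem.tag for elem in tree.iter()]


cases = [
    ("us-ascii", DOC),
    ("utf-8", DOC.replace(b"us-ascii", b"utf-8")),
    ("with OFX instruction", DOC2),
]
for label, data in cases:
    print(label, element_names(data))
```

In all three cases the only element reported should be root, which is the result the Beautiful Soup script above would be expected to match.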

Jason R. Coombs (jaraco) wrote :

With Python 3.5.1 on OS X, I no longer see the error. I suspect the issue was corrected in Python 3.5.1.

Leonard Richardson (leonardr) wrote :

Wow, that's really weird. Since you're reporting that it's been fixed I'm going to close this bug, but I'm not sure about the fix happening in Python. This has symptoms like a lot of lxml bugs I've seen.

Changed in beautifulsoup:
status: New → Invalid