Erroneous tree parsed
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup |
Invalid
|
Undecided
|
Unassigned |
Bug Description
Using 4.4.1 on Python 3.5.0 on Linux with a late build of lxml, I run this script:
"""
This script outputs:
[<roo/>]
roo
[<root/>]
root
[]
"""
import io
import bs4
doc = """<?xml version="1.0" encoding=
<root />
"""
def report(soup):
elems = soup.findAll()
print(elems)
if elems:
print(
print()
report(
report(
doc2 = """<?xml version="1.0" encoding=
<?OFX OFXHEADER="200" VERSION="200"?>
<root />
"""
report(
---
as you can see, depending on the 'encoding' declaration in the header, the sole root tag might be corrupted. Furthermore, adding a second prolog causes the document to parse as empty.
Parsing these same documents with lxml directly works just fine.
With Python 3.5.1 on OS X, I no longer see the error. I suspect the issue was corrected in Python 3.5.1.