Parser change tag name

Bug #1719051 reported by Xiaomeng Yi
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

from bs4 import BeautifulSoup

content = """
<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
    <links total-matched="32" records-returned="10" page-number="1">
        <link>
            <advertiser-id>4076189</advertiser-id>
        </link>
    </links>
</cj-api>
"""

root = BeautifulSoup(content, "html.parser")
print(root)

output:
------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
<links page-number="1" records-returned="10" total-matched="32">
<link/>
<advertiser-id>4076189</advertiser-id>
</links>
</cj-api>
------------------------------------------

Why is <link> changed to <link/>, and why does </link> disappear?
I tested on both macOS and Ubuntu, with html.parser and with lxml.
The results are the same.

Leonard Richardson (leonardr) wrote :

The parsers you tried are all HTML parsers, and each one treated the <link> tag as an HTML <link> tag. In HTML, <link> is a void element, so it is not allowed to contain subtags.
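
One way to confirm this behaviour is to inspect the parsed tree rather than the printed output. The following is a minimal sketch, assuming bs4 is installed; the markup is a reduced version of the one in the report, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Reduced version of the reported markup, keeping only the relevant tags.
content = "<links><link><advertiser-id>4076189</advertiser-id></link></links>"

soup = BeautifulSoup(content, "html.parser")
link = soup.find("link")
advertiser = soup.find("advertiser-id")

# The HTML parser closes <link> immediately, so it ends up with no children...
print(link.contents)           # []
# ...and <advertiser-id> is attached to <links> instead.
print(advertiser.parent.name)  # links
```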

The markup you have is XML. Parsing it as XML preserves the <advertiser-id> tag within the <link> tag.

Script:

content = """<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
    <links total-matched="32" records-returned="10" page-number="1">
        <link>
            <advertiser-id>4076189</advertiser-id>
        </link>
    </links>
</cj-api>
"""
from bs4 import BeautifulSoup
# The "xml" feature uses lxml's XML parser, so lxml must be installed.
print(BeautifulSoup(content, "xml"))

output:

<?xml version="1.0" encoding="utf-8"?>
<cj-api>
<links page-number="1" records-returned="10" total-matched="32">
<link>
<advertiser-id>4076189</advertiser-id>
</link>
</links>
</cj-api>
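
With the XML parser, the nested value can then be read directly from each <link>. This is a sketch under the assumption that both bs4 and lxml are installed; the loop and variable names are illustrative and not part of the original report.

```python
from bs4 import BeautifulSoup

content = """<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
    <links total-matched="32" records-returned="10" page-number="1">
        <link>
            <advertiser-id>4076189</advertiser-id>
        </link>
    </links>
</cj-api>
"""

# "xml" selects lxml's XML parser, which keeps <link> as an ordinary element.
soup = BeautifulSoup(content, "xml")

# Collect the advertiser id found inside each <link>.
advertiser_ids = []
for link in soup.find_all("link"):
    advertiser = link.find("advertiser-id")
    if advertiser is not None:
        advertiser_ids.append(advertiser.get_text(strip=True))

print(advertiser_ids)
```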

Changed in beautifulsoup:
status: New → Invalid