Beautiful Soup

Parser changes tag name

Bug #1719051 reported by Xiaomeng Yi on 2017-09-23

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

from bs4 import BeautifulSoup

content = """
<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
    <links total-matched="32" records-returned="10" page-number="1">
        <link>
            <advertiser-id>4076189</advertiser-id>
        </link>
    </links>
</cj-api>
"""

root = BeautifulSoup(content, "html.parser")
print(root)

output:
------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
<links page-number="1" records-returned="10" total-matched="32">
<link/>
<advertiser-id>4076189</advertiser-id>
</links>
</cj-api>
------------------------------------------

Why is <link> changed to <link/>, and why does </link> disappear?
I tested on both Mac and Ubuntu, with html.parser and lxml.
The results are the same.

Leonard Richardson (leonardr) wrote on 2018-07-14:

The parsers you tried are all HTML parsers, and each one treated the <link> tag as an HTML link tag, which is not allowed to contain subtags. That is why it was closed immediately as <link/> and the later </link> was dropped.

The markup you have is XML. Parsing it as XML preserves the <advertiser-id> tag within the <link> tag.

Script:

content = """<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
    <links total-matched="32" records-returned="10" page-number="1">
        <link>
            <advertiser-id>4076189</advertiser-id>
        </link>
    </links>
</cj-api>
"""
from bs4 import BeautifulSoup
print(BeautifulSoup(content, "xml"))

output:

<?xml version="1.0" encoding="utf-8"?>
<cj-api>
<links page-number="1" records-returned="10" total-matched="32">
<link>
<advertiser-id>4076189</advertiser-id>
</link>
</links>
</cj-api>
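
The difference between the two parsers can also be checked programmatically instead of by reading the printed tree. The following is only a minimal sketch, assuming bs4 is installed and, for the "xml" parser, lxml as well. The helper name parent_of_advertiser_id is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

content = """<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
    <links total-matched="32" records-returned="10" page-number="1">
        <link>
            <advertiser-id>4076189</advertiser-id>
        </link>
    </links>
</cj-api>
"""


def parent_of_advertiser_id(markup, parser):
    """Return the name of the tag that directly contains <advertiser-id>."""
    soup = BeautifulSoup(markup, parser)
    return soup.find("advertiser-id").parent.name


for parser in ("html.parser", "xml"):
    try:
        print(parser, "->", parent_of_advertiser_id(content, parser))
    except FeatureNotFound:
        # The "xml" parser is provided by lxml; skip it if lxml is missing.
        print(parser, "-> parser not available")
```

With html.parser the parent should be reported as links, because <link> was closed before <advertiser-id> was read. With xml it should be link, matching the output above.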

Changed in beautifulsoup:
status:	New → Invalid
