Parser change tag name
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
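from bs4 import BeautifulSoup
content = """
<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
<links total-matched="32" records-returned="10" page-number="1">
<link>
<advertiser-id>4076189</advertiser-id>
</link>
</links>
</cj-api>
"""
# The parser argument is cut off in the report; "html.parser" is assumed (one of the parsers tried).
root = BeautifulSoup(content, "html.parser")
print(root)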
output:
-------
<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
<links page-number="1" records-returned="10" total-matched="32">
<link/>
<advertiser-id>4076189</advertiser-id>
</links>
</cj-api>
-------
Why is <link> changed to <link/>, and why does </link> disappear?
I tested on both Mac and Ubuntu, with html.parser and lxml.
The results are the same.
All three of the parsers you tried are HTML parsers, and they all treated the <link> tag as an HTML <link> tag, which is a void element and is not allowed to contain subtags.
The markup you have is XML. Parsing it as XML preserves the <advertiser-id> tag within the <link> tag.
Script:
content = """<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
<links total-matched="32" records-returned="10" page-number="1">
<link>
<advertiser-id>4076189</advertiser-id>
</link>
</links>
</cj-api>
"""
from bs4 import BeautifulSoup
print(BeautifulSoup(content, "xml"))
output:
<?xml version="1.0" encoding="utf-8"?>
<cj-api>
<links page-number="1" records-returned="10" total-matched="32">
<link>
<advertiser-id>4076189</advertiser-id>
</link>
</links>
</cj-api>
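To check the difference programmatically rather than by reading the printed tree, here is a minimal sketch that parses the same markup with both parsers and looks for <advertiser-id> tags nested inside <link>. The helper name nested_advertiser_ids is made up for illustration and is not part of Beautiful Soup; the "xml" parser assumes lxml is installed.

```python
from bs4 import BeautifulSoup

content = """<?xml version="1.0" encoding="UTF-8"?>
<cj-api>
<links total-matched="32" records-returned="10" page-number="1">
<link>
<advertiser-id>4076189</advertiser-id>
</link>
</links>
</cj-api>
"""


def nested_advertiser_ids(markup, parser):
    """Hypothetical helper: text of every <advertiser-id> found inside a <link>."""
    soup = BeautifulSoup(markup, parser)
    ids = []
    for link in soup.find_all("link"):
        # With an HTML parser, <link> is closed immediately, so it has no children.
        for tag in link.find_all("advertiser-id"):
            ids.append(tag.get_text())
    return ids


print(nested_advertiser_ids(content, "html.parser"))  # []
print(nested_advertiser_ids(content, "xml"))          # ['4076189']
```

With the HTML parser, <advertiser-id> ends up as a sibling of <link/>, so the nested lookup finds nothing. With the XML parser the nesting from the original markup is kept.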