Can't parse an <area> tag with contents using any parser

Bug #1928742 reported by Mikhail Yudin
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

The <area> tag is closed early:

Python 3.9.4 (default, Apr 20 2021, 15:51:38)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<area><test>123</test></area>','lxml')
>>> soup.prettify()
'<html>\n <body>\n <area/>\n <test>\n 123\n </test>\n </body>\n</html>'
>>>

Same with html.parser and html5lib.
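
For example, a minimal sketch of the same check with the other two parsers (html5lib has to be installed separately):

from bs4 import BeautifulSoup
markup = '<area><test>123</test></area>'
# Both parsers close <area> immediately, so <test> ends up as a sibling of <area>.
for parser in ('html.parser', 'html5lib'):
    print(parser, BeautifulSoup(markup, parser))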

Mikhail Yudin (fagci) wrote:

Additional information:

$ pip install beautifulsoup4 --upgrade
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: beautifulsoup4 in /usr/lib/python3.9/site-packages (4.9.3)
Requirement already satisfied: soupsieve>1.2 in /usr/lib/python3.9/site-packages (from beautifulsoup4) (2.2.1)

Leonard Richardson (leonardr) wrote:

The behavior you're seeing is by design. The HTML <area> tag is an empty-element tag -- it has a content model of "nothing" (https://html.spec.whatwg.org/multipage/image-maps.html#the-area-element) so it can't contain any other tags.

You'll get similar behavior if you use lxml or html5lib alone, without Beautiful Soup:

markup = '<area><test>123</test></area>'
from lxml import etree
root = etree.HTML(markup)
print(etree.tostring(root))
# <html><body><p>'<area/><test>123</test>'</p></body></html>

import html5lib
element = html5lib.parse(markup)
walker = html5lib.getTreeWalker("etree")
stream = walker(element)
s = html5lib.serializer.HTMLSerializer()
print("".join([x for x in s.serialize(stream)]))
# <div><area><test>123</test></div>

There are two main ways around this.

First, you can parse the markup as XML instead of HTML. <area> is only an empty-element tag by the rules of the HTML5 spec. By default, Beautiful Soup's XML parser does not think any tags are empty-element tags.

from bs4 import BeautifulSoup
print(BeautifulSoup(markup, 'xml'))
# <area><test>123</test></area>
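
(Note that Beautiful Soup's 'xml' parser requires lxml to be installed.) As a quick check that the nesting survives and can be navigated -- just a sketch using the same markup:

soup = BeautifulSoup(markup, 'xml')
print(soup.area.test.string)
# 123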

If you want to use most of the rules of HTML, but relax this one, you can subclass one of the TreeBuilder classes and override its "empty_element_tags" attribute to stop <area> from being considered an empty-element tag:

from bs4.builder import HTMLParserTreeBuilder
class MyHTMLParserTreeBuilder(HTMLParserTreeBuilder):
    empty_element_tags = HTMLParserTreeBuilder.empty_element_tags - set(["area"])
print(BeautifulSoup(markup, builder=MyHTMLParserTreeBuilder()))
# <area><test>123</test></area>
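
A quick check (again just a sketch) that the override took effect and <test> really stays nested inside <area>:

soup = BeautifulSoup(markup, builder=MyHTMLParserTreeBuilder())
print(soup.area.test.string)
# 123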

Changed in beautifulsoup:
status: New → Invalid
Mikhail Yudin (fagci) wrote:

I think lxml parses XML strictly. But thank you, I'm using the xml parser now and everything works.
