Can't parse an <area> tag with contents using any parser

Bug #1928742 reported by Mikhail Yudin
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

The <area> tag is closed early:

Python 3.9.4 (default, Apr 20 2021, 15:51:38)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<area><test>123</test></area>','lxml')
>>> soup.prettify()
'<html>\n <body>\n <area/>\n <test>\n 123\n </test>\n </body>\n</html>'
>>>

Same with html.parser and html5lib.
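
For example, a minimal sketch of the same check with the other two parsers (html5lib has to be installed separately):

from bs4 import BeautifulSoup
markup = '<area><test>123</test></area>'
# Both parsers close <area> immediately, so <test> ends up as a sibling of <area>.
for parser in ('html.parser', 'html5lib'):
    print(parser, BeautifulSoup(markup, parser))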

Mikhail Yudin (fagci) wrote:

Additional information:

$ pip install beautifulsoup4 --upgrade
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: beautifulsoup4 in /usr/lib/python3.9/site-packages (4.9.3)
Requirement already satisfied: soupsieve>1.2 in /usr/lib/python3.9/site-packages (from beautifulsoup4) (2.2.1)

Leonard Richardson (leonardr) wrote:

The behavior you're seeing is by design. The HTML <area> tag is an empty-element tag -- it has a content model of "nothing" (https://html.spec.whatwg.org/multipage/image-maps.html#the-area-element) so it can't contain any other tags.

You'll get similar behavior if you use lxml or html5lib alone, without Beautiful Soup:

markup = '<area><test>123</test></area>'
from lxml import etree
root = etree.HTML(markup)
print(etree.tostring(root))
# <html><body><p>'<area/><test>123</test>'</p></body></html>

import html5lib
element = html5lib.parse(markup)
walker = html5lib.getTreeWalker("etree")
stream = walker(element)
s = html5lib.serializer.HTMLSerializer()
print("".join([x for x in s.serialize(stream)]))
# <div><area><test>123</test></div>

There are two main ways around this.

First, you can parse the markup as XML instead of HTML. <area> is only an empty-element tag by the rules of the HTML5 spec. By default, Beautiful Soup's XML parser does not think any tags are empty-element tags.

from bs4 import BeautifulSoup
print(BeautifulSoup(markup, 'xml'))
# <area><test>123</test></area>
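
(Note that Beautiful Soup's 'xml' parser requires lxml to be installed.) As a quick check that the nesting survives and can be navigated -- just a sketch using the same markup:

soup = BeautifulSoup(markup, 'xml')
print(soup.area.test.string)
# 123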

If you want to use most of the rules of HTML, but relax this one, you can subclass one of the TreeBuilder classes and override its "empty_element_tags" attribute to stop <area> from being considered an empty-element tag:

from bs4.builder import HTMLParserTreeBuilder
class MyHTMLParserTreeBuilder(HTMLParserTreeBuilder):
    empty_element_tags = HTMLParserTreeBuilder.empty_element_tags - set(["area"])
print(BeautifulSoup(markup, builder=MyHTMLParserTreeBuilder()))
# <area><test>123</test></area>
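
A quick check (again just a sketch) that the override took effect and <test> really stays nested inside <area>:

soup = BeautifulSoup(markup, builder=MyHTMLParserTreeBuilder())
print(soup.area.test.string)
# 123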

Changed in beautifulsoup:
status: New → Invalid
Mikhail Yudin (fagci) wrote:

I think lxml parses XML strictly. But thank you, I'm using the xml parser now and everything works.
