When html.parser gives up on parsing, Beautiful Soup doesn't handle it correctly

Bug #2007343 reported by adashwy
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

The crash stack information is as follows:

[1676443521] === Uncaught Python exception: ===
[1676443521] NotImplementedError: subclasses of ParserBase must override error()
[1676443521] Traceback (most recent call last):
[1676443521]   File "/home/server1/adashwy/DriverCollections/exp_drivers/pyfuzzgen_drivers/bs4/beautifulsoup_driver/beautifulsoup_driver.py", line 40, in TestOneInput
[1676443521]     instance = BeautifulSoup(remaining_data, features=parsers[idx])
[1676443521]   File "/home/server1/.local/lib/python3.8/site-packages/bs4/__init__.py", line 333, in __init__
[1676443521]     self._feed()
[1676443521]   File "/home/server1/.local/lib/python3.8/site-packages/bs4/__init__.py", line 452, in _feed
[1676443521]     self.builder.feed(self.markup)
[1676443521]   File "/home/server1/.local/lib/python3.8/site-packages/bs4/builder/_htmlparser.py", line 362, in feed
[1676443521]     parser.feed(markup)
[1676443521]   File "/usr/lib/python3.8/html/parser.py", line 111, in feed
[1676443521]     self.goahead(0)
[1676443521]   File "/usr/lib/python3.8/html/parser.py", line 179, in goahead
[1676443521]     k = self.parse_html_declaration(i)
[1676443521]   File "/usr/lib/python3.8/html/parser.py", line 264, in parse_html_declaration
[1676443521]     return self.parse_marked_section(i)
[1676443521]   File "/usr/lib/python3.8/_markupbase.py", line 159, in parse_marked_section
[1676443521]     self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
[1676443521]   File "/usr/lib/python3.8/_markupbase.py", line 33, in error
[1676443521]     raise NotImplementedError(
[1676443521] NotImplementedError: subclasses of ParserBase must override error()

Bug summary:
This happened in version 4.11.2.
Beautiful Soup's html.parser builder subclasses HTMLParser but does not override error(), so the call in _markupbase.py falls through to ParserBase.error(), which raises NotImplementedError.

Leonard Richardson (leonardr) wrote :

Thanks for filing this issue. This was my mistake: I was removing code that was only needed in versions of Python that Beautiful Soup doesn't support. HTMLParser.error() was removed in Python 3.5, but its superclass *also* implements the same method, and that method wasn't removed until Python 3.10 (see https://github.com/python/cpython/issues/76025).

However, it probably doesn't make a difference, because in every case I've seen, suppressing these errors just means that the bad markup crashes Python's html.parser immediately after the error() call returns, due to either https://github.com/python/cpython/issues/81928 or https://github.com/python/cpython/issues/78661. Every fuzzer issue I've seen so far turned out to be one of those two CPython issues.

I'll replace the error() implementation in the next version, and I'll probably have it raise ParserRejectedMarkup instead of suppressing the error, so that the cause of the problem is more clear.
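
For readers following along, the approach described above might look roughly like the sketch below. It is not the code from the actual fix: StrictHTMLParser and parse() are hypothetical names, and ParserRejectedMarkup is redefined locally as a stand-in for the bs4 exception so the example needs only the standard library. It also assumes that on Python 3.10 and later, where ParserBase.error() no longer exists, html.parser signals the same kind of failure with AssertionError.

```python
from html.parser import HTMLParser


class ParserRejectedMarkup(Exception):
    """Local stand-in for the bs4 exception of the same name (assumption)."""


class StrictHTMLParser(HTMLParser):
    """Hypothetical subclass; not Beautiful Soup's real parser class."""

    def error(self, message):
        # Before Python 3.10, _markupbase.ParserBase calls self.error() when
        # it gives up on a declaration. Overriding it here replaces the
        # NotImplementedError raised by the base class.
        raise ParserRejectedMarkup(message)


def parse(markup):
    """Feed markup to the parser, reporting any fatal problem one way."""
    parser = StrictHTMLParser()
    try:
        parser.feed(markup)
        parser.close()
    except AssertionError as e:
        # Assumed behavior on Python 3.10+: the parser raises AssertionError
        # directly instead of calling error().
        raise ParserRejectedMarkup(e) from e
```

With this shape, a caller only has to catch one exception type, whichever Python version is running.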

Leonard Richardson (leonardr) wrote :

Revision e0bbee7 consistently raises ParserRejectedMarkup when this kind of problem happens, regardless of the Python version.

Changed in beautifulsoup:
status: New → Fix Committed
summary: - No overriding parent class method in version 4.11.2
+ When html.parser gives up on parsing, Beautiful Soup doesn't handle it correctly
Changed in beautifulsoup:
status: Fix Committed → Fix Released