When html.parser gives up on parsing, Beautiful Soup doesn't handle it correctly
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
The crash stack information is as follows:
[1676443521] === Uncaught Python exception: ===
[1676443521] NotImplementedE
[1676443521] Traceback (most recent call last):
[1676443521] File "/home/
[1676443521] instance = BeautifulSoup(
[1676443521] File "/home/
[1676443521] self._feed()
[1676443521] File "/home/
[1676443521] self.builder.
[1676443521] File "/home/
[1676443521] parser.feed(markup)
[1676443521] File "/usr/lib/
[1676443521] self.goahead(0)
[1676443521] File "/usr/lib/
[1676443521] k = self.parse_
[1676443521] File "/usr/lib/
[1676443521] return self.parse_
[1676443521] File "/usr/lib/
[1676443521] self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
[1676443521] File "/usr/lib/
[1676443521] raise NotImplementedE
[1676443521] NotImplementedE
Bug summary:
This happened in version 4.11.2.
Beautiful Soup 4's parser class inherits from HTMLParser, but does not override error() in that subclass.
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Thanks for filing this issue. This was my mistake: I was removing code that was only needed in versions of Python that Beautiful Soup doesn't support. HTMLParser.error() was removed in Python 3.5, but its superclass *also* implements the same method, and that method wasn't removed until Python 3.10 (see https://github.com/python/cpython/issues/76025).
However, it probably doesn't make a difference, because in every case I've seen, suppressing these errors just means that the bad markup crashes Python's html.parser immediately after the error() call returns, due to either https://github.com/python/cpython/issues/81928 or https://github.com/python/cpython/issues/78661. Every fuzzer issue I've seen so far turned out to be one of those two CPython issues.
I'll restore the error() implementation in the next version, and I'll probably have it raise ParserRejectedMarkup instead of suppressing the error, so that the cause of the problem is clearer.
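
For readers who want to see the shape of such a fix, here is a minimal sketch, not the actual Beautiful Soup code. `StrictHTMLParser` and `feed_or_reject` are hypothetical names, and `ParserRejectedMarkup` is defined locally as a stand-in so the example runs without Beautiful Soup installed. It assumes that older Pythons call `self.error()` when html.parser gives up, and that newer ones raise `AssertionError` directly.

```python
from html.parser import HTMLParser


class ParserRejectedMarkup(Exception):
    """Local stand-in for Beautiful Soup's exception of the same name."""


class StrictHTMLParser(HTMLParser):
    """Hypothetical subclass; not Beautiful Soup's actual parser class."""

    def error(self, message):
        # Called by the base class on older Pythons when it gives up.
        # Without this override, the inherited method raises
        # NotImplementedError, as in the traceback above.
        raise ParserRejectedMarkup(message)


def feed_or_reject(markup):
    """Parse markup; report a parser give-up as ParserRejectedMarkup."""
    parser = StrictHTMLParser()
    try:
        parser.feed(markup)
        parser.close()
    except AssertionError as exc:
        # Assumption: newer Pythons raise AssertionError instead of
        # calling error(), so translate that into the same exception.
        raise ParserRejectedMarkup(exc) from exc
```

With this in place, a caller only has to catch one exception type, whichever way the underlying parser signals that it has given up.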