Unknown Status Keyword in ParserBase raises Python Not Implemented

Bug #1708831 reported by Adam York
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
Beautiful Soup   Fix Released   Undecided    Unassigned

Bug Description

I am using Python 3.6 and have lxml 3.8.0 installed.
I'm using requests to fetch the response content and parsing it with Beautiful Soup, which is set to use "html.parser".
I'm sorry, but I don't know which URL caused the issue, because this runs as a batch process that I use to crawl some web sites. I will try different parser settings to see whether I get the same error.

This code snippet shows how Beautiful Soup is invoked.

        response = requests.get(url,
                                headers=settings.crawler.headers,
                                verify=False,
                                allow_redirects=True,
                                timeout=settings.crawler.http_requests_time_out)
        if response.status_code == requests.codes.ok:
            soup = BeautifulSoup(response.content, "html.parser")

---------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "crawler/__main__.py", line 187, in <module>
    main()
  File "crawler/__main__.py", line 184, in main
    crawl_news_and_media()
  File "crawler/__main__.py", line 23, in crawl_news_and_media
    start_domain_crawl(topic)
  File "crawler/__main__.py", line 45, in start_domain_crawl
    result = get_domain_resources(url)
  File "/home/adam/PycharmProjects/axilBitsSearchProject/axilbits_search/crawler/domain.py", line 25, in get_domain_resources
    soup = BeautifulSoup(response.content, "html.parser")
  File "/usr/local/lib/python3.6/site-packages/bs4/__init__.py", line 228, in __init__
    self._feed()
  File "/usr/local/lib/python3.6/site-packages/bs4/__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python3.6/site-packages/bs4/builder/_htmlparser.py", line 215, in feed
    parser.feed(markup)
  File "/usr/local/lib/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/local/lib/python3.6/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/local/lib/python3.6/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/local/lib/python3.6/_markupbase.py", line 159, in parse_marked_section
    self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
  File "/usr/local/lib/python3.6/_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
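
Since the batch run did not record which URL triggered the traceback above, one possible way to capture it next time is to wrap the parse call and log the URL on failure. This is only a minimal sketch: `parse_or_skip` and the "crawler" logger name are hypothetical and not part of the reporter's code or of Beautiful Soup.

```python
import logging

logger = logging.getLogger("crawler")  # hypothetical logger name


def parse_or_skip(url, markup, parse):
    """Return parse(markup), or None if the parser raises.

    The failing URL and the full traceback are logged so the offending
    document can be fetched again and attached to a bug report.
    """
    try:
        return parse(markup)
    except Exception:
        # logger.exception records the message plus the active traceback.
        logger.exception("parser failed for %s", url)
        return None


# Possible use inside the crawl loop (assumes bs4 is installed):
#   soup = parse_or_skip(url, response.content,
#                        lambda m: BeautifulSoup(m, "html.parser"))
#   if soup is None:
#       continue  # skip this page and keep crawling
```

Catching `Exception` broadly is a deliberate choice for a batch crawler, where one bad page should not stop the run; a narrower `except` clause may be preferable elsewhere.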

Leonard Richardson (leonardr) wrote :

A number of others have reported this problem in the past year (https://groups.google.com/forum/#!topic/beautifulsoup/EFNH2oxOX4A, https://stackoverflow.com/questions/49786893/python-beautifulsoup-error-while-scraping), but none of the reports included the specific markup that caused the problem, and I haven't been able to duplicate it. However, the solution is pretty clear: BeautifulSoupHTMLParser should implement error() and do something with the error message rather than raise an exception.

This change is in revision 454. Since I can't reproduce the issue I can't guarantee that Beautiful Soup will turn such a document into anything useful, but it will no longer raise an exception.
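
For illustration, here is a rough sketch of the general idea using only the standard library. It is not the code from revision 454. `TolerantParser`, `MarkupRejected`, and `feed_tolerantly` are hypothetical names. The sketch overrides error() so that the base-class NotImplementedError is never raised, aborts parsing with its own exception, and catches that exception in a wrapper so the caller sees a recorded message instead of a crash. On Python versions where the standard library no longer calls error(), the override is simply unused.

```python
from html.parser import HTMLParser


class MarkupRejected(Exception):
    """Hypothetical exception used to stop parsing from inside error()."""


class TolerantParser(HTMLParser):
    """Hypothetical HTMLParser subclass that survives parser errors."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # tags seen before any failure
        self.problems = []    # error messages collected while feeding

    def error(self, message):
        # Overrides the ParserBase hook named in the traceback. Raising a
        # dedicated exception stops the parse without relying on the
        # standard-library code continuing sensibly after error() returns.
        raise MarkupRejected(message)

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def feed_tolerantly(self, markup):
        """Feed markup; record a parser error instead of propagating it."""
        try:
            self.feed(markup)
        except MarkupRejected as exc:
            self.problems.append(str(exc))


parser = TolerantParser()
parser.feed_tolerantly("<p><b>hello</b></p>")
```

Whatever was parsed before the failure stays available in `start_tags`, which matches the caveat above: the result may not be useful, but the caller no longer gets an unhandled exception.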

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released