Beautiful Soup

Unexpected exception: TypeError: cannot unpack non-iterable NoneType object

Bug #1883104 reported by jvoisin on 2020-06-11

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

Running the script below against the input file below, with beautifulsoup4 version 4.9.1, produces the following stack trace:

```
$ python3 bs4_repro.py crash-7b7ff74a3ccefdf713361731ee391c24592bd6509f257b1f98193d87b35cd6c8
/home/jvoisin/dev/pythonfuzz/ven/lib/python3.8/site-packages/bs4/builder/_htmlparser.py:102: UserWarning: expected name token at '<![- -<\x10</hlre><hr>m'
  warnings.warn(msg)
Traceback (most recent call last):
  File "bs4_repro.py", line 14, in <module>
    main()
  File "bs4_repro.py", line 12, in main
    BeautifulSoup(buf, features=parsers[idx]).prettify()
  File "/home/jvoisin/dev/pythonfuzz/ven/lib/python3.8/site-packages/bs4/__init__.py", line 345, in __init__
    self._feed()
  File "/home/jvoisin/dev/pythonfuzz/ven/lib/python3.8/site-packages/bs4/__init__.py", line 431, in _feed
    self.builder.feed(self.markup)
  File "/home/jvoisin/dev/pythonfuzz/ven/lib/python3.8/site-packages/bs4/builder/_htmlparser.py", line 377, in feed
    parser.feed(markup)
  File "/usr/lib/python3.8/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib/python3.8/html/parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib/python3.8/html/parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib/python3.8/_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
TypeError: cannot unpack non-iterable NoneType object
zsh: exit 1 python3 bs4_repro.py
```

Reproducing script:

```
from bs4 import BeautifulSoup
import sys

def main():
    with open(sys.argv[1], 'rb') as f:
        buf = f.read()
        parsers = ['lxml-xml', 'html5lib', 'html.parser', 'lxml']
        try:
            idx = buf[0] % len(parsers)  # first input byte selects the parser
        except IndexError:  # empty input file: nothing to parse
            return
        BeautifulSoup(buf, features=parsers[idx]).prettify()

main()

```

Input file (use `xxd -r` to transform the hexdump into a file):

```
$ xxd crash-7b7ff74a3ccefdf713361731ee391c24592bd6509f257b1f98193d87b35cd6c8
00000000: 0a68 745c 6e74 6f75 6368 656e 646d 6c3e .ht\ntouchendml>
00000010: 3c42 6f64 793e 7fff ffff 643e 0002 693e <Body>....d>..i>
00000020: 2d75 6c59 743c 7472 743e 3c3c 6474 3e3e -ulYt<trt><<dt>>
00000030: 3cd5 7265 3c2f 6c69 3e3c 215b 2d20 2d3c <.re</li><![- -<
00000040: 103c 2f68 6c72 653e 3c68 723e 6d6c 6f6e .</hlre><hr>mlon
00000050: 7265 7365 743e 0a3c 6474 3e3c 70ae 653e reset>.<dt><p.e>
00000060: 3c68 723e
```
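
If `xxd` is not available, the same file can be rebuilt in Python. This is only a sketch: `hexdump_to_bytes` is a hypothetical helper written for this report, and it assumes the default `xxd` layout shown above (offset, up to eight 4-digit hex groups, then an ASCII column). It also shows which parser the reproducing script picks for this input.

```python
import re

# Hexdump copied from the report (default xxd layout).
HEXDUMP = r"""
00000000: 0a68 745c 6e74 6f75 6368 656e 646d 6c3e .ht\ntouchendml>
00000010: 3c42 6f64 793e 7fff ffff 643e 0002 693e <Body>....d>..i>
00000020: 2d75 6c59 743c 7472 743e 3c3c 6474 3e3e -ulYt<trt><<dt>>
00000030: 3cd5 7265 3c2f 6c69 3e3c 215b 2d20 2d3c <.re</li><![- -<
00000040: 103c 2f68 6c72 653e 3c68 723e 6d6c 6f6e .</hlre><hr>mlon
00000050: 7265 7365 743e 0a3c 6474 3e3c 70ae 653e reset>.<dt><p.e>
00000060: 3c68 723e
"""

HEX_GROUP = re.compile(r"[0-9a-fA-F]{4}")


def hexdump_to_bytes(dump: str) -> bytes:
    """Hypothetical helper: decode an xxd-style dump back into bytes."""
    out = bytearray()
    for line in dump.strip().splitlines():
        # Drop the offset column, keep everything after the first colon.
        _, _, rest = line.partition(":")
        groups = []
        for token in rest.split():
            # Stop at the ASCII column: after eight groups, or at the
            # first token that is not a 4-digit hex group.
            if len(groups) == 8 or not HEX_GROUP.fullmatch(token):
                break
            groups.append(token)
        out += bytes.fromhex("".join(groups))
    return bytes(out)


data = hexdump_to_bytes(HEXDUMP)

# Same selection rule as the reproducing script: first byte modulo 4.
parsers = ['lxml-xml', 'html5lib', 'html.parser', 'lxml']
selected = parsers[data[0] % len(parsers)]
```

The first byte is 0x0a (10), and 10 % 4 == 2, so the script selects `html.parser`, which agrees with the `_htmlparser.py` frames in the traceback. Writing `data` to disk in binary mode gives the 100-byte crashing file.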

jvoisin (julien-voisin) wrote on 2020-06-11:

Attachment: Crashing file (100 bytes, application/octet-stream)

Leonard Richardson (leonardr) wrote on 2020-06-11:

Thanks for taking the time to file this issue. The problem here is in Python 3's html.parser library. Here's a script that duplicates the error without using any Beautiful Soup code:

---
from html.parser import HTMLParser
import warnings

bad_markup = '\nht\\ntouchendml><Body>\x7fÿÿÿd>\x00\x02i>-ulYt<trt><<dt>><Õre</li><![- -<\x10</hlre><hr>mlonreset>\n<dt><p®e><hr>'

class MyParser(HTMLParser):
    def error(self, msg):
        warnings.warn(msg)

parser = MyParser()
parser.feed(bad_markup)
---

Someone else filed this issue against Python last year: https://bugs.python.org/issue37747

I've updated it with a link to this ticket and a copy of my duplication script.
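
For readers wondering how a warning turns into a `TypeError`, here is a rough, self-contained sketch of the pattern the traceback suggests. The functions below are simplified stand-ins written for illustration, not the standard library code: `scan_name` reports bad input only by calling an `error` callback and has no `return` after that call, so when the callback warns instead of raising (as the `error` override in the script above does), the function falls through and returns `None`, and the caller's tuple unpacking fails.

```python
import warnings


def scan_name(text, error):
    """Stand-in scanner: return (name, end_index) or report via error()."""
    if text[:1].isalpha():
        end = 1
        while end < len(text) and text[end].isalnum():
            end += 1
        return text[:end].lower(), end
    # No return after this call: the function only "fails" if error() raises.
    error("expected name token at %r" % text[:20])


def raising_error(msg):
    # Error handler that raises, so scan_name never falls through.
    raise ValueError(msg)


def warning_error(msg):
    # Error handler that only warns, like the override in the script above.
    warnings.warn(msg)


def parse_marked_section(text, error):
    # Assumes scan_name always returns a 2-tuple, which is not true
    # when the error handler returns normally.
    name, end = scan_name(text, error)
    return name, end


def outcome(text, error):
    """Run the parse and describe what happened, without crashing."""
    try:
        return parse_marked_section(text, error)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__
```

With these stand-ins, `outcome("- -<", raising_error)` gives `'ValueError'` and `outcome("- -<", warning_error)` gives `'TypeError'`, while well-formed input such as `"CDATA[x"` parses the same way with either handler. Under that reading, the crash depends on the handler not raising, which would explain why the reduced script needs the `error` override to show the same `TypeError`.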