BS4 stops parsing after malformed tag
Bug #972524 reported by Simon Derr
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
Hello,
BeautifulSoup stops parsing an HTML file after it encounters a malformed tag like:
<option value="13"selected>325 000</option>
The program seems to work, but part of the HTML file has not been parsed.
Fixing the HTML like this fixes the issue:
<option value="13" selected>325 000</option>
(Note the extra space before `selected')
Regards,
Simon
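The manual fix described in the report (adding the missing space after a quoted attribute value) can be sketched as a pre-processing step. This is only an illustration under stated assumptions: `add_missing_attribute_spaces` is a hypothetical helper, not part of Beautiful Soup, and it assumes double-quoted values with no `<` or `>` inside them. It is a workaround, not a substitute for a more lenient parser.

```python
import re

# Hypothetical helper for illustration; it is not part of Beautiful Soup.
_TAG = re.compile(r'<[^<>]+>')    # naive: assumes no '<' or '>' inside attribute values
_QUOTED = re.compile(r'"[^"]*"')  # a double-quoted attribute value


def _space_after_value(match):
    # Look at the single character that follows the closing quote.
    following = match.string[match.end():match.end() + 1]
    # Append a space only when a letter is glued to the closing quote.
    return match.group(0) + (' ' if following.isalpha() else '')


def add_missing_attribute_spaces(markup):
    """Insert a space between a quoted value and an attribute name glued to it."""
    return _TAG.sub(
        lambda tag: _QUOTED.sub(_space_after_value, tag.group(0)),
        markup,
    )


print(add_missing_attribute_spaces(
    'before<option value="13"selected>325 000</option>after'
))
```

Only text inside tags is touched, and every quoted value is consumed in order, so values that already have a space or `>` after them are left alone.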
This is a difference between HTML parsers, not a problem with Beautiful Soup. This test code illustrates the way different parsers handle your markup:
---
from bs4 import BeautifulSoup
data = 'before<option value="13"selected>325 000</option>after'
for parser in ('html5lib', 'lxml', 'html.parser'):
    print(parser + ":", BeautifulSoup(data, parser))
---
Output:
html5lib: <html><head></head><body>before<option selected="" value="13">325 000</option>after</body></html>
lxml: <html><body><p>before<option selected="" value="13">325 000</option>after</p></body></html>
html.parser: before
---
You're using Python's built-in HTMLParser, which is known to be less lenient than html5lib or lxml.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You'll get better results by installing html5lib or lxml.
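One way to act on this advice is to ask for the more lenient parsers first and fall back to the built-in one when they are not installed. The following is a minimal sketch, not an official recipe: `make_soup` and `PREFERRED_PARSERS` are hypothetical names, and the ordering simply follows the test code above. It relies on Beautiful Soup raising `FeatureNotFound` when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Assumed preference order; 'html.parser' ships with Python, so it is the last resort.
PREFERRED_PARSERS = ('html5lib', 'lxml', 'html.parser')


def make_soup(markup):
    """Parse markup with the first installed parser in PREFERRED_PARSERS."""
    for name in PREFERRED_PARSERS:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError('no usable parser found')


name, soup = make_soup('before<option value="13" selected>325 000</option>after')
print(name, soup.find('option'))
```

Returning the parser name alongside the soup makes it easy to log which parser was actually used, which matters because the parsers build different trees from the same invalid markup.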