BS4 stops parsing after malformed tag

Bug #972524 reported by Simon Derr
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hello,

BeautifulSoup stops parsing an html file after it encounters a malformed tag like:

<option value="13"selected>325 000</option>

The program seems to work but a part of the html file has not been parsed.

Fixing the html like this fixes the issue:

<option value="13" selected>325 000</option>

(Note the extra space before `selected')
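
When the markup can be preprocessed before parsing, the manual fix above can be automated. This is only a rough sketch under stated assumptions: attribute values are double-quoted and contain no `<` or `>`, and the names `TAG`, `GLUED_ATTR`, and `separate_glued_attributes` are hypothetical helpers, not part of Beautiful Soup.

```python
import re

# Assumption: attribute values are double-quoted and contain no '<' or '>'.
TAG = re.compile(r"<[^<>]+>")

# A quoted attribute value immediately followed by a letter, i.e. the next
# attribute name is glued to the closing quote.
GLUED_ATTR = re.compile(r'(=\s*"[^"]*")(?=[A-Za-z])')


def separate_glued_attributes(html):
    """Insert a space where an attribute name directly follows a closing quote."""
    # Only rewrite text inside tags, so ordinary page text is left untouched.
    return TAG.sub(lambda m: GLUED_ATTR.sub(r"\1 ", m.group(0)), html)


fixed = separate_glued_attributes('<option value="13"selected>325 000</option>')
print(fixed)
```

The cleaned string can then be passed to the parser in place of the original markup.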

Regards,

    Simon

Leonard Richardson (leonardr) wrote :

This is a difference between HTML parsers, not a problem with Beautiful Soup. This test code illustrates the way different parsers handle your markup:

---
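from bs4 import BeautifulSoup

# Markup with no whitespace between the quoted value and the next attribute.
data = 'before<option value="13"selected>325 000</option>after'

# Parse the same markup with each parser; html5lib and lxml must be installed.
for parser in ('html5lib', 'lxml', 'html.parser'):
    print(parser + ":", BeautifulSoup(data, parser))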
---

Output:

html5lib: <html><head></head><body>before<option selected="" value="13">325 000</option>after</body></html>
lxml: <html><body><p>before<option selected="" value="13">325 000</option>after</p></body></html>
html.parser: before

---

You're using Python's built-in HTMLParser, which is known to be less lenient than html5lib or lxml.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

You'll get better results by installing html5lib or lxml.
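
Because the set of installed parsers varies between machines, one option is to pick the most lenient parser available at runtime. The sketch below is only an illustration: `pick_parser`, `PREFERRED`, and the preference order are hypothetical choices, not part of Beautiful Soup. The parser names are the ones used in the test code above.

```python
from importlib.util import find_spec

# Assumption: try the lenient third-party parsers first, in this order.
PREFERRED = ("lxml", "html5lib")


def pick_parser(is_installed=lambda name: find_spec(name) is not None):
    """Return the name of the first preferred parser that is installed."""
    for name in PREFERRED:
        if is_installed(name):
            return name
    # Python's built-in parser is always available, but it is less lenient.
    return "html.parser"


print(pick_parser())
```

The returned name can be passed as the second argument, as in `BeautifulSoup(data, pick_parser())`.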

Changed in beautifulsoup:
status: New → Invalid
Simon Derr (ddrsimon) wrote :

Wonderful. Thank you.
