Beautiful Soup

BS4 stops parsing after malformed tag

Bug #972524 reported by Simon Derr on 2012-04-03

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

Hello,

BeautifulSoup stops parsing an HTML file after it encounters a malformed tag like:

<option value="13"selected>325 000</option>

The program seems to work, but part of the HTML file has not been parsed.

Fixing the HTML like this fixes the issue:

<option value="13" selected>325 000</option>

(Note the extra space before `selected')

Regards,

Simon

Leonard Richardson (leonardr) wrote on 2012-04-03:

This is a difference between HTML parsers, not a problem with Beautiful Soup. This test code illustrates the way different parsers handle your markup:

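---
from bs4 import BeautifulSoup

# Markup with no space between the quoted value and the next attribute.
data = 'before<option value="13"selected>325 000</option>after'

# Parse the same markup with each parser and compare the results.
for parser in ('html5lib', 'lxml', 'html.parser'):
    print(parser + ":", BeautifulSoup(data, parser))
---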

Output:

html5lib: <html><head></head><body>before<option selected="" value="13">325 000</option>after</body></html>
lxml: <html><body><p>before<option selected="" value="13">325 000</option>after</p></body></html>
html.parser: before

---

You're using Python's built-in HTMLParser, which is known to be less lenient than html5lib or lxml.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

You'll get better results by installing html5lib or lxml.
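
If the code has to run on machines where a more lenient parser may or may not be installed, one option is to pick the parser at runtime. The following is only a minimal sketch: `pick_parser` and `make_soup` are hypothetical helper names, not part of Beautiful Soup, and the preference order (lxml first, then html5lib) is an assumption you may want to change.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser from `preferred` that is installed.

    Hypothetical helper, not a Beautiful Soup API. It relies on the
    parser names passed to BeautifulSoup ("lxml", "html5lib") matching
    the names of the importable modules.
    """
    for name in preferred:
        # find_spec() returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    # Fall back to the parser that ships with Python.
    return "html.parser"


def make_soup(markup):
    """Parse `markup` with the most lenient parser that is available."""
    from bs4 import BeautifulSoup  # imported lazily; requires bs4
    return BeautifulSoup(markup, pick_parser())
```

Calling `make_soup(data)` on the markup from the test code above would then use lxml or html5lib when either is installed, and fall back to `html.parser` otherwise.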

Changed in beautifulsoup:
status:	New → Invalid

Simon Derr (ddrsimon) wrote on 2012-04-03:

Wonderful. Thank you.
