elements between head and body cause traversal to fail

Bug #1237763 reported by David Hull on 2013-10-10
This bug affects 2 people
Affects: Beautiful Soup
Importance: Undecided
Assigned to: Unassigned

Bug Description

#!/usr/bin/python

# This script attempts to demonstrate what I believe is a parsing or tree
# traversal bug in Beautiful Soup 4.3.2. The SCRIPT element between the HEAD
# and BODY elements causes the children of soup.html to be (head, None, body).
# My guess is that this None element causes Beautiful Soup's various tree
# searching functions to fail to find the body.

from bs4 import BeautifulSoup

content = """
<html>
  <head>
    <title>This is a test</title>
  </head>
  <script type="text/javascript">"hello";</script>
  <body>
    <img src="test.png" alt="This is a test" />
  </body>
</html>
"""

soup = BeautifulSoup(content, 'html5lib')

print 'head: %s' % soup.html.head # Prints head with script element moved inside head.
print 'body: %s' % soup.html.body # Prints "body: None"

# Prints: "head\nNone\nbody\n":
for tag in soup.html.children:
  print tag.name

print 'img: %s' % soup.find('img') # Prints "img: None"

Changed in beautifulsoup:
status: New → Confirmed
Leonard Richardson (leonardr) wrote:

The bug happens when a tag between <head> and <body> is one that the parser moves into <head> (script, meta, link, etc.). It does not happen if the tag gets moved into <body> (p, span, etc.).
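
The head/body split described above can be sketched as a small lookup. This is only an illustration, assuming the tag list from the HTML parsing algorithm's "after head" insertion mode as I read it. The names HEAD_BOUND_TAGS, SPECIAL_TAGS, destination_after_head and triggers_bug are hypothetical helpers, not part of Beautiful Soup or html5lib.

```python
# Start tags that, when seen after </head> and before <body>, are handed back
# to the "in head" rules and so end up inside <head>.
# Assumption: this list follows my reading of the HTML parsing specification.
HEAD_BOUND_TAGS = frozenset([
    "base", "basefont", "bgsound", "link", "meta",
    "noframes", "script", "style", "template", "title",
])

# Start tags with their own special handling in that mode; this sketch does
# not try to model them.
SPECIAL_TAGS = frozenset(["html", "head", "body", "frameset"])


def destination_after_head(tag_name):
    """Return "head" or "body": where a tag placed between </head> and
    <body> is expected to land after html5lib-style parsing."""
    name = tag_name.lower()
    if name in SPECIAL_TAGS:
        raise ValueError("%r has special handling after </head>" % name)
    return "head" if name in HEAD_BOUND_TAGS else "body"


def triggers_bug(tag_name):
    """Per the comment above, only tags moved into <head> triggered the bug."""
    return destination_after_head(tag_name) == "head"
```

For example, triggers_bug("script") is true and triggers_bug("p") is false, which matches the two groups listed in the comment. Swapping each candidate tag into the reproduction script in the bug description is one way to confirm this against a given Beautiful Soup version.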

Changed in beautifulsoup:
status: Confirmed → Fix Committed

Looks like the fix didn't work. See

 https://bugs.launchpad.net/beautifulsoup/+bug/1430633

Changed in beautifulsoup:
status: Fix Committed → Fix Released