elements between head and body cause traversal to fail
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
#!/usr/bin/python
# This script attempts to demonstrate what I believe is a parsing or tree
# traversal bug in Beautiful Soup 4.3.2. The SCRIPT element between the HEAD
# and BODY elements causes the children of soup.html to be (head, None, body).
# My guess is that this None element causes Beautiful Soup's various tree
# searching functions to fail to find the body.
from bs4 import BeautifulSoup
content = """
<html>
<head>
<title>This is a test</title>
</head>
<script type="text/
<body>
<img src="test.png" alt="This is a test" />
</body>
</html>
"""
soup = BeautifulSoup(
print 'head: %s' % soup.html.head # Prints head with script element moved inside head.
print 'body: %s' % soup.html.body # Prints "body: None"
# Prints: "head\nNone\
for tag in soup.html.children:
    print tag.name
print 'img: %s' % soup.find('img') # Prints "img: None"
Changed in beautifulsoup:
status: New → Confirmed
Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
The bug occurs when a tag that sits between <head> and <body> is moved into <head> by the parser (script, meta, link, etc.). It does not occur when the tag is moved into <body> instead (p, span, etc.).
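To check whether a given document falls into the affected case, something like the following rough sketch may help. It uses only the Python 3 standard library (not Beautiful Soup) and lists the start tags found after </head> and before <body> that a spec-following parser would relocate into <head>. The tag set is taken from the "after head" insertion mode of the HTML Standard; treating that as the relocation rule behind this report is an assumption. The names HEAD_BOUND, GapScanner, and tags_moved_into_head are made up for illustration. Stray text in the gap is ignored for simplicity.

```python
from html.parser import HTMLParser

# Start tags that the HTML Standard's "after head" insertion mode processes
# with the "in head" rules, i.e. tags that end up inside <head>.
HEAD_BOUND = {"base", "basefont", "bgsound", "link", "meta", "noframes",
              "script", "style", "template", "title"}


class GapScanner(HTMLParser):
    """Collect head-bound start tags seen after </head> and before the body."""

    def __init__(self):
        super().__init__()
        self.in_gap = False   # True between </head> and the start of the body
        self.moved = []       # head-bound tags found in that gap

    def handle_starttag(self, tag, attrs):
        if not self.in_gap:
            return
        if tag in HEAD_BOUND:
            self.moved.append(tag)
        else:
            # <body>, or any other tag (which implicitly opens the body),
            # ends the gap.
            self.in_gap = False

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_gap = True


def tags_moved_into_head(markup):
    """Return the tags in markup that would be relocated into <head>."""
    scanner = GapScanner()
    scanner.feed(markup)
    scanner.close()
    return scanner.moved
```

A non-empty result means the document has the shape described above (a script, meta, link, or similar tag between </head> and <body>); an empty result means any tag in the gap would land in <body>, which the comment above reports as unaffected.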