BS returns broken head data
Bug #1631353 reported by lukas
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
The documentation says:
The contents of a <SCRIPT> tag should not be parsed as HTML.
But when I tried to parse a head section in which a script tag contained an HTML tag:
(...) h=window.
BS returned:
(...) h=window.
So BS splits off the rest of the head section just before the `<body` text.
beautifulsoup4=
Thanks for taking the time to file this bug. It's easier for me to reproduce issues like this if I have the entire document to look at. When I don't have this information, I do the best I can.
Since I don't have the document that causes this problem for you, I created a hypothetical document based on the markup you provided and the fact that you mentioned a "head section".
from bs4 import BeautifulSoup
doc = """<html>
<head><script>
h=window.location.protocol+"//",r='<body onload="';
</script>
</head>
</html>
"""
I ran this document through Beautiful Soup using the lxml parser:
soup = BeautifulSoup(doc, "lxml")
print(soup)
Here's the output:
<html>
<head><script>
h=window.location.protocol+"//",r='<body onload="';
</script>
</head>
</html>
As you can see, I can't duplicate the problem. If you'd like to send me the full document that causes the problem for you, feel free to reopen this issue.
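For anyone who wants to check a document of their own before reopening, the following is a minimal sketch of the same kind of test. It uses only Python's standard-library `HTMLParser` rather than Beautiful Soup with lxml, so it is not the method used above. The `DOC` string is a reading of the hypothetical document in this comment, and `ScriptAwareCollector` and `collect` are illustrative names, not part of any library. It records which start tags the parser sees and what text ends up inside `<script>`.

```python
from html.parser import HTMLParser

# Hypothetical document modeled on the one in the comment above (an assumption,
# not the reporter's real page).
DOC = """<html>
<head><script>
h=window.location.protocol+"//",r='<body onload="';
</script>
</head>
</html>
"""


class ScriptAwareCollector(HTMLParser):
    """Records every start tag and the raw text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in more than one chunk, so collect and join.
        if self._in_script:
            self.script_chunks.append(data)


def collect(doc):
    """Parse doc and return (list of start tags, text inside <script>)."""
    parser = ScriptAwareCollector()
    parser.feed(doc)
    parser.close()
    return parser.start_tags, "".join(parser.script_chunks)


start_tags, script_text = collect(DOC)
print(start_tags)
print(script_text)
```

If the script contents are treated as raw text, `body` should not appear among the start tags and the `<body onload="` fragment should appear inside the script text. Swapping `DOC` for the full failing page is the quickest way to see whether the markup itself, rather than the parser, is what splits the head section.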