BS returns broken head data

Bug #1631353 reported by lukas
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The documentation says:

The contents of a <SCRIPT> tag should not be parsed as HTML.

but when I tried to parse a head section in which a script tag contained an HTML tag:

(...) h=window.location.protocol+"//",r='<body onload="'; (...)

BS returns:

(...) h=window.location.protocol+"//",r='</script></head></html>

so BS cuts off the rest of the head section just before the `<body` text.

beautifulsoup4==4.5.1 + lxml==3.6.4 + python 2.7

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this bug. It's easier for me to reproduce issues like this if I have the entire document to look at. When I don't have this information, I do the best I can.

Since I don't have the document that causes this problem for you, I created a hypothetical document based on the markup you provided and the fact that you mentioned a "head section".

from bs4 import BeautifulSoup
doc = """<html><head><script>
h=window.location.protocol+"//",r='<body onload="';
</script></head></html>"""

I ran this document through Beautiful Soup using the lxml parser:

soup = BeautifulSoup(doc, "lxml")
print(soup)

Here's the output:

<html><head><script>
h=window.location.protocol+"//",r='<body onload="';
</script></head></html>

As you can see, I can't duplicate the problem. If you'd like to send me the full document that causes the problem for you, feel free to reopen this issue.
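For anyone retrying this against their own document, the comparison above can also be done programmatically rather than by eye. The following is only a minimal sketch: `script_kept_intact`, `DOC`, and `MARKER` are names made up for illustration, and `DOC` is the hypothetical document from this comment, not the reporter's real page. It assumes the script's text should still contain the `<body onload="` fragment and that no real `<body>` element should appear in the tree.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical document from the comment above, not the reporter's real page.
DOC = """<html><head><script>
h=window.location.protocol+"//",r='<body onload="';
</script></head></html>"""

# The fragment inside the script that looks like an HTML tag.
MARKER = '<body onload="'


def script_kept_intact(doc, parser):
    """Return True if the parser left the <script> text unparsed."""
    soup = BeautifulSoup(doc, parser)
    script = soup.find("script")
    text = script.string if script is not None else None
    # The marker must still be inside the script text, and no real <body>
    # element should have been created from it.
    return text is not None and MARKER in text and soup.find("body") is None


if __name__ == "__main__":
    # Try the parser from this report first, then the built-in one.
    for parser in ("lxml", "html.parser"):
        try:
            print(parser, script_kept_intact(DOC, parser))
        except FeatureNotFound:
            print(parser, "not installed")
```

Running the same check on the full original document, with the same parser and versions, would show whether the truncation comes from the markup surrounding the script rather than from the script contents themselves.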

Changed in beautifulsoup:
status: New → Invalid