Beautiful Soup

BS returns broken head data

Bug #1631353 reported by lukas on 2016-10-07

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

The documentation says:

The contents of a <SCRIPT> tag should not be parsed as HTML.

but when I tried to parse a head section in which a script tag contained an HTML tag inside a string:

(...) h=window.location.protocol+"//",r='<body onload="'; (...)

BS returned:

(...) h=window.location.protocol+"//",r='</script></head></html>

so BS cut off the rest of the head section just before the `<body` text.

Environment: beautifulsoup4==4.5.1, lxml==3.6.4, Python 2.7

Leonard Richardson (leonardr) wrote on 2016-12-10:

Thanks for taking the time to file this bug. It's easier for me to reproduce issues like this if I have the entire document to look at. When I don't have this information, I do the best I can.

Since I don't have the document that causes this problem for you, I created a hypothetical document based on the markup you provided and the fact that you mentioned a "head section".

from bs4 import BeautifulSoup
doc = """<html><head><script>
h=window.location.protocol+"//",r='<body onload="';
</script></head></html>"""

I ran this document through Beautiful Soup using the lxml parser:

soup = BeautifulSoup(doc, "lxml")
print(soup)

Here's the output:

<html><head><script>
h=window.location.protocol+"//",r='<body onload="';
</script></head></html>

As you can see, I can't duplicate the problem. If you'd like to send me the full document that causes the problem for you, feel free to reopen this issue.
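
For anyone who wants to check the "script contents are not parsed as HTML" expectation without Beautiful Soup or lxml installed, here is a minimal sketch using Python 3's standard-library `html.parser`. It does not reproduce the lxml setup from the report; it only shows what the expected behaviour looks like with a different parser. The `TagRecorder` class and the variable names are hypothetical helpers made up for this illustration, and the document is the same hypothetical one used above.

```python
# Sketch only: uses Python 3's standard-library HTMLParser, not lxml, so it
# illustrates the expected behaviour rather than the reported one.
from html.parser import HTMLParser

# Same hypothetical document as in the reproduction attempt above.
DOC = """<html><head><script>
h=window.location.protocol+"//",r='<body onload="';
</script></head></html>"""


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags and the text inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []      # every start tag the parser reports
        self.script_chunks = []   # raw text seen while inside <script>
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chunks.append(data)


recorder = TagRecorder()
recorder.feed(DOC)
recorder.close()

# Join the chunks, since the parser may deliver text in several pieces.
script_text = "".join(recorder.script_chunks)
print(recorder.start_tags)
print('<body onload="' in script_text)
```

If the parser treats the script body as raw text, `body` does not appear among the recorded start tags and the second printed value is `True`. Running the real problem document through a check like this, with each parser in turn, may help narrow down whether the truncation comes from the parser or from the markup itself.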

Changed in beautifulsoup:
status:	New → Invalid
