Unsafe handling of slashes in tags (SGML short tags?)
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Beautiful Soup | Fix Released | Undecided | Unassigned | |
Bug Description
Oh waiter - there appears to be a fly in my Beautiful Soup!
I crafted an input string that, when parsed by BeautifulSoup, produces soup that appears to contain one harmless tag (according to BeautifulSoup's parse tree and programmatic interface), but that, when rendered and fed to Firefox, actually contains one harmless tag plus one malicious tag.
If BeautifulSoup is used to build an HTML sanitizer (one intended to strip every tag that might contain JavaScript), then this bug can make the sanitizer insecure in a way that no one would reasonably expect.
This is probably easiest to show by example, using BeautifulSoup in the Python interpreter:
```
$ python
>>> from BeautifulSoup import BeautifulSoup
>>> s = BeautifulSoup('<IMG/SRC><IMG SRC="none.png" ONERROR="alert(1)">/')
>>> s.findAll(True)
[<img />]
>>> s.findAll(text=True)
[u'SRC><IMG SRC="none.png" ONERROR="alert(1)">']
```
So, at this point BeautifulSoup's programmatic interface thinks that the soup contains one harmless IMG tag with no attributes (`<img />`), followed by some crazy text (which BeautifulSoup thinks contains no markup at all).
Now imagine an HTML sanitizer that iterates over the tags, removes every node that has an ONERROR attribute, and then outputs the result. Expected result: the output should not contain any markup. Actual result: the output contains markup; at least, if you feed the output to Firefox, Firefox parses it in a way that finds a tag with an ONERROR attribute.
```
>>> for tag in s.findAll(True):
...     for name, value in tag.attrs:
...         if name == 'onerror':
...             tag.extract()
...
>>> s
<img />SRC><IMG SRC="none.png" ONERROR="alert(1)">
```
You can see that the output from this HTML sanitizer is some strange gunk. If you feed this strange gunk into Firefox, Firefox parses it as two IMG tags. The second IMG tag has the attribute SRC="none.png" plus an ONERROR attribute whose value is the attacker's script.
The bug is that the output of BeautifulSoup is not in a form that enables us to predict accurately how it will be parsed, and as a result, BeautifulSoup's parse tree does not match Firefox's parse tree. The desired behavior is that if we have a parse tree in BeautifulSoup, then printing it out and feeding it to any modern browser's parser should yield the same parse tree (or one that is morally equivalent).
Suggestion: When BeautifulSoup converts its parse tree to a string, it should probably apply HTML escaping to everything it thinks is a text node (at a minimum, convert `<` to `&lt;`, `>` to `&gt;`, and `&` to `&amp;`), to ensure that the output will be parsed in a predictable way. For similar reasons, BeautifulSoup should probably canonicalize and apply HTML escaping to comments (inside the comment, do the same three conversions, and for technical reasons also convert `-` to `&#45;`; the latter avoids craziness related to SGML comments, nested comments, etc.).
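The escaping suggested above can be sketched with the stdlib `html.escape` (BS3 itself is Python 2 era, so this illustrates the rule, not the actual patch; the function names here are hypothetical):

```python
from html import escape  # stdlib; replaces & first, then < and >


def render_text_node(text):
    # Entity-encode so the serialized text can never re-parse as markup:
    # & -> &amp;, < -> &lt;, > -> &gt;
    return escape(text, quote=False)


def render_comment(text):
    # Same three conversions, plus - -> &#45; to sidestep SGML
    # comment / nested-comment craziness.
    return '<!--%s-->' % escape(text, quote=False).replace('-', '&#45;')


print(render_text_node('SRC><IMG SRC="none.png" ONERROR="alert(1)">'))
# -> SRC&gt;&lt;IMG SRC="none.png" ONERROR="alert(1)"&gt;
```

With this treatment, the leftover text from the exploit string serializes as inert entities rather than as a second, parseable IMG tag.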
Severity: I just analyzed a bunch of HTML sanitizers built using BeautifulSoup. I was able to break almost every one of them with the above input string. This appears likely to break almost any HTML sanitizer built on BeautifulSoup.
Alternatively, here is a different way to look at why this is undesirable. A basic invariant of BeautifulSoup should be that rendering is idempotent: parsing a soup's rendered output and rendering it again should produce the same string (and the same parse tree). This input violates that invariant:
```
>>> s = BeautifulSoup('<IMG/SRC><IMG SRC="none.png" ONERROR="alert(1)">/')
>>> s.renderContents()
'<img />SRC><IMG SRC="none.png" ONERROR="alert(1)">'
>>> BeautifulSoup(s.renderContents()).renderContents()
'<img />SRC><img src="none.png" onerror="alert(1)" />'
>>> s.findAll(True)
[<img />]
>>> BeautifulSoup(s.renderContents()).findAll(True)
[<img />, <img src="none.png" onerror="alert(1)" />]
```
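The invariant can be expressed as a tiny, hypothetical round-trip check, where `parse` and `render` stand in for `BeautifulSoup` and `renderContents`:

```python
def render_is_stable(parse, render, markup):
    # A serializer is safe in this sense when its output, re-parsed
    # and re-serialized, comes back byte-for-byte identical.
    once = render(parse(markup))
    twice = render(parse(once))
    return once == twice


# Trivial demonstration with an identity "parser": always stable.
print(render_is_stable(lambda m: m, lambda t: t, '<img />'))  # -> True
```

Running this check with BS3's parser and serializer on the exploit string above would return False, which is exactly the bug.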
I'm using the version of BeautifulSoup packaged by Fedora:
python-
python-
Beautiful Soup 3 doesn't encode characters like < and > on output. The good news is that Beautiful Soup 4 does, unless you deliberately prevent it.
I've changed BS3 to give strings the same treatment it gives attribute values. Angle brackets and bare ampersands will be converted to XML entities on the way out. I'll do a 3.2.1 BS3 release with just this bugfix.
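My reading of that treatment, as a sketch (not the actual BS3 patch; the name and "bare ampersand" regex are my own): angle brackets always become entities, and only bare ampersands, i.e. ones not already starting an entity, get converted:

```python
import re

# Ampersands not already beginning a named or numeric character entity.
BARE_AMPERSAND = re.compile(r'&(?!#\d+;|#x[0-9a-fA-F]+;|[A-Za-z]\w*;)')


def escape_string_like_attribute(text):
    # Bare & first, so the &lt;/&gt; we introduce are not double-escaped.
    text = BARE_AMPERSAND.sub('&amp;', text)
    return text.replace('<', '&lt;').replace('>', '&gt;')


print(escape_string_like_attribute('SRC><IMG SRC="none.png">'))
# -> SRC&gt;&lt;IMG SRC="none.png"&gt;
```

Applied on the way out, this makes the sanitizer output above render as visible text in the browser instead of a second IMG tag.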