strings generator improperly adds semicolon to unescaped ampersand

Bug #1685044 reported by Mike Ottum
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Won't Fix
Undecided
Unassigned

Bug Description

In the below example, the strings generator replaces the character sequence `&b` with `&b;`, presumably because the ampersand is unescaped. While the HTML is invalid, character sequences of this sort often exist in HTML in the wild, and it would be best for BeautifulSoup to leave the ampersand alone.

Actual behavior:
```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>example.com/foo?a=1&b=6</p>', 'html.parser')
>>> [s for s in soup.strings]
[u'example.com/foo?a=1&b;=6']
```

Desired behavior:
```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>example.com/foo?a=1&b=6</p>', 'html.parser')
>>> [s for s in soup.strings]
[u'example.com/foo?a=1&b=6']
```

Revision history for this message
Mike Ottum (ottumm) wrote :

Should have mentioned, this is with BeautifulSoup 4.5.3 and Python 2.7.10.

Revision history for this message
Leonard Richardson (leonardr) wrote :

This behavior is specific to html.parser and I don't think there's any good way to fix it in Beautiful Soup. In these two cases, html.parser calls the same code (handle_entityref("foo")) while parsing:

&foo; then
&foo then

So there's no way to distinguish between a stray ampersand and an unrcognized entity that's missing its semicolon. We can't handle both "a=1&b=6" and "&foo;" in an intuitive way. Either both of these get semicolons at the end, or neither of them do.

Maybe we should handle "a=1&b=6" in an intuitive way and let "&foo;" not work.

Changed in beautifulsoup:
status: New → Confirmed
Revision history for this message
Leonard Richardson (leonardr) wrote :

I haven't decided what to do about this, but if I decide to handle "a=1&b=6" and break "&foo;" then I think the patch will look like this:

=== modified file 'bs4/builder/_htmlparser.py'
--- bs4/builder/_htmlparser.py 2017-05-07 01:31:10 +0000
+++ bs4/builder/_htmlparser.py 2017-05-07 02:38:30 +0000
@@ -141,7 +141,7 @@
         if character is not None:
             data = character
         else:
- data = "&%s;" % name
+ data = "&%s" % name
         self.handle_data(data)

     def handle_comment(self, data):

Revision history for this message
Mike Ottum (ottumm) wrote :

Thanks for looking into this! If you have to choose one, I think it's likely better to preserve correct HTML (e.g. '&nbsp;') than to preserve incorrect HTML (e.g. 'a=b&b=c'). If I'm reading your comments correctly, that would imply that no change should be made here.

FYI, I've fixed my particular use case by switch to `html5parser`.

Changed in beautifulsoup:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.