Beautiful Soup

strings generator improperly adds semicolon to unescaped ampersand

Bug #1685044 reported by Mike Ottum on 2017-04-20

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	Won't Fix	Undecided	Unassigned

Bug Description

In the below example, the strings generator replaces the character sequence `&b` with `&b;`, presumably because the ampersand is unescaped. While the HTML is invalid, character sequences of this sort often exist in HTML in the wild, and it would be best for BeautifulSoup to leave the ampersand alone.

Actual behavior:
```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>example.com/foo?a=1&b=6</p>', 'html.parser')
>>> [s for s in soup.strings]
[u'example.com/foo?a=1&b;=6']
```

Desired behavior:
```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>example.com/foo?a=1&b=6</p>', 'html.parser')
>>> [s for s in soup.strings]
[u'example.com/foo?a=1&b=6']
```

Revision history for this message

Mike Ottum (ottumm) wrote on 2017-04-20:

Should have mentioned, this is with BeautifulSoup 4.5.3 and Python 2.7.10.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2017-05-07:

This behavior is specific to html.parser and I don't think there's any good way to fix it in Beautiful Soup. In these two cases, html.parser calls the same code (handle_entityref("foo")) while parsing:

&foo; then
&foo then

So there's no way to distinguish between a stray ampersand and an unrcognized entity that's missing its semicolon. We can't handle both "a=1&b=6" and "&foo;" in an intuitive way. Either both of these get semicolons at the end, or neither of them do.

Maybe we should handle "a=1&b=6" in an intuitive way and let "&foo;" not work.

Changed in beautifulsoup:
status:	New → Confirmed

Revision history for this message

Leonard Richardson (leonardr) wrote on 2017-05-07:

I haven't decided what to do about this, but if I decide to handle "a=1&b=6" and break "&foo;" then I think the patch will look like this:

=== modified file 'bs4/builder/_htmlparser.py'
--- bs4/builder/_htmlparser.py 2017-05-07 01:31:10 +0000
+++ bs4/builder/_htmlparser.py 2017-05-07 02:38:30 +0000
@@ -141,7 +141,7 @@
         if character is not None:
             data = character
         else:
- data = "&%s;" % name
+ data = "&%s" % name
         self.handle_data(data)

def handle_comment(self, data):

Revision history for this message

Mike Ottum (ottumm) wrote on 2017-05-08:

Thanks for looking into this! If you have to choose one, I think it's likely better to preserve correct HTML (e.g. ' ') than to preserve incorrect HTML (e.g. 'a=b&b=c'). If I'm reading your comments correctly, that would imply that no change should be made here.

FYI, I've fixed my particular use case by switch to `html5parser`.

Leonard Richardson (leonardr) on 2018-07-15

Changed in beautifulsoup:
status:	Confirmed → Won't Fix

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.