Ampersands should always be escaped, even if they look like entities
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
This test made sense for Beautiful Soup 3, but not 4:
----
----
Depending on how BS3 parsed a document, entities like "&Aacute;" might or might not be turned into the corresponding Unicode characters. If you saw "&Aacute;" in a document, there was no way to tell whether the original document said "&Aacute;" or "&amp;Aacute;". So we had code that only turned "&" into "&amp;" if it looked like the "&" was not the beginning of an entity.
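As a rough illustration of that kind of heuristic, here is a minimal sketch. It is not the actual Beautiful Soup 3 code; the pattern `BARE_AMPERSAND` and the function `escape_bare_ampersands` are hypothetical names made up for this example.

```python
import re

# Hypothetical pattern: an "&" that is NOT followed by something shaped like
# an entity (a decimal, hex, or named reference ending in ";").
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|\w+;)")


def escape_bare_ampersands(text):
    """Escape only the ampersands that do not look like the start of an entity."""
    return BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands("AT&T"))      # AT&amp;T
print(escape_bare_ampersands("&Aacute;"))  # &Aacute; (left alone: looks like an entity)
```

The weakness is visible in the second call: text that merely looks like an entity passes through unescaped, so the output cannot distinguish a literal "&Aacute;" from an entity reference.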
In Beautiful Soup 4, entities are always turned into the corresponding Unicode characters. So there's no reason not to turn "&" into "&amp;".
The one wrinkle is entities that aren't HTML entities, like '&foo;'. You could argue that '&foo;' should come out the same way it went in, instead of being turned into "&amp;foo;". But that's not the way it works now, and no one has complained.
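The round trip described above can be sketched with Python's standard `html` module. This is a stand-in for the parse and output steps, not Beautiful Soup's own code; the helper names `parse_text`, `output_text`, and `round_trip` are hypothetical.

```python
import html


def parse_text(markup_text):
    """Stand-in for parsing: known entities become Unicode characters."""
    return html.unescape(markup_text)


def output_text(text):
    """Stand-in for output: every ampersand is escaped, no heuristics."""
    return html.escape(text, quote=False)


def round_trip(markup_text):
    return output_text(parse_text(markup_text))


print(round_trip("&Aacute;"))      # Á
print(round_trip("&amp;Aacute;"))  # &amp;Aacute;
print(round_trip("&foo;"))         # &amp;foo;
```

Because known entities are always decoded on the way in, any ampersand still present in the parsed text is a literal one, and escaping all of them is unambiguous. The last call shows the wrinkle: '&foo;' is not a known entity, so it survives parsing as literal text and comes out as "&amp;foo;".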
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Fixed in revision 301.