Smart quotes are inconsistently converted to Unicode characters
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Medium | Unassigned |
Bug Description
Taken from Guillaume Lepert's report at https:/
###
import bs4
a = b'<html>
print bs4.BeautifulSo
###
"Now what's weird here is that the smart codes have been correctly transcoded in utf-8; however the HTML escaped sequences are mangled: \xc2\x93 is not a valid UTF-8 codepoint; but \x93 is the correct windows-1252 codepoint....
So somehow the escaped sequences have been - correctly - transcoded to windows-1252, but then incorrectly translated to UTF-8..."
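The byte sequences in the quoted report can be reproduced in isolation. The following is a minimal Python 3 sketch of one reading of the symptom, not the reporter's original script; the variable names are illustrative. It assumes the entity was resolved to the Unicode code point U+0093 (a C1 control character) rather than to the Windows-1252 character at byte 0x93.

```python
# U+0093 is a C1 control character. Encoding it as UTF-8 yields the
# two bytes quoted in the report.
mangled = "\u0093".encode("utf-8")

# The single byte 0x93 is the left curly quote in Windows-1252.
intended = b"\x93".decode("windows-1252")

# Encoding that character as UTF-8 gives the bytes one would expect
# to see for a correctly converted smart quote.
intended_utf8 = intended.encode("utf-8")

print(mangled)          # b'\xc2\x93'
print(ascii(intended))  # '\u201c'
print(intended_utf8)    # b'\xe2\x80\x9c'
```

Under that assumption, \xc2\x93 is the UTF-8 encoding of U+0093, while the intended character, U+201C, encodes as \xe2\x80\x9c.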
Changed in beautifulsoup:
status: New → Confirmed
importance: Undecided → Medium
Changed in beautifulsoup:
status: Fix Committed → Fix Released
HTML numeric entities are supposed to reference Unicode code points, but &#147; references a Windows-1252 code point (0x93, the left curly quote “).
html5lib already handles these entities correctly. html.parser passes them into handle_charref(), where I've added code to handle them (revision 471). lxml converts them to (the wrong) Unicode characters -- that could be fixed in data() but that's risky because we can't distinguish that from a document where those characters were in the original.
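The kind of handling described for handle_charref() can be sketched as follows. This is a hedged Python 3 illustration, not the code from revision 471: the function name resolve_charref is hypothetical, and the rule it applies (treat references in the range 0x80 to 0x9F as Windows-1252 bytes, fall back to the plain code point otherwise) is an assumption based on the comment above.

```python
def resolve_charref(name):
    """Resolve the text between '&#' and ';' (e.g. '147' or 'x93').

    Hypothetical helper: references in 0x80-0x9F are interpreted as
    Windows-1252 bytes; everything else is a Unicode code point.
    """
    if name[:1] in ("x", "X"):
        code = int(name[1:], 16)
    else:
        code = int(name)

    if 0x80 <= code <= 0x9F:
        try:
            return bytes([code]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes in this range (such as 0x81) are undefined
            # in Windows-1252; fall through to the raw code point.
            pass
    return chr(code)


print(ascii(resolve_charref("147")))  # '\u201c'
print(ascii(resolve_charref("x94")))  # '\u201d'
print(ascii(resolve_charref("233")))  # '\xe9'
```

A mapping like this is only safe at the point where the parser reports a character reference. Once the reference has been turned into a character, as in lxml's data() callback, it can no longer be told apart from a literal U+0093 in the source document, which is the risk noted above.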