Smart quotes are inconsistently converted to Unicode characters

Bug #1782933 reported by Leonard Richardson
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Taken from Guillaume Lepert's report at https://groups.google.com/forum/#!topic/beautifulsoup/laZLrma_W5I

###
import bs4
a = b'<html><head>\n<title>Message: &#147;Our Line&#146;s Been Changed Again&#148;</title>\n</head>\n<p>Message: &#147;Our Line&#146;s Been Changed Again&#148;</p>\n<p>But... \x93What Does It Mean?\x97Not Very Much.\x94 </p\n</body>\n</html>\n'

# Parse the bytes as Windows-1252, then serialize the tree as UTF-8 bytes.
print(bs4.BeautifulSoup(a, from_encoding='windows-1252').prettify('utf-8'))
###

"Now what's weird here is that the smart codes have been correctly transcoded in utf-8; however the HTML escaped sequences are mangled: \xc2\x93 is not a valid UTF-8 codepoint; but \x93 is the correct windows-1252 codepoint....

So somehow the escaped sequences have been - correctly - transcoded to windows-1252, but then incorrectly translated to UTF-8..."
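
The byte sequence quoted above can be reproduced without Beautiful Soup. The following is a minimal sketch using only Python 3 built-ins. It assumes the mangling comes from treating the number 147 as a Unicode code point rather than as a Windows-1252 byte. The variable names are illustrative and do not come from the original report.

```python
# Reading 147 (0x93) as a Unicode code point gives U+0093, a C1 control
# character, whose UTF-8 encoding is the two bytes C2 93.
as_code_point = chr(147)
mangled = as_code_point.encode("utf-8")

# Reading the same number as a Windows-1252 byte gives U+201C,
# the left double quotation mark the author intended.
as_cp1252 = bytes([147]).decode("windows-1252")
expected = as_cp1252.encode("utf-8")

print(mangled)   # b'\xc2\x93'
print(expected)  # b'\xe2\x80\x9c'
```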

Changed in beautifulsoup:
status: New → Confirmed
importance: Undecided → Medium
Leonard Richardson (leonardr) wrote :

HTML numeric entities are supposed to reference Unicode code points, but &#147; references a Windows-1252 code point.

html5lib already handles these entities correctly. html.parser passes them into handle_charref(), where I've added code to handle them (revision 471). lxml converts them to (the wrong) Unicode characters -- that could be fixed in data() but that's risky because we can't distinguish that from a document where those characters were in the original.
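
As a rough illustration of the kind of remapping described above, here is a minimal sketch and not the code from revision 471. The helper name `resolve_charref` is hypothetical. It takes the same argument shape that `html.parser.HTMLParser.handle_charref()` receives (for example `'147'` or `'x93'`) and assumes that any reference in the range 128 to 159 was meant as a Windows-1252 byte.

```python
def resolve_charref(name):
    """Return the character for a numeric reference body such as '147' or 'x93'.

    Hypothetical helper for illustration; not part of Beautiful Soup's API.
    """
    if name[:1] in ("x", "X"):
        code = int(name[1:], 16)
    else:
        code = int(name)

    # Assumption: 0x80-0x9F are C1 controls in Unicode, so a reference in
    # this range is far more likely to be a Windows-1252 byte value.
    if 0x80 <= code <= 0x9F:
        try:
            return bytes([code]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes in this range are undefined in Python's cp1252 codec.
            return "\ufffd"
    return chr(code)


print(repr(resolve_charref("147")))  # '\u201c' shown as a left curly quote
print(repr(resolve_charref("x97")))  # '\u2014' shown as a long dash
print(repr(resolve_charref("233")))  # 'é', outside the remapped range
```

The same caveat raised for lxml applies to this sketch: once a parser has already turned the reference into U+0093, a remapping like this can no longer tell it apart from a literal U+0093 in the source document.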

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released