Beautiful Soup

Smart quotes are inconsistently converted to Unicode characters

Bug #1782933 reported by Leonard Richardson on 2018-07-21

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Beautiful Soup	Fix Released	Medium	Unassigned

Bug Description

Taken from Guillaume Lepert's report at https://groups.google.com/forum/#!topic/beautifulsoup/laZLrma_W5I

###
import bs4
a = b'<html><head>\n<title>Message: Our Lines Been Changed Again</title>\n</head>\n<p>Message: Our Lines Been Changed Again</p>\n<p>But... \x93What Does It Mean?\x97Not Very Much.\x94 </p\n</body>\n</html>\n'

print(bs4.BeautifulSoup(a, from_encoding='windows-1252').prettify('utf-8'))
###

"Now what's weird here is that the smart codes have been correctly transcoded in utf-8; however the HTML escaped sequences are mangled: \xc2\x93 is not a valid UTF-8 codepoint; but \x93 is the correct windows-1252 codepoint....

So somehow the escaped sequences have been - correctly - transcoded to windows-1252, but then incorrectly translated to UTF-8..."
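The two byte sequences mentioned in the quoted report can be reproduced without Beautiful Soup. The following is a minimal sketch using only Python's built-in codecs; the variable names are illustrative and do not come from the report. It shows that \xc2\x93 is what results when 0x93 is treated as a Unicode code point rather than as a Windows-1252 byte.

```python
# Byte 0x93 is the left smart quote in Windows-1252.
raw = b"\x93"

# Decoding with the matching codec gives U+201C, which UTF-8 encodes as three bytes.
correct = raw.decode("windows-1252").encode("utf-8")

# Treating 0x93 as a Unicode code point gives U+0093, a C1 control character,
# which UTF-8 encodes as the two bytes quoted in the report.
mangled = chr(0x93).encode("utf-8")

print(correct)  # b'\xe2\x80\x9c'
print(mangled)  # b'\xc2\x93'
```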

Leonard Richardson (leonardr) on 2018-07-21

Changed in beautifulsoup:
status:	New → Confirmed
importance:	Undecided → Medium

Leonard Richardson (leonardr) wrote on 2018-07-28:

HTML numeric entities are supposed to reference Unicode code points, but the entities in this markup reference Windows-1252 code points (0x93, for example, is the Windows-1252 left smart quote).

html5lib already handles these entities correctly. html.parser passes them into handle_charref(), where I've added code to handle them (revision 471). lxml converts them to (the wrong) Unicode characters -- that could be fixed in data() but that's risky because we can't distinguish that from a document where those characters were in the original.
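As an illustration of the kind of handling described above, here is a rough sketch of resolving a numeric character reference so that values in the 0x80-0x9F range are read as Windows-1252 bytes. This is not the code from revision 471; the function name resolve_charref and its fallback behaviour are assumptions made for the example.

```python
def resolve_charref(name):
    """Resolve the text between '&#' and ';' to a single character.

    Hypothetical helper for illustration only, not part of Beautiful Soup.
    """
    # Hex references look like 'x93', decimal ones like '147'.
    if name.lower().startswith("x"):
        code_point = int(name[1:], 16)
    else:
        code_point = int(name, 10)

    # 0x80-0x9F are C1 control characters in Unicode, so a reference in this
    # range most likely meant the Windows-1252 character at that byte value.
    if 0x80 <= code_point <= 0x9F:
        try:
            return bytes([code_point]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes in this range are undefined in Windows-1252;
            # fall through and use the code point as given.
            pass

    return chr(code_point)


print(repr(resolve_charref("147")))   # left smart quote, U+201C
print(repr(resolve_charref("x97")))   # em dash, U+2014
print(repr(resolve_charref("8220")))  # already a Unicode code point, unchanged
```

A sketch like this only applies where the parser exposes the reference before resolving it, as html.parser does through handle_charref(). Once a parser has already turned the reference into a character, that character cannot be told apart from one that was literally present in the document, which is the risk noted for lxml.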

Changed in beautifulsoup:
status:	Confirmed → Fix Committed

Leonard Richardson (leonardr) on 2018-07-28

Changed in beautifulsoup:
status:	Fix Committed → Fix Released
