HTML numeric entities are supposed to reference Unicode code points, but an entity like &#147; references a Windows-1252 code point (0x93, the left curly quote “).
html5lib already handles these entities correctly. html.parser passes them to handle_charref(), where I've added code to handle them (revision 471). lxml converts them to the wrong Unicode characters. That could be fixed in data(), but it's risky: we can't distinguish that case from a document where those characters were in the original.
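
For illustration, here is a minimal sketch of the kind of handle_charref() fix described above. It is not the code from revision 471; the names CharrefFixingParser and decode_charref are hypothetical, and it assumes that any numeric reference in the range 0x80-0x9F was meant as a Windows-1252 byte.

```python
from html.parser import HTMLParser


def decode_charref(codepoint):
    """Turn a numeric character reference into text.

    Assumption: values 0x80-0x9F are Windows-1252 bytes, not the C1
    control characters that Unicode assigns to that range.
    """
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in Windows-1252;
            # fall through and keep the literal code point.
            pass
    return chr(codepoint)


class CharrefFixingParser(HTMLParser):
    """Hypothetical parser that collects text with charrefs repaired."""

    def __init__(self):
        # convert_charrefs=False makes html.parser call handle_charref()
        # instead of resolving the references itself.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, name):
        # name is "147" for &#147; and "x93" for &#x93;
        if name.lower().startswith("x"):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.chunks.append(decode_charref(codepoint))


parser = CharrefFixingParser()
parser.feed("<p>&#147;Hi&#x94;</p>")
parser.close()
text = "".join(parser.chunks)
```

This works in html.parser because the parser reports the reference before it is decoded, so nothing is lost. The same remapping applied after the fact, as it would be in lxml's data(), would also rewrite any U+0080-U+009F characters that were genuinely in the source document.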