Comment 1 for bug 1782933

Leonard Richardson (leonardr) wrote :

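HTML numeric entities are supposed to reference Unicode code points, but the entity in this report (e.g. &#147;, which is meant to render as “) references a Windows-1252 code point: byte 0x93 in Windows-1252, as opposed to U+201C in Unicode.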

html5lib already handles these entities correctly. html.parser passes them to handle_charref(), where I've added code to handle them (revision 471). lxml converts them to the wrong Unicode characters. That could be fixed in data(), but it's risky: we can't distinguish that case from a document where those characters were actually in the original.
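
For illustration, here is a minimal sketch of what handling these entities in handle_charref() could look like. It is not the code from revision 471; the names decode_charref, CharrefCollector and parse_text are hypothetical, and it assumes that any numeric reference in the range 128-159 was meant as a Windows-1252 byte.

```python
from html.parser import HTMLParser


def decode_charref(name):
    """Convert the text between '&#' and ';' into a character.

    Hypothetical helper: references in the C1 range (128-159) are
    assumed to be Windows-1252 bytes rather than Unicode code points.
    """
    if name.lower().startswith("x"):
        code = int(name[1:], 16)  # hexadecimal form, e.g. "x93"
    else:
        code = int(name)  # decimal form, e.g. "147"

    if 0x80 <= code <= 0x9F:
        try:
            return bytes([code]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes in this range have no Windows-1252 mapping;
            # fall through and treat the number as a code point.
            pass
    return chr(code)


class CharrefCollector(HTMLParser):
    """Hypothetical parser that collects text, decoding charrefs itself."""

    def __init__(self):
        # convert_charrefs=False makes html.parser call handle_charref()
        # instead of converting the references on its own.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, name):
        self.chunks.append(decode_charref(name))


def parse_text(markup):
    """Return the text content of markup with numeric entities decoded."""
    parser = CharrefCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

With this sketch, parse_text("<p>&#147;hi&#148;</p>") yields the text wrapped in U+201C and U+201D instead of the control characters U+0093 and U+0094. The same trick is not safe after the fact on lxml's output, because by then a literal U+0093 in the source and a converted &#147; look identical.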