Beautiful Soup

Comment 1 for bug 1782933

Leonard Richardson (leonardr) wrote on 2018-07-28:

HTML numeric entities are supposed to reference Unicode code points, but the entity in this report references a Windows-1252 code point.

html5lib already handles these entities correctly. html.parser passes them into handle_charref(), where I've added code to handle them (revision 471). lxml converts them to (the wrong) Unicode characters -- that could be fixed in data() but that's risky because we can't distinguish that from a document where those characters were in the original.
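
For illustration, here is a minimal sketch of the kind of remapping a handle_charref()-style hook could do. It is not the code from revision 471; the function name resolve_charref and the 0x80-0x9F range check are assumptions made for this example. It treats a numeric reference in that range as a Windows-1252 byte and falls back to the plain Unicode code point otherwise.

```python
def resolve_charref(name):
    """Resolve the text between '&#' and ';' to a string.

    Hypothetical helper for illustration only, not Beautiful Soup's code.
    """
    # Hex references look like 'x93' or 'X93'; decimal ones like '147'.
    if name.lower().startswith("x"):
        codepoint = int(name[1:], 16)
    else:
        codepoint = int(name)

    # Assumption: 0x80-0x9F are C1 control characters in Unicode, so a
    # reference in this range is read as a Windows-1252 byte instead.
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) have no Windows-1252 mapping in
            # Python's codec; fall through to the Unicode code point.
            pass
    return chr(codepoint)


print(repr(resolve_charref("147")))   # Windows-1252 0x93, a curly quote
print(repr(resolve_charref("8220")))  # the same quote by its Unicode number
```

This only works at the point where the parser still hands over the raw reference. Once a parser has already turned the reference into a character, as lxml does, the same remapping can't tell a converted entity from a character that was really in the document.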