Fails to handle some Unicode characters (like 😺) on macOS

Bug #2019038 reported by Ludovic Rousseau
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

On macOS, the following code (also attached) does not work correctly:

--------------------------------------
import lxml.html
html = """<!DOCTYPE html>
<head><meta charset="utf-8"></head>
<body>
<h1>Hello, world!</h1>
<div>
<p>\U0001f63a</p>
</div>
</body>
</html>"""

parser = lxml.html.HTMLParser(remove_blank_text=True)
doc = lxml.html.document_fromstring(html, parser)
data = lxml.html.tostring(doc, encoding='utf8', method='html', pretty_print=True, doctype='<!DOCTYPE html>')
print(data)
--------------------------------------

On macOS I get:
>>> print(data)
b'<!DOCTYPE html>\n<html><body><p>! D O C T Y P E h t m l &gt; \n </p></body></html>\n'

On Debian GNU/Linux I get the expected result:
>>> print(data)
b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8"></head>\n<body>\n<h1>Hello, world!</h1>\n<div>\n<p>\xf0\x9f\x98\xba</p>\n</div>\n</body>\n</html>\n'
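For readers comparing the two outputs: the bytes `\xf0\x9f\x98\xba` in the Debian result are the UTF-8 encoding of U+1F63A (😺), a code point outside the Basic Multilingual Plane. The short check below uses only the standard library and does not involve lxml; the variable names `cat` and `utf8` are illustrative and not part of the report.

```python
# The character from the <p> element in the reproducer above.
cat = "\U0001f63a"

# It is a single code point above U+FFFF, i.e. outside the
# Basic Multilingual Plane.
assert len(cat) == 1
assert ord(cat) == 0x1F63A
assert ord(cat) > 0xFFFF

# Its UTF-8 encoding is the four-byte sequence that appears inside
# <p>...</p> in the Debian output.
utf8 = cat.encode("utf-8")
print(utf8)
```

This only confirms that the Linux output is the correct serialization of the input. It does not explain why the macOS build produces different output.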

I found the problem using Nikola and reported it at https://github.com/getnikola/nikola/issues/3686

Configuration:
Python : sys.version_info(major=3, minor=11, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 9, 2, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

I use Python from Homebrew and lxml installed in a venv.
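
A version listing in the layout of the Configuration block can be produced with a few lines of Python, which may help anyone comparing their own setup against this one. This is a minimal sketch that assumes lxml is importable in the active environment. The helper name `collect_versions` is hypothetical, and the report does not say how its listing was generated.

```python
import sys


def collect_versions():
    """Return (label, value) pairs in the layout of the Configuration block."""
    rows = [("Python", sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed in this environment; report Python only.
        return rows
    rows += [
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return rows


for label, value in collect_versions():
    print("%-20s: %s" % (label, value))
```

Running it inside the same venv as the reproducer shows which libxml2 the installed lxml build is actually using, which can differ between a macOS and a Debian installation.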
