Fails to handle some Unicode characters (like 😺) on macOS
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | New | Undecided | Unassigned | | |
Bug Description
On macOS, the following (and attached) code does not work correctly:
-------
import lxml.html
html = """<!DOCTYPE html>
<head><meta charset=
<body>
<h1>Hello, world!</h1>
<div>
<p>\U0001f63a</p>
</div>
</body>
</html>"""
parser = lxml.html.
doc = lxml.html.
data = lxml.html.
print(data)
-------
On macOS I get:
>>> print(data)
b'<!DOCTYPE html>\n<
On Debian GNU/Linux I get the expected result:
>>> print(data)
b'<!DOCTYPE html>\n<
I found the problem using Nikola and reported it at https:/
Configuration:
Python : sys.version_
lxml.etree : (4, 9, 2, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
I use Python from Homebrew and lxml installed in a venv.
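
For anyone comparing the two outputs above, it may help to know what a correct serialization of the test character can look like. The sketch below uses only the Python standard library (no lxml) and shows the UTF-8 bytes, the UTF-16 surrogate pair, and the numeric character reference for U+1F63A. The variable names are my own, and this is not a diagnosis: it only illustrates that the character lies outside the Basic Multilingual Plane, which may or may not be related to the behaviour seen with libxml 2.9.4.

```python
# Inspect the character used in the report (U+1F63A), standard library only.
# Variable names are illustrative and do not come from the original script.
ch = "\U0001f63a"

code_point = ord(ch)

# Characters above U+FFFF lie outside the Basic Multilingual Plane.
outside_bmp = code_point > 0xFFFF

# Raw UTF-8 encoding: four bytes for a code point in this range.
utf8 = ch.encode("utf-8")

# UTF-16 needs a surrogate pair (two 16-bit units) for the same character.
utf16 = ch.encode("utf-16-be")

# An ASCII-only serialization as a single numeric character reference.
char_ref = ch.encode("ascii", "xmlcharrefreplace")

print(hex(code_point))   # 0x1f63a
print(outside_bmp)       # True
print(utf8)              # b'\xf0\x9f\x98\xba'
print(utf16.hex())       # d83dde3a
print(char_ref)          # b'&#128570;'
```

If the macOS output contains something other than the UTF-8 bytes or the single reference `&#128570;` for this character, that difference is probably the most useful detail to quote when comparing against the Debian result.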