lxml on macOS cannot parse HTML that contains emojis
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Fix Released | Low | scoder | | |
Bug Description
```html
<html>
<head>
<title>随机 Emoji 示例</title>
</head>
<body>
<p id="emojiParagr
</body>
</html>
```
For the HTML above, which contains emojis, lxml returns None:
```python
from lxml import etree
from mark import BASE_DIR
with open(BASE_
dom = etree.HTML(
print(dom)
```
The output is as follows:
```shell
None
```
If I delete the emojis:
```html
<html>
<head>
<title>随机 Emoji 示例</title>
</head>
<body>
<p id="emojiParagr
</body>
</html>
```
Running the same code again:
```python
from lxml import etree
from mark import BASE_DIR
with open(BASE_
dom = etree.HTML(
print(dom)
```
The output is as follows:
```shell
<Element html at 0x102d05a80>
```
So the problem is that lxml cannot parse web pages that contain emojis. It does not occur on Linux; only macOS is affected. On macOS it can be reproduced with any version of Python and any version of lxml.
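For anyone trying to reproduce this without the original file, here is a minimal self-contained sketch. The inline `html_text` string is a hypothetical stand-in for the reporter's file (its path is elided above), and the second call, which hands lxml UTF-8 bytes together with an `HTMLParser(encoding="utf-8")`, is only a comparison path. Whether it avoids the failure on an affected macOS build is an assumption, not something confirmed in this report.

```python
from lxml import etree

# Hypothetical sample document; NOT the reporter's original file.
html_text = (
    "<html><head><title>Emoji sample</title></head>"
    "<body><p id='emoji'>Hello \U0001F600</p></body></html>"
)

# Path used in the report: pass a str directly to etree.HTML().
dom_from_str = etree.HTML(html_text)
print("from str:  ", dom_from_str)

# Comparison path: encode to UTF-8 ourselves and name the encoding explicitly.
parser = etree.HTMLParser(encoding="utf-8")
dom_from_bytes = etree.HTML(html_text.encode("utf-8"), parser)
print("from bytes:", dom_from_bytes)
```

If the first line prints None while the second prints an element, that points at the handling of str input rather than at the emoji content itself.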
Changed in lxml: | |
status: | Fix Committed → Fix Released |
Is this on an ARM or x86 system? My guess would be ARM.
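One way to answer that is to print what the running interpreter reports, using only the standard library. This is a small sketch; `describe_platform` is a hypothetical helper name. On macOS, `platform.machine()` typically returns `arm64` on Apple Silicon and `x86_64` on Intel, but note that a Python running under Rosetta reports `x86_64` even on ARM hardware.

```python
import platform


def describe_platform() -> dict:
    """Collect the interpreter's view of the OS and CPU architecture."""
    return {
        "system": platform.system(),          # e.g. "Darwin" on macOS
        "machine": platform.machine(),        # e.g. "arm64" or "x86_64"
        "python": platform.python_version(),  # interpreter version string
    }


info = describe_platform()
print(info)
```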