lxml on macOS cannot parse HTML that contains emojis

Bug #2046208 reported by ponponon
This bug affects 1 person
Affects: lxml
Status: Fix Released
Importance: Low
Assigned to: scoder

Bug Description

```html
<html>

<head>
    <title>随机 Emoji 示例</title>
</head>

<body>
    <p id="emojiParagraph">😄 这是一个带有随机 Emoji 的段落: </p>

</body>

</html>
```

For the HTML above, which contains an emoji, lxml returns None:

```python
from lxml import etree
from mark import BASE_DIR

with open(BASE_DIR/'123.html', 'r', encoding='utf-8') as file:
    dom = etree.HTML(file.read())

    print(dom)
```

The output is as follows:

```shell
None
```

If I delete the emoji:

```html
<html>

<head>
    <title>随机 Emoji 示例</title>
</head>

<body>
    <p id="emojiParagraph"> 这是一个带有随机 Emoji 的段落: </p>

</body>

</html>
```

and run the same code again:

```python
from lxml import etree
from mark import BASE_DIR

with open(BASE_DIR/'123.html', 'r', encoding='utf-8') as file:
    dom = etree.HTML(file.read())

    print(dom)
```

The output is as follows:

```shell
<Element html at 0x102d05a80>
```

So the problem is that lxml cannot parse web pages that contain emojis. The problem does not occur on Linux; only macOS is affected. It can be reproduced on macOS with any version of Python and any version of lxml.

scoder (scoder) wrote :

Is this on an ARM or x86 system? My guess would be ARM.

Changed in lxml:
status: New → Triaged
importance: Undecided → Low
scoder (scoder) wrote :

Could you print the first four characters in the string that you pass into the HTML() function? Does your string use a BOM?
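
A minimal sketch of how that check could be done, assuming the text has already been read into a Python string. The helper name `describe_prefix` is hypothetical and not part of lxml:

```python
def describe_prefix(text, count=4):
    """Return the repr of the first `count` characters and a BOM flag.

    Hypothetical helper for illustration only. The repr makes
    invisible characters such as U+FEFF (the byte order mark) visible.
    """
    prefix = text[:count]
    has_bom = text.startswith("\ufeff")
    return repr(prefix), has_bom


if __name__ == "__main__":
    # Stand-in for the string read from the reporter's 123.html file.
    sample = "<html>\n\n<head>\n"
    print(describe_prefix(sample))
```

If the flag comes back true, opening the file with `encoding='utf-8-sig'` instead of `'utf-8'` makes Python strip the BOM while decoding.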

scoder (scoder) wrote :

Confirmed on macOS, but not in general. It fails on CI with the system library libxml2 2.9.4. When I build a static wheel with libxml2 2.12.3, it works.
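
Since the behaviour depends on the libxml2 version, it can help to check which one a given lxml installation actually uses. A small sketch, assuming only that lxml is importable. The function name `describe_lxml_build` is made up for this example; the version constants are the ones lxml exposes on `lxml.etree`:

```python
from lxml import etree


def describe_lxml_build():
    """Collect the lxml and libxml2 version tuples (illustrative helper)."""
    return {
        "lxml": etree.LXML_VERSION,
        # libxml2 version that is loaded at runtime
        "libxml2_runtime": etree.LIBXML_VERSION,
        # libxml2 version that lxml was compiled against
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
    }


if __name__ == "__main__":
    for name, version in describe_lxml_build().items():
        print(name, ".".join(str(part) for part in version))
```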

BTW, a possible work-around is probably to encode the text string to UTF-8 and pass that into the HTML() function.
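
A sketch of that work-around, not a confirmed fix. The function name `parse_html_text` is hypothetical. Passing an `HTMLParser` with an explicit `encoding` is an addition that goes beyond the suggestion above; the assumption is that it keeps the parser from guessing the encoding when the document does not declare a charset:

```python
from lxml import etree


def parse_html_text(text):
    """Parse an HTML string by handing lxml UTF-8 bytes instead of str.

    Illustrative helper: the explicit parser encoding is an assumption
    added here so that the bytes are not decoded as a different charset.
    """
    parser = etree.HTMLParser(encoding="utf-8")
    return etree.HTML(text.encode("utf-8"), parser=parser)


if __name__ == "__main__":
    html = '<html><body><p id="emojiParagraph">😄 paragraph</p></body></html>'
    root = parse_html_text(html)
    print(root)
    print(root.findtext(".//p"))
```

With the reporter's script, this would mean replacing `etree.HTML(file.read())` by `parse_html_text(file.read())`, or opening the file in binary mode and passing the bytes directly.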

I think I've found a fix for 5.0.1.

https://github.com/lxml/lxml/commit/d83c37ddbde9285e58bd6f7bce73b4f31be4a1f2

Changed in lxml:
assignee: nobody → scoder (scoder)
milestone: none → 5.0.1
status: Triaged → Fix Committed
scoder (scoder)
Changed in lxml:
status: Fix Committed → Fix Released