lxml.html.document_fromstring fails with certain emojis

Bug #1949271 reported by Aloisio R
This bug affects 4 people
Affects   Status   Importance   Assigned to   Milestone
lxml      New      Undecided    Unassigned

Bug Description

Python : sys.version_info(major=3, minor=10, micro=0, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

lxml 4.6.3 on macOS Mojave (10.14.6) fails to parse HTML input, passed either as a Unicode str or as UTF-8 bytes, when it contains certain emojis (ZWJ sequences).

This pytest test script (from my test suite) will pass if the bug is present:

import pytest
from lxml.html import document_fromstring

def test_lxml_463_emoji_bug():
    def assert_emoji_parsing(transform):
        # Woman Facepalming Emoji
        # See https://unicode.org/emoji/charts/full-emoji-list.html
        content = u'<p>\U0001F926\u200D\u2640\uFE0F</p>'
        doc = document_fromstring(transform(content))
        assert doc[0][0].text == u'\U0001F926\u200D\u2640\uFE0F'

    with pytest.raises(Exception):
        assert_emoji_parsing(lambda c: c)
    with pytest.raises(Exception):
        assert_emoji_parsing(lambda c: c.encode('utf-8'))
    assert_emoji_parsing(lambda c: c.encode('utf-16'))
    assert_emoji_parsing(lambda c: c.encode('utf-32'))

Note that a workaround for this issue is to pass UTF-16 or UTF-32 bytes instead.
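
One possible reason the UTF-16 and UTF-32 inputs behave differently (an assumption, not something confirmed in this report) is that Python's `utf-16` and `utf-32` codecs prepend a byte order mark, which gives the parser an explicit encoding signal, while `utf-8` does not. The sketch below only inspects the encoded bytes with the standard library; it does not call lxml.

```python
import codecs

# Same test content as in the script above (woman facepalming, a ZWJ sequence).
content = "<p>\U0001F926\u200D\u2640\uFE0F</p>"

utf8 = content.encode("utf-8")
utf16 = content.encode("utf-16")
utf32 = content.encode("utf-32")

# The generic "utf-16"/"utf-32" codecs write a BOM in native byte order;
# plain "utf-8" writes none.
print("utf-8 has BOM: ", utf8.startswith(codecs.BOM_UTF8))
print("utf-16 has BOM:", utf16.startswith(codecs.BOM_UTF16))
print("utf-32 has BOM:", utf32.startswith(codecs.BOM_UTF32))
```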

Revision history for this message
James Addison (jaddison) wrote :

I've encountered similar behaviour on macOS 11.7 (Big Sur) when parsing an example UTF-8 encoded HTML file that contains at least two multibyte characters.

One detail learned while attempting to narrow down the cause: the problem disappears when the 'lxml' dependency is installed from a binary wheel.

A near-minimal repro case is available at https://github.com/jayaddison/macos-lxml-issue-repro.git/

Revision history for this message
Ohad Livne (ohadlb) wrote :

Python: sys.version_info(major=3, minor=10, micro=8, releaselevel='final', serial=0)
OS: macOS Monterey 12.6 (Intel processor)
lxml.etree: 4.9.1

I see somewhat different behaviour, but it might be related. With

doc = '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>\U0001f44b</body></html>'
html.fromstring(doc).text_content()

I get the output

'! D O C T Y P E h t m l > '

I only see this with lxml 4.9.1, while earlier versions (I tried from 4.7.0 to 4.9.0) parse the data correctly. It's also macOS specific - a Linux installation with lxml 4.9.1 doesn't have this problem.

Revision history for this message
James Addison (jaddison) wrote :

Ohad: yep, that seems potentially related to me. I saw the same behaviour (ASCII characters from the original input, separated by spaces, in the parsed text output) when using locally-compiled versions of lxml on OSX during investigation of the bug.

Revision history for this message
scoder (scoder) wrote :

This might be related to the way lxml is installed, whether the installation uses a PyPI wheel (which should hopefully work) or builds its own one locally (which may not use the same library versions). Could you check which is the case for you? If a wheel is built locally, then the build log should indicate the library versions used.
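
For anyone checking their own installation, the version tuples quoted in the bug description can be printed with a few lines. This is a minimal sketch: the helper name `lxml_versions` is hypothetical, and it returns an empty dict if lxml is not importable. A mismatch between the "used" and "compiled" versions, or an old libxml2 such as 2.9.4, may point to a local build rather than a PyPI wheel.

```python
def lxml_versions():
    """Return lxml, libxml2 and libxslt version tuples, or {} without lxml."""
    try:
        from lxml import etree
    except ImportError:
        return {}
    return {
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


for name, version in lxml_versions().items():
    print(f"{name:17}: {version}")
```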

Revision history for this message
Ohad Livne (ohadlb) wrote :

scoder: As you predicted, 4.9.1 is built locally while earlier versions are installed as wheels.

The build uses the libraries libxml2 2.9.4 and libxslt 1.1.29. I uploaded the full log to https://paste.ubuntu.com/p/49msGfnKtB/

Revision history for this message
Ohad Livne (ohadlb) wrote :

I just realized that I have libxml2 2.10.3 and libxslt 1.1.37 installed on my system, which don't match the versions used during the local compilation, so that could be a problem.

I'm currently trying to get the proper include files picked up in a rebuild.

Revision history for this message
Ohad Livne (ohadlb) wrote :

Never mind. These versions weren't actually installed, and the build seems to link the libraries statically anyway. (Is this correct? At least they don't appear in the output of otool -L.)

Revision history for this message
Arthur Rio (arthurio88) wrote :

I can also confirm this bug on macOS 13.1, Python 3.10.7, lxml==4.9.1. Let me know if you need more details.

Revision history for this message
Tom Ritchford (tom.swirly) wrote :

I can also reproduce this on lxml 4.9.2, macOS 12.6.2, Python 3.11.1, with a very small example.

This took me too long to figure out. The fact that etree.HTML(s) sometimes returns `None` and sometimes returns a broken HTML object is a bit confounding.

**There's a near-trivial workaround - call `.encode()` on any `str` that you pass in.**

Here's the code to reproduce, which also shows the workaround.

    from lxml import etree

    def round_trip(s):
        html = etree.HTML(s)
        assert html is not None, 'etree.HTML(s) returned None!'

        result = etree.tostring(html, pretty_print=True, method="html")

        print('Result:')
        print(result.decode())

    def compare_both():
        p1 = '<html><head><title>ANT</title></head><body></body></html>'

        # Works
        round_trip(p1)
        round_trip('\n' + p1)

        p2 = '<html><head><title>🐜</title></head><body></body></html>'

        # The workaround!
        round_trip(p2.encode())
        round_trip(('\n' + p2).encode())

        # Fails
        round_trip(p2) # Wrong answer
        round_trip('\n' + p2) # etree.HTML(s) returns None

and here are the results:

    Result:
    <html>
    <head><title>ANT</title></head>
    <body></body>
    </html>

    Result:
    <html>
    <head><title>ANT</title></head>
    <body></body>
    </html>

    Result:
    <html>
    <head><title>&#240;&#159;&#144;&#156;</title></head>
    <body></body>
    </html>

    Result:
    <html>
    <head><title>&#240;&#159;&#144;&#156;</title></head>
    <body></body>
    </html>

    Result:
    <html><body><p>h t m l &gt; </p></body></html>

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/Users/tom/synthetic/code/multi/multi/tweak_index.py", line 93, in <module>
        compare_both()
      File "/Users/tom/synthetic/code/multi/multi/tweak_index.py", line 64, in compare_both
        round_trip('\n' + p2) # Returns None
        ^^^^^^^^^^^^^^^^^^^^^
      File "/Users/tom/synthetic/code/multi/multi/tweak_index.py", line 41, in round_trip
        assert html is not None, 'etree.HTML(s) returned None!'
               ^^^^^^^^^^^^^^^^
    AssertionError: etree.HTML(s) returned None!
            0.03 real 0.02 user 0.00 sys

Revision history for this message
Tom Ritchford (tom.swirly) wrote :

I just discovered that, unfortunately, calling `.encode()` is not a full solution, because the resulting text (`&#240;&#159;&#144;&#156;`, for example) does not render correctly in the browser.
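
The four numbers in that output are exactly the UTF-8 bytes of the ant emoji, each written as its own character reference. That is consistent with the bytes being read one byte per character (as Latin-1 would be), though the report does not confirm the cause. A short standard-library sketch showing the correspondence:

```python
ant = "\U0001F41C"  # the ant emoji used in the reproduction above

# UTF-8 encodes this code point as four bytes.
utf8_bytes = ant.encode("utf-8")

# Writing each byte as a separate numeric character reference reproduces
# the text seen in the serialized output.
as_char_refs = "".join(f"&#{b};" for b in utf8_bytes)
print(as_char_refs)

# Decoding the same bytes as Latin-1 gives four unrelated characters,
# which is why a browser does not show the emoji.
mojibake = utf8_bytes.decode("latin-1")
print(len(mojibake))
```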

I'll report back once more when I have a final workaround.

Revision history for this message
Tom Ritchford (tom.swirly) wrote :

This turned out to be considerable work, though perhaps a library I didn't know about already exists to convert the misencoded characters.

Code is here:

    import re
    from lxml import etree

    def run(s: str):
        if isinstance(s, str):
            s = s.encode()
        html = etree.HTML(s)
        t = etree.tostring(html, pretty_print=True, method="html")
        return fix_non_ascii(t.decode())

    def fix_non_ascii(s):
        # Example: &#237;&#156;&#132;&#237;&#156;&#136;&#240;&#159;&#152;&#134;

        def replace(m):
            parts = [int(i.strip(';')) for i in m.group(0).split('&#') if i]
            return ''.join(to_chars(parts))

        pat = r'(&#[12]\d\d;)+'
        return re.sub(pat, replace, s)

    def to_chars(parts):
        parts = parts[::-1] # So we can pop from the end!
        while parts:
            a, b = parts.pop(), parts.pop()
            if a < 0xE0:
                yield chr(b + 0x40 * (a - 0xC2))

            elif a < 0xF0:
                c = parts.pop()
                yield chr(
                    0x800 + (c - 0x80)
                    + 0x40 * (
                        (b - 0xA0)
                        + 0x40 * (a - 0xE0)
                    )
                )
            else:
                c, d = parts.pop(), parts.pop()

                yield chr(
                    0x010000 + (d - 0x80)
                    + 0x40 * ((c - 0x80)
                        + 0x40 * ((b - 0x90)
                            + 0x40 * (a - 0xF0)
                        )
                    )
                )
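
As a possible simplification of the manual bit arithmetic above, the standard library can do the UTF-8 decoding itself. The sketch below is an alternative, not the code from this comment: `fix_byte_char_refs` and `_BYTE_REF_RUN` are hypothetical names, and it assumes every run of references in the range 128 to 255 that forms valid UTF-8 is a misencoded character. A genuine sequence such as `&#195;&#169;` would therefore also be rewritten.

```python
import re

# One or more consecutive numeric character references in the range 128-255,
# i.e. values that could be the bytes of a multi-byte UTF-8 sequence.
_BYTE_REF_RUN = re.compile(r"(?:&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);)+")


def fix_byte_char_refs(markup: str) -> str:
    """Replace runs of per-byte character references with the decoded text."""

    def replace(match):
        values = [int(n) for n in re.findall(r"&#(\d+);", match.group(0))]
        try:
            return bytes(values).decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 (for example a lone &#233;): leave it untouched.
            return match.group(0)

    return _BYTE_REF_RUN.sub(replace, markup)


print(fix_byte_char_refs("<title>&#240;&#159;&#144;&#156;</title>"))
```

Runs that do not decode as UTF-8 are returned unchanged, so ordinary single references to Latin-1 characters survive.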

Revision history for this message
X (zfhdk) wrote :

I may have the same problem. Here is some minimal code to see the bug:

from lxml import html
from lxml import etree
root = html.fragment_fromstring("<p>🐻</p>")
print(etree.tostring(root))

Running it with Python 3.9 and Python 3.11, I got different results (both with lxml 4.9.2). The output from Python 3.9 is the correct one:

% python3.9 test.py
b'<p>&#128059;</p>'

% python3.11 test.py
b'<p>h t m l &gt; </p>'

Revision history for this message
X (zfhdk) wrote :

This simply throws an error in Python 3.11 (lxml 4.9.2) and works fine in Python 3.9 (lxml 4.9.2):

from lxml import etree
root = etree.fromstring("<p>🐻</p>")

Revision history for this message
Mike Edmunds (medmunds) wrote :

A workaround seems to be forcing the input to ASCII, with non-ASCII characters replaced by XML numeric character references:

>>> from lxml import etree
>>> root = etree.fromstring("<p>🐻</p>")
...
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

>>> root = etree.fromstring("<p>🐻</p>".encode("ascii", "xmlcharrefreplace").decode("ascii"))
>>> root.text
'🐻'

(lxml 4.9.2, Python 3.11.1)
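
The escaping step in that workaround can be checked without lxml. A minimal sketch, with `to_ascii_markup` as a hypothetical helper name; it only shows what string the parser would receive, not how an affected lxml build handles it:

```python
import html


def to_ascii_markup(markup: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")


escaped = to_ascii_markup("<p>\U0001F43B</p>")
print(escaped)

# html.unescape resolves the reference again, so no information is lost.
print(html.unescape(escaped) == "<p>\U0001F43B</p>")
```

The result is pure ASCII, so the parser never sees a multi-byte character in its input.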
