lxml.etree.fromstring fails silently when the input includes U+1F4C2 'OPEN FILE FOLDER'

Bug #2002887 reported by Jouni K. Seppänen
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

Version information: Python 3.11 on macOS 13.1.

Python : sys.version_info(major=3, minor=11, micro=0, releaselevel='final', serial=0)
lxml.etree : (4, 9, 2, 0)
libxml used : (2, 9, 13)
libxml compiled : (2, 9, 13)
libxslt used : (1, 1, 35)
libxslt compiled : (1, 1, 35)

In the following interaction in IPython, lxml.etree.fromstring fails silently (returning None) when the input includes the Unicode character 📂 (U+1F4C2 'OPEN FILE FOLDER'). It works fine when the input includes e.g. → (U+2192 'RIGHTWARDS ARROW') or when it includes the folder character as a numeric reference.

In [3]: import lxml.html

In [4]: import lxml.etree

In [5]: parser = lxml.html.HTMLParser(remove_blank_text=True)

In [6]: lxml.etree.fromstring('<html>hello</html>', parser)
Out[6]: <Element html at 0x1026c29e0>

In [7]: lxml.etree.fromstring('<html>hello 📂</html>', parser)

In [8]: lxml.etree.fromstring('<html>hello →</html>', parser)
Out[8]: <Element html at 0x1029e6c10>

In [9]: lxml.etree.fromstring('<html>hello &#x1f4c2;</html>', parser)
Out[9]: <Element html at 0x102aa6c10>
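
One observable difference between the two characters in the session above is that 📂 lies outside the Basic Multilingual Plane, while → lies inside it. The following stdlib-only sketch prints that difference. Whether it is related to the failure is an assumption and is not confirmed anywhere in this report; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Summarise how a single character is represented (illustrative helper)."""
    cp = ord(ch)
    return {
        "codepoint": f"U+{cp:04X}",
        "name": unicodedata.name(ch),
        # Code points above U+FFFF are outside the Basic Multilingual Plane.
        "in_bmp": cp <= 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; non-BMP characters need two units.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


for ch in ("\u2192", "\U0001F4C2"):
    print(describe(ch))
```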

I am attaching a test script that runs essentially the same code, except that the failing line comes last. For me, the script outputs:

Traceback (most recent call last):
  File "/private/tmp/sample.py", line 8, in <module>
    assert lxml.etree.fromstring('<html>hello 📂</html>', parser) is not None
AssertionError
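
The attached script is not reproduced in this report. As a rough sketch of a variant that could help narrow the problem down, the code below parses each input from the session both as `str` and as UTF-8 `bytes` and records whether a tree came back. The names `SAMPLES`, `parse_with_lxml` and `report` are hypothetical, and comparing `str` against `bytes` is an added diagnostic step, not something the reporter did; no particular outcome is claimed.

```python
# The four inputs from the IPython session, with the failing one last.
SAMPLES = {
    "ascii": "<html>hello</html>",
    "arrow": "<html>hello \u2192</html>",
    "folder_ref": "<html>hello &#x1f4c2;</html>",
    "folder": "<html>hello \U0001F4C2</html>",
}


def parse_with_lxml(markup):
    """Parse markup the same way as the session in this report."""
    # Imported lazily so the rest of the sketch works without lxml installed.
    import lxml.etree
    import lxml.html

    parser = lxml.html.HTMLParser(remove_blank_text=True)
    return lxml.etree.fromstring(markup, parser)


def report(samples, parse):
    """Return {(label, input_kind): True if parse() returned a tree}."""
    results = {}
    for label, text in samples.items():
        for kind, markup in (("str", text), ("bytes", text.encode("utf-8"))):
            results[(label, kind)] = parse(markup) is not None
    return results
```

Calling `report(SAMPLES, parse_with_lxml)` on the affected installation and printing the resulting dictionary would show which combinations return None.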

Jouni K. Seppänen (jks) wrote :
scoder (scoder) wrote :

Probably due to the missing Py3.11 wheel (I guess you built your local installation yourself).
