XMLSyntaxError: Char 0x0 out of allowed range for \U sequence on Python 3.9

Bug #1902364 reported by Andreas Maier
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

On Python 3.9, the following code fails. With Python 3.8 and earlier versions down to Python 3.4 and Python 2.7, that code worked fine:

    from lxml import etree
    xml_str = u'<FOO>\U00010142</FOO>'
    xml_obj = etree.fromstring(xml_str)

Traceback (most recent call last):
  File "/Users/maiera/PycharmProjects/pywbem/pywbem/tmp_issues/lxml_char0.py", line 8, in <module>
    xml_obj = etree.fromstring(xml_str)
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

This appears to be related to the upper-case "\U" escape, which here denotes a character outside the Basic Multilingual Plane (U+10142). The error does not happen when a lower-case "\u" escape is used.
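
For illustration, here is a minimal sketch of how the two kinds of characters differ. It only shows the difference in code point range and encoded size; it does not establish the cause of the parser error. The helper name `describe` is made up for this example.

```python
# Compare a BMP character (written with \u) and a non-BMP one (written with \U).
bmp_char = u'\u00E9'          # U+00E9, inside the Basic Multilingual Plane
astral_char = u'\U00010142'   # U+10142, outside the Basic Multilingual Plane


def describe(ch):
    """Return (code point, outside BMP?, UTF-8 byte length, UTF-16 byte length).

    Hypothetical helper for this example; assumes Python 3.
    """
    cp = ord(ch)
    return (hex(cp), cp > 0xFFFF,
            len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))


print(describe(bmp_char))
print(describe(astral_char))
```

The non-BMP character needs four bytes in UTF-8 and a surrogate pair (four bytes) in UTF-16, whereas the BMP character fits in two bytes in both.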

Versions on Python 3.9 where it fails:

Python : sys.version_info(major=3, minor=9, micro=0, releaselevel='final', serial=0)
lxml.etree : (4, 6, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

Platform: macOS 10.15.7
Python implementation: CPython

Andreas Maier (maiera) wrote :

There are additional, different errors with UCS-2 characters on PyPy3.

I have expanded the test program a little:

#!/usr/bin/env python

import sys
import platform
import traceback
from lxml import etree

def test_fromstring(test, xml_src):
    xml_str = eval(xml_src)
    print("\nExecuting test {}: etree.fromstring({}) (evaluated: {!r})".
          format(test, xml_src, xml_str))
    try:
        xml_obj = etree.fromstring(xml_str)
    except Exception:
        print("Failed, traceback follows:")
        traceback.print_exc()
    else:
        print("Success")

test_fromstring("text2", "u'<FOO>\\u00E9</FOO>'")
test_fromstring("attr2", "u'<FOO NAME=\"\\u00E9\"/>'")
test_fromstring("text4", "u'<FOO>\\U00010142</FOO>'")
test_fromstring("attr4", "u'<FOO NAME=\"\\U00010142\"/>'")

print("\nVersions:")
print("%-20s: %s" % ('Platform system', platform.system()))
print("%-20s: %s" % ('Platform release', platform.release()))
print("%-20s: %s" % ('Python impl.', platform.python_implementation()))
print("%-20s: %s" % ('Python impl. version', getattr(sys, 'pypy_version_info', platform.python_revision())))
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

#--- end of test program

The results are as follows (all on macOS, with the libxml2 and libxslt versions given in the original bug description):

Impl.    Python  lxml   text2  attr2  text4  attr4
---------------------------------------------------
PyPy     2.7.13  4.6.1  ERR1   ERR2   ERR3   ERR3
PyPy     3.6.9   4.6.1  ERR1   ERR2   ERR3   ERR3
CPython  2.7.16  4.6.1  SUCC   SUCC   SUCC   SUCC
CPython  3.8.6   4.6.1  SUCC   SUCC   SUCC   SUCC
CPython  3.9.0   4.6.1  SUCC   SUCC   ERR3   ERR3

PyPy     2.7.13  4.5.2  SUCC   SUCC   SUCC   SUCC  (!)
PyPy     3.6.9   4.5.2  ERR1   ERR2   ERR3   ERR3
CPython  3.9.0   4.5.2  SUCC   SUCC   ERR3   ERR3

PyPy     2.7.13  3.8.0  SUCC   SUCC   SUCC   SUCC
PyPy     3.6.9   3.8.0  fails upon "from lxml import etree"

The errors mentioned in the table above are:

ERR1: lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range
ERR2: lxml.etree.XMLSyntaxError: expected '>'
ERR3: lxml.etree.XMLSyntaxError: attributes construct error

Andreas Maier (maiera) wrote :

One more comment: if the Python unicode strings are encoded to binary strings as UTF-8, all of the test cases succeed on all of the implementations and versions listed in the previous comment.

I still think that raising these exceptions on unicode strings is not acceptable. After all, the documentation states for the XML() function that both unicode and binary strings are accepted as input, and for the fromstring() function it is silent about any restrictions.

scoder (scoder) wrote :

You're using libxml2 2.9.4, apparently. Could you try the released wheel? It comes with the latest libxml2.

Changed in lxml:
status: New → Triaged
Andreas Maier (maiera) wrote :

By "released wheel", are you referring to a Python package or to an OS-level package that contains the wheel archive?

On PyPI, I found 'libxml2-python3', and installing that fails due to missing headers (on Python 3.9.0 on macOS):

$ pip install libxml2-python3
Collecting libxml2-python3
  Downloading libxml2-python3-2.9.5.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 2.9 MB/s
    ERROR: Command errored out with exit status 1:
     command: /Users/maiera/virtualenvs/pywbem39/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/lh/v0_07k9d7dbfqdytfzzxks3r0000gn/T/pip-install-koagvkuh/libxml2-python3/setup.py'"'"'; __file__='"'"'/private/var/folders/lh/v0_07k9d7dbfqdytfzzxks3r0000gn/T/pip-install-koagvkuh/libxml2-python3/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/lh/v0_07k9d7dbfqdytfzzxks3r0000gn/T/pip-pip-egg-info-0068v7va
         cwd: /private/var/folders/lh/v0_07k9d7dbfqdytfzzxks3r0000gn/T/pip-install-koagvkuh/libxml2-python3/
    Complete output (1 lines):
    failed to find headers for libxml2: update includes_dir
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I do have the headers installed as part of the OS-level "libxml2" package.

$ brew list libxml2
/usr/local/Cellar/libxml2/2.9.10_2/bin/xml2-config
/usr/local/Cellar/libxml2/2.9.10_2/bin/xmlcatalog
/usr/local/Cellar/libxml2/2.9.10_2/bin/xmllint
/usr/local/Cellar/libxml2/2.9.10_2/include/libxml2/ (47 files)
/usr/local/Cellar/libxml2/2.9.10_2/lib/libxml2.2.dylib
/usr/local/Cellar/libxml2/2.9.10_2/lib/cmake/libxml2/libxml2-config.cmake
/usr/local/Cellar/libxml2/2.9.10_2/lib/pkgconfig/libxml-2.0.pc
/usr/local/Cellar/libxml2/2.9.10_2/lib/python3.9/ (4 files)
/usr/local/Cellar/libxml2/2.9.10_2/lib/ (3 other files)
/usr/local/Cellar/libxml2/2.9.10_2/share/aclocal/libxml.m4
/usr/local/Cellar/libxml2/2.9.10_2/share/doc/ (152 files)
/usr/local/Cellar/libxml2/2.9.10_2/share/gtk-doc/ (55 files)
/usr/local/Cellar/libxml2/2.9.10_2/share/man/ (4 files)

What do I need to do to make the obviously installed header files known to the installation?

BTW, brew reports version 2.9.10 of libxml2 while in Python, lxml.etree.LIBXML_VERSION and lxml.etree.LIBXML_COMPILED_VERSION both report 2.9.4:

$ brew info libxml2
libxml2: stable 2.9.10 (bottled), HEAD [keg-only]

Andreas Maier (maiera) wrote :

BTW, we have circumvented the issues by passing binary strings to etree.fromstring() and etree.XML().
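
A minimal sketch of that workaround, assuming the input carries no XML declaration with a conflicting encoding (lxml rejects unicode strings that contain an encoding declaration, and after encoding to bytes such a declaration would be honored by the parser). The helper name `to_xml_bytes` is made up for this example and is not part of lxml.

```python
def to_xml_bytes(xml_text):
    """Return UTF-8 bytes for text input; pass byte strings through unchanged.

    Hypothetical helper, not an lxml API.
    """
    if isinstance(xml_text, bytes):
        return xml_text
    return xml_text.encode("utf-8")


xml_bytes = to_xml_bytes(u'<FOO>\U00010142</FOO>')
print(repr(xml_bytes))

# Parse only if lxml is available in this environment.
try:
    from lxml import etree
except ImportError:
    etree = None

if etree is not None:
    root = etree.fromstring(xml_bytes)
    # The element text should round-trip to the original character.
    print(root.text == u'\U00010142')
```

The same bytes can be passed to etree.XML() instead of etree.fromstring().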

Andreas Maier (maiera) wrote :

I verified that the include files are in /usr/local/opt/libxml2/include/libxml2/libxml.

I tried adding "-I/usr/local/opt/libxml2/include" and "-I/usr/local/opt/libxml2/include/libxml2" to CFLAGS and to CPPFLAGS, neither of which helped.

scoder (scoder) wrote :

> With "released wheel" are you referring to a Python package or to an OS-level package that contains the wheel archive?

I meant the "released wheel" of lxml. You should not need to install libxml2 in that case.

> libxml2-python3

That's unrelated.
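
To see which libxml2 an installed lxml actually uses (for example, after switching to the released lxml wheel), the version tuples printed by the test program earlier in this thread can be compared directly. The sketch below is only an illustration: the helper name `libxml2_report` is made up, and the default `minimum` of (2, 9, 10) is an assumption taken from the Homebrew version mentioned above, not a documented threshold for this bug.

```python
def libxml2_report(used, compiled, minimum=(2, 9, 10)):
    """Summarize libxml2 version tuples as reported by lxml.etree.

    Hypothetical helper; `minimum` is an illustrative assumption.
    """
    return {
        "used": ".".join(str(part) for part in used),
        "compiled": ".".join(str(part) for part in compiled),
        "mismatch": tuple(used) != tuple(compiled),
        "older_than_minimum": tuple(used) < tuple(minimum),
    }


# Values from the original bug description.
print(libxml2_report((2, 9, 4), (2, 9, 4)))

# Report the values for the local installation, if lxml is available.
try:
    from lxml import etree
except ImportError:
    etree = None

if etree is not None:
    print(libxml2_report(etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION))
```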

scoder (scoder) wrote :

Closing, probably due to an old libxml2 library version.

Changed in lxml:
status: Triaged → Invalid
reagle (joseph-acct) wrote :

I'm having this problem on macOS 11.5.1 (20G80, arm64); it is discussed in [this gist](https://gist.github.com/karlcow/5c11c06fb0345ea02ad51e5f7e9a2d9f#gistcomment-3846730). Should I open a new issue?
