Different libxml2 versions in lxml and xmlsec misbehave

Bug #1960668 reported by dx
This bug affects 9 people
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hey there!

I've hit a weird issue that's reproducible, but only in some environments. It's a regression between lxml 4.6.5 and 4.7.1.

It shows up when calling .find() on xmlsec's output: the lookup finds nothing even though the element is there.

Affected:
- ubuntu 20.04 + lxml 4.7.1 from manylinux wheels
- debian 11 + lxml 4.7.1 from manylinux wheels

Not affected:
- ubuntu 21.10 and 22.04
- arch linux
- lxml 4.6.5
- any lxml built from the source tree with static libxml off
- running under pytest on debian 11 + lxml 4.7.1

Seems to be independent of the Python version (the debian 11 images are docker's python:3.8 and python:3.10).

I did a bisect with `make wheel_manylinux_2_24_x86_64 PYTHON_BUILD_VERSION='cp38*'`

The commit that introduced the issue is 7b941e58ab088a25a8e0a7f6e13e4e5b9dd93c37

>commit 7b941e58ab088a25a8e0a7f6e13e4e5b9dd93c37 (HEAD, refs/bisect/bad)
>Author: Stefan Behnel <email address hidden>
>Date: Wed Nov 3 09:50:09 2021 +0100
>
> Switch to latest libxml2 2.9.12+ (unreleased) that has fixes for traversing lxml's fake root trees.

Library versions of the ubuntu 20.04 + python 3.8 running the commit above:

Python : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree : (4, 6, 4, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

ii libxml2:amd64 2.9.10+dfsg-5ubuntu0.20.04.1 amd64 GNOME XML library
ii libxmlsec1:amd64 1.2.28-2 amd64 XML security library

Library versions of the ubuntu 22.04 where this issue cannot be reproduced:

Python : sys.version_info(major=3, minor=9, micro=10, releaselevel='final', serial=0)
lxml.etree : (4, 7, 1, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

ii libxml2:amd64 2.9.12+dfsg-5 amd64 GNOME XML library
ii libxmlsec1:amd64 1.2.33-1build1 amd64 XML security library

The test case: (testcase.py)

```
import xmlsec
from lxml import etree as ET

envelope = ET.fromstring('<a></a>')

signature = xmlsec.template.create(
    envelope,
    xmlsec.Transform.EXCL_C14N,
    xmlsec.Transform.RSA_SHA256,
    ns="ds"
)

ds = ET.QName(signature).namespace

canonicalization_method = signature.find(".//{%s}CanonicalizationMethod" % ds)
if canonicalization_method is not None:
    print('ok')
else:
    print('fail')
    exit(1)
```

Dockerfile based on debian 11:

```
FROM python:3.10
RUN apt update && \
    apt install -y libxmlsec1 pkg-config libxmlsec1-dev && \
    pip install xmlsec lxml
COPY testcase.py /usr/src/
```

Running:

    $ docker build -t testcase .
    $ docker run --rm -it -v $PWD:/usr/src testcase python /usr/src/testcase.py
    fail

Replacing this find:

    signature.find(".//{%s}CanonicalizationMethod" % ds)

with this:

    signature.xpath(".//ds:CanonicalizationMethod", namespaces={"ds": ds})

...makes it succeed.

I have absolutely no idea why this test case doesn't fail when run in the exact same docker container under pytest, but I didn't get around to minimizing that setup.

Michal Čihař (nijel) wrote :

This seems to be the cause of https://github.com/onelogin/python3-saml/issues/292. For me the workaround is not to use the official wheels but to build from source, which makes it use the system libxml.

Anders Kaseorg (andersk) wrote :

Note to those trying to reproduce: the test case file attached to comment 1 is the version with the workaround applied, so it doesn’t reproduce the bug. Copy the test case from the original description instead.

Stu Tomlinson (nosnilmot) wrote :

An alternative 'workaround' is to serialize and deserialize the tree before the signature.find(), i.e.:

    signature = ET.fromstring(ET.tostring(signature))

A similar issue is also reported here:
https://<email address hidden>/thread/SCMXQYGN7CQMSPJI3PEW2YBT4YZKNML2/

scoder (scoder) wrote :

Since this changes with the libxml2 version, it's probably also a bug in libxml2.

The most likely reason for this is that lxml takes a shortcut for comparing tag names, knowing that they are hash-deduplicated in the tree. It therefore compares them by pointer and not by value. If anything adds non-deduplicated tag names to the tree, thus violating one of lxml's tree invariants, then lxml won't find them any more. This does not affect XPath, which uses the normal (more costly) string comparison by value in libxml2.

Did anyone manage to reproduce this without xmlsec? If not, then it might also be a problem over there. I don't know how xmlsec creates its lxml tree.
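
To illustrate the pointer-versus-value comparison described above, here is a toy model in plain Python. It is not lxml's or libxml2's actual code. The classes `Name`, `NameDict` and `Node` and both lookup functions are invented for this sketch, and object identity (`is`) stands in for C pointer comparison.

```python
class Name:
    """A tag name object; its identity plays the role of a C string pointer."""

    def __init__(self, text):
        self.text = text


class NameDict:
    """Toy name dictionary: hands out exactly one Name object per distinct text."""

    def __init__(self):
        self._names = {}

    def intern(self, text):
        if text not in self._names:
            self._names[text] = Name(text)
        return self._names[text]


class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def iter_nodes(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_by_pointer(root, name):
    """Fast path: compare name objects by identity."""
    return next((n for n in iter_nodes(root) if n.name is name), None)


def find_by_value(root, text):
    """Slow path: compare the name text by value."""
    return next((n for n in iter_nodes(root) if n.name.text == text), None)


names = NameDict()

# This node's name was created outside the dictionary, breaking the invariant.
foreign = Node(Name("CanonicalizationMethod"))
root = Node(names.intern("Signature"),
            [Node(names.intern("SignedInfo"), [foreign])])

wanted = names.intern("CanonicalizationMethod")
print(find_by_pointer(root, wanted))                             # None
print(find_by_value(root, "CanonicalizationMethod") is foreign)  # True
```

In this model the identity-based lookup misses the node whose name bypassed the dictionary, while the value-based lookup still finds it. That mirrors the reported difference between .find() and .xpath().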

Stu Tomlinson (nosnilmot) wrote :

I've done some further testing and I believe I have narrowed down the root cause here to be a problem with xmlsec and lxml using different versions of libxml2 at runtime.

lxml wheels are built with statically linked libxml2 (2.9.12+ for lxml >= 4.7.0), while xmlsec compiles against (and dynamically links to) the system libxml2 at installation time. If the two versions are sufficiently different, I think some libxml2 internal mismatch causes the issue reported here.
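
One way to see which shared libxml2 a process has loaded is to look at its memory map. The sketch below is Linux-only and parses text in the format of /proc/self/maps. The function name `shared_libxml2_paths` and the sample text are made up for illustration. A statically linked copy inside lxml's own extension module would not show up as a separate libxml2 entry.

```python
import os


def shared_libxml2_paths(maps_text):
    """Return the distinct mapped files whose basename starts with 'libxml2'.

    Each line of /proc/<pid>/maps has the form:
        address perms offset dev inode [pathname]
    Lines without a pathname (anonymous mappings) are skipped.
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        if os.path.basename(path).startswith("libxml2"):
            paths.add(path)
    return sorted(paths)


# Hypothetical sample in /proc/self/maps format (addresses and inodes invented).
SAMPLE = """\
7f1c2a000000-7f1c2a1b2000 r-xp 00000000 08:01 131 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.10
7f1c2a400000-7f1c2a900000 r-xp 00000000 08:01 977 /usr/local/lib/python3.10/site-packages/lxml/etree.cpython-310-x86_64-linux-gnu.so
7f1c2b000000-7f1c2b021000 rw-p 00000000 00:00 0
"""

print(shared_libxml2_paths(SAMPLE))
# ['/usr/lib/x86_64-linux-gnu/libxml2.so.2.9.10']
```

To try it on a real process, import xmlsec and lxml first and pass in the contents of /proc/self/maps.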

Simply upgrading the system libxml2 (note: I do not recommend anyone actually does this to replace OS provided libxml2!) without recompiling or reinstalling xmlsec or lxml is one way to "resolve" it.

Ideally, lxml would be dynamically linked to libxml2, and share the same system-provided libxml2 library as xmlsec. I do not fully understand why this is not the case, or what the capabilities or limitations of python packaging are.

My recommended solution/workaround, for the time being, is for affected users to install lxml using 'pip install --no-binary lxml lxml' to force it to be built locally, which will also result in using shared libxml2 instead of static.

Other suggestions are welcome, especially if this further detail allows for a code-based solution that avoids all problems :)

Sandro (supersandro2000) wrote :

> My recommended solution/workaround, for the time being, is for affected users to install lxml using 'pip install --no-binary lxml lxml' to force it to be built locally, which will also result in using shared libxml2 instead of static.

I just discovered that using libxml2 2.10.0 results in `etree.fromstring` returning None for valid HTML pages. I would appreciate it if lxml would upgrade its supported libxml2 versions.

scoder (scoder)
summary: - A specific .find() returns None since the switch to libxml2 2.9.12+
+ Different libxml2 versions in lxml and xmlsec misbehave
scoder (scoder) wrote :

Yes, the above is the correct fix. Use a local source build to make sure that both xmlsec and lxml use the same version of libxml2.

    pip install --no-binary lxml lxml

Note that the Anaconda/conda-forge/etc. packages should also come with matching libxml2 libraries and should thus work out of the box.

It's generally difficult to ensure that different Python packages that depend on an external library get to use the same library version, since Python packages cannot control the system libraries. And the large majority of all lxml installations benefit from a one-package-includes-all binary installation. Complicating that to make life easier for xmlsec users would be the wrong trade-off (sorry).

I'm open to improvements, but I value the simplicity of millions of installations higher than that of thousands.

Changed in lxml:
status: New → Invalid