Stripping whitespace from an XSL transformation crashes in libxslt

Bug #583249 reported by Daniel Varela Santoalla
40
This bug affects 5 people
Affects Status Importance Assigned to Milestone
libxslt
New
Undecided
Unassigned
lxml
Confirmed
High
scoder

Bug Description

pepe@celia:~$ python crash/crashme.py
lxml.etree: (2, 2, 6, u'crash')
libxml used: (2, 7, 6)
libxml compiled: (2, 7, 6)
libxslt used: (1, 1, 22)
libxslt compiled: (1, 1, 22)

Doing XSL from the TOP
Parsed form, got <lxml.etree._ElementTree object at 0x819828c>
........
........
........

Doing XSL from the TARGET
Segmentation fault (core dumped)
$

crash/crashme.py is:

import lxml
import lxml.etree
from lxml.etree import Element, ElementTree, tostring

def process(path, add=True, xslFromTop=False):

    doc = lxml.etree.parse("data/form.xml")
    target = doc.xpath(path)[0]

    xslt_doc = lxml.etree.parse(open("static/form.xsl"))
    transform = lxml.etree.XSLT(xslt_doc)
    if xslFromTop:
        xslresult = transform(doc,edit="'True'")
    else:
        xslresult = transform(target,edit="'True'")

    return xslresult, doc

if __name__ == "__main__":

    print "lxml.etree: ", lxml.etree.LXML_VERSION
    print "libxml used: ", lxml.etree.LIBXML_VERSION
    print "libxml compiled: ", lxml.etree.LIBXML_COMPILED_VERSION
    print "libxslt used: ", lxml.etree.LIBXSLT_VERSION
    print "libxslt compiled: ", lxml.etree.LIBXSLT_COMPILED_VERSION

    for i in range(0,10):
        doc = lxml.etree.parse("data/form.xml")
        print "Parsed form, got %s " % doc

    print "Doing XSL from the TOP"
    print process("/Application/Languages",xslFromTop=True)

    print "Doing XSL from the TARGET"
    print process("/Application/Languages")

    print "Finished"

pepe@celia:~$ uname -a
Linux celia 2.6.22.19-0.4-bigsmp #1 SMP 2009-08-14 02:09:16 +0200 i686 i686 i386 GNU/Linux
pepe@celia:~$ python --version
Python 2.5.4

Revision history for this message
scoder (scoder) wrote : Re: [Bug 583249] [NEW] Applying an XSL transformation from a non-root XML element crashes

> doc = lxml.etree.parse("data/form.xml")
> xslt_doc = lxml.etree.parse(open("static/form.xsl"))

Could you provide these two files so that I can test it here?

Thanks,

Stefan

Revision history for this message
Daniel Varela Santoalla (dvarela) wrote :

Thanks Stefan.

Here are the files. You also need default.xsl because it is imported
from form.xsl.

Cheers
d

Daniel Varela Santoalla
European Centre for Medium-Range Weather Forecasts (ECMWF)
(http://www.ecmwf.int)
Email: <email address hidden> Telephone: (+44) 118 9499608

Stefan Behnel wrote:
>> doc = lxml.etree.parse("data/form.xml")
>> xslt_doc = lxml.etree.parse(open("static/form.xsl"))
>
> Could you provide these two files so that I can test it here?
>
> Thanks,
>
> Stefan
>

Revision history for this message
scoder (scoder) wrote : Re: Applying an XSL transformation from a non-root XML element crashes

Crash confirmed, valgrind says it's accessing memory that's already been freed.

Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
scoder (scoder) wrote :

It seems like libxslt modifies the input document when stripping spaces. Removing the <xsl:strip-space> tag might provide a work-around. I've opened an upstream bug.

Changed in libxslt:
status: Unknown → New
Revision history for this message
Daniel Varela Santoalla (dvarela) wrote : Re: [Bug 583249] Re: Applying an XSL transformation from a non-root XML element crashes

Hi Stefan

You are right, removing the <xsl:strip-space elements="*" /> tag
prevents the application from crashing.

This may do it for us if we make sure that the initial XML data doesn't
have any whitespace within the tags, but I still have to check it
thoroughly.

Best regards
Daniel Varela Santoalla

Bug Watch Updater wrote:
> ** Changed in: libxslt
> Status: Unknown => New
>

scoder (scoder)
summary: - Applying an XSL transformation from a non-root XML element crashes
+ Stripping whitespace from an XSL transformation crashes in libxslt
Changed in libxslt:
importance: Unknown → Critical
Revision history for this message
Marek Sebera (marek-sebera) wrote :

This bug does affect my setup using python3 and lxml library, using Debian 9.0

If it will be of any help, I can provide sample XML and XSLT for debugging purposes.

Removing "<xsl:strip-space elements="*"/>" declaration will solve the problem.

Crash occurs in tree.c:3672 see>

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5f0751a in xmlFreeNodeList__internal_alias (cur=0x555555e40af0) at ../../tree.c:3672
3672 ../../tree.c: No such file or directory.

system info>

$> apt-cache policy python3
python3:
  Installed: 3.5.3-1
  Candidate: 3.5.3-1
  Version table:
 *** 3.5.3-1 500
        500 http://ftp.debian.org/debian stretch/main amd64 Packages
        100 http://ftp.debian.org/debian sid/main amd64 Packages
        100 /var/lib/dpkg/status

$> apt-cache policy python3-lxml
python3-lxml:
  Installed: 3.7.3-1
  Candidate: 3.7.3-1
  Version table:
 *** 3.7.3-1 100
        100 http://ftp.debian.org/debian sid/main amd64 Packages
        100 /var/lib/dpkg/status
     3.7.1-1 500
        500 http://ftp.debian.org/debian stretch/main amd64 Packages

Revision history for this message
Nicolas Grégoire (nicolas-gregoire) wrote :

This bug (still unfixed) seems to be caused by lxml, not libxslt. Reproducing this crash using any other libxslt binding (Python, Perl, JavaScript) wasn't possible.

Changed in libxslt:
status: New → Confirmed
Revision history for this message
scoder (scoder) wrote :
Download full text (4.3 KiB)

Here is the valgrind output, using libxslt 1.1.32, libxml2 2.9.8 and CPython 3.7.0:

==20079== 1 errors in context 1 of 50:
==20079== Invalid free() / delete / delete[] / realloc()
==20079== at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20079== by 0x6F377FE: xmlFreeNodeList (tree.c:3721)
==20079== by 0x6F3788C: xmlFreeNodeList (tree.c:3692)
==20079== by 0x6F3788C: xmlFreeNodeList (tree.c:3692)
==20079== by 0x6F37583: xmlFreeDoc (tree.c:1253)
==20079== by 0x6415957: __pyx_pf_4lxml_5etree_9_Document___dealloc__ (etree.c:51785)
==20079== by 0x64157B0: __pyx_pw_4lxml_5etree_9_Document_1__dealloc__ (etree.c:51765)
==20079== by 0x6744E65: __pyx_tp_dealloc_4lxml_5etree__Document (etree.c:224844)
==20079== by 0x67457E8: __pyx_tp_dealloc_4lxml_5etree__Element (etree.c:225159)
==20079== by 0x674667F: __pyx_tp_dealloc_4lxml_5etree__ElementTree (etree.c:226052)
==20079== by 0x1FE1CA: tupledealloc (tupleobject.c:246)
==20079== by 0x1688F4: call_function (ceval.c:4615)
[...]
==20079== Address 0x887dd00 is 0 bytes inside a block of size 120 free'd
==20079== at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20079== by 0x6AB2744: xsltApplyStripSpaces (transform.c:5732)
==20079== by 0x6AB3733: xsltApplyStylesheetInternal (transform.c:6011)
==20079== by 0x66D4E1F: __pyx_f_4lxml_5etree_4XSLT__run_transform (etree.c:200006)
==20079== by 0x66CE938: __pyx_pf_4lxml_5etree_4XSLT_18__call__ (etree.c:198792)
==20079== by 0x66CC3EF: __pyx_pw_4lxml_5etree_4XSLT_19__call__ (etree.c:198352)
==20079== by 0x18999E: _PyObject_FastCallKeywords (call.c:199)
==20079== by 0x16A617: call_function (ceval.c:4605)
[...]
==20079== Block was alloc'd at
==20079== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20079== by 0x70154AE: xmlSAX2TextNode (SAX2.c:1863)
==20079== by 0x70181BC: xmlSAX2Characters (SAX2.c:2557)
==20079== by 0x6F1C1B3: xmlParseCharData (parser.c:4457)
==20079== by 0x6F29AB6: xmlParseContent (parser.c:9862)
==20079== by 0x6F2A492: xmlParseElement (parser.c:10014)
==20079== by 0x6F29B5A: xmlParseContent (parser.c:9846)
==20079== by 0x6F2A492: xmlParseElement (parser.c:10014)
==20079== by 0x6F2AC1A: xmlParseDocument (parser.c:10711)
==20079== by 0x6F323A0: xmlDoRead (parser.c:15191)
==20079== by 0x6F323A0: xmlCtxtReadFile (parser.c:15436)
==20079== by 0x6558BCB: __pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile (etree.c:122932)
[...]

It shows that libxslt frees text nodes in xsltApplyStripSpaces(), which are then freed again by xmlFreeDoc() later. Meaning, somehow, they still reside in the document, although they have been freed. libxslt clearly corrupts the tree state here, which then leads to a crash when lxml discards the input document.

These nodes are created by the parser in libxml2, freed by the XSLT processor in libxslt, and then freed again by the document disposal in libxml2. All of this is outside of the control of lxml. Honestly, I cannot see what lxml could do to prevent this. It cannot even safely warn about XSLTs that strip whitespace, because that can even be tr...

Read more...

Revision history for this message
scoder (scoder) wrote :

Upstream bug reported again here:
https://gitlab.gnome.org/GNOME/libxslt/issues/14

Changed in libxslt:
importance: Critical → Undecided
status: Confirmed → New
guanlonghuang (jace833)
Changed in lxml:
status: Confirmed → Fix Committed
tags: added: guanlonghuang
scoder (scoder)
Changed in lxml:
status: Fix Committed → Confirmed
tags: removed: guanlonghuang
Revision history for this message
scoder (scoder) wrote :

Upstream bug reported again here:
https://gitlab.gnome.org/GNOME/libxslt/-/issues/54

Revision history for this message
Thomas Maier (thaier) wrote :

After searching for the reason of mysterious memory troubles in my application, I found out that this bug was the case for them.

I understand that it seems to be an upstream problem which can`t be fixed by the lxml team. I nevertheless would have a suggestion to prevent others from going through my trouble:

Would it be possible to add a prominent warning like this to https://lxml.de/xpathxslt.html#xslt:

"WARNING: Due to a bug in libxslt the usage of <xsl:strip-space elements="*"/> in an XSLT stylesheets leads to crashes/memory failures. For details see: https://gitlab.gnome.org/GNOME/libxslt/-/issues/14"

This would have definitely helped me a lot, as I followed the instructions there very closely.

Revision history for this message
scoder (scoder) wrote :

It's not clear under which conditions this crashes, because it may not always crash. But I think a warning may prove more helpful than not. PR welcome.

Revision history for this message
Norman Gray (normangray) wrote :

(moving the discussion here from closed issue https://bugs.launchpad.net/bugs/1972065)

I'm afraid I'm not familiar with libxslt, so I'm afraid I won't be able to commit to creating a reproducer (that's not a 'no' -- I may not be able to resist -- but I think it's unlikely).

But that said...

If I were a libxslt developer, triaging https://gitlab.gnome.org/GNOME/libxslt/-/issues/14 or https://gitlab.gnome.org/GNOME/libxslt/-/issues/54 , I think I would be unconvinced that the problem was necessarily in libxslt, rather than in the wrapping code (I've wrapped libraries in higher-level languages before, and I'm painfully familiar with the memory handling challenges involved). Thus I think a reproducer from someone would be necessary before this report were taken seriously in libxslt-land. That reproducer would require more familiarity with libxslt and libxml than I am likely to acquire.

Like you, I'm surprised that the input document is modified in memory when xsl:strip-space is present, but I don't see anything in a quick look through the libxslt documentation that promises that it isn't thus modified, or that the input tree provided to xsltApplyStylesheet is expected to be reusable afterwards. As far as I can see, from https://gnome.pages.gitlab.gnome.org/libxslt/devhelp/libxslt-transform.html and https://gnome.pages.gitlab.gnome.org/libxslt/tutorial/libxslttutorial.html (though, as I say, I haven't studied these in detail), even if the input document were comprehensively trashed as part of the transformation, that would not violate any commitments in the API documentation. Thus I suspect that the response to libxslt issue 14 might be ‘don't do that, then’.

We've established that the xsl:strip-space problem appears (only?) when the library is processing an input document, starting from somewhere _other_ than the root. I'm guessing that you're using xsltApplyTemplates to do that. The documentation in https://gnome.pages.gitlab.gnome.org/libxslt/devhelp/libxslt-transform.html#xsltApplyTemplates is unfortunately rather terse, and in particular doesn't say whether starting a transform from here is supported behaviour on the part of the library. I notice that that function isn't mentioned in the 'Extended Tutorial' either. That is, this could result in another ‘don't do that, then’.

I can see the function xsltNewTransformContext, but xsltTransformContextPtr is documented in the xsltInternals section of the API documentation, which suggests (i) that use of the context pointer, and the xsltNewTransformContext function, is not expected to be used from user code (which the lxml wrapper would count as); and (ii) the fact that something is visible and 'documented' does not imply that it is intended to be used by external code (ie, just because xsltApplyTemplates is visible might not imply that the library expects others to use it).

What I've written here is part speculation, of course, and I acknowledge that may not be very valuable as a contribution to this issue. I say this only to observe that, putting myself in the position of the libxslt developer, I would not feel forced to agree that there was an actual libxslt bug here.

Revision history for this message
Norman Gray (normangray) wrote :

I knew I wouldn't be able to resist....

I've added [libxslt issue 67](https://gitlab.gnome.org/GNOME/libxslt/-/issues/67), which is a C-only reproduction of [libxslt issue 14](https://gitlab.gnome.org/GNOME/libxslt/-/issues/14). I note in that report that this may be a docbug, in the sense that this _may_ be intended behaviour but insufficiently clearly highlighted (I think that would be unfortunate).

I was thinking a little more about the aspect of the [other report](https://bugs.launchpad.net/lxml/+bug/1972065) that the double-free error happened only when it was a _part_ of the tree that was being transformed, and I wanted to understand how libxslt does what it does.

I don't know how lxml implements this transformation on a subtree, but in the attached tarball I include one way of doing this, which manages to produce the

    do-transform-3(95501,0x11be40600) malloc: *** error for object 0xffffffffffffffff: pointer being freed was not allocated

error if `AVOID_ERROR` is 0, and completes without error if it's defined to be true. I don't know if that's useful. informative, or interesting to you.

I'm conscious that there are two potentially very separate issues being discussed here, and possibly conflated, namely the double-free which I noticed in bug #1972065, and the double-transformation which this current report is focused on. The puzzling thing (to me) is that the double-free problem only seems to happen with a stylesheet which includes <xsl:strip-space/>, and it's that that is the link to the double-transformation problem.

Revision history for this message
Thomas Maier (thaier) wrote :

I don't know if this might help in fixing this bug, but I noticed the following things:

1. If I do the xslt transformation on a deepcopy of the subtree I do not get the memory errors when stripping whitespaces.

2. My current script uses lxml etree to find certain elements in a xml file (2,4MB) and then only uses xslt on the found sub-elements. At first this script used to run 5+ minutes. I was able to figure out that the xslt transformation was causing the problem as it seemed to go through the whole tree, even though I only passed a subtree. Here again a deepcopy solved the issue and the script works in under 1 second.

So to me it seems that when passing a subelement xslt still somehow processes the whole tree.

Hope that helps!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers