Provide some way to pass in a pre-built libxml2 tree

Bug #1595781 reported by Kovid Goyal
Affects: lxml | Status: Fix Released | Importance: Wishlist | Assigned to: Unassigned

Bug Description

Hi,

Reading through the docs/FAQ I see no way to pass a pre-built libxml2 tree to lxml (say, via a PyCapsule).

Background:

In my application, I use a custom C-based parser that produces an xmlDocPtr (libxml2 document). I would like to convert this into an lxml tree. However, I could find no API for doing this in lxml.

Ideally, I'd like a Python API, something like:

from lxml.etree import wrap_libxml2_document

root = wrap_libxml2_document(py_capsule, free_function=None)

where free_function should be a Python function that is passed the capsule as an argument and is responsible for freeing the memory of the libxml2 document. If free_function is None, lxml should clone the tree.

I'd be willing to implement this myself if you can give me a few pointers on where to get started in the lxml codebase.

Revision history for this message
Kovid Goyal (kovid) wrote :

On second thoughts, when free_function is None, it should probably default to using xmlFreeDoc() rather than creating a copy.

Revision history for this message
Kovid Goyal (kovid) wrote :

Is there some reason this request is being ignored? If you don't want to implement such an API, I'd appreciate your saying so, then I can move on.

Revision history for this message
scoder (scoder) wrote :

Sorry for being unresponsive, for reasons unrelated to this topic.

It doesn't really feel safe to provide a "use this arbitrary pointer and dealloc it" kind of Python-level interface to lxml. If you have a C based parser anyway, why wouldn't it be enough to provide this function at the C-API level?

Changed in lxml:
importance: Undecided → Wishlist
status: New → Triaged
Revision history for this message
Kovid Goyal (kovid) wrote :

Providing it at the C-API level means that the extension becomes dependent on lxml, which makes building/distributing it harder. After all, creating an lxml tree is only one possible use the C-based parser can be put to. If there is a Python API in lxml, then my extension can remain completely independent of lxml, and people who want to use it with lxml would only need to do:

from my_extension import parse
capsule = parse(...)
from lxml.etree import wrap_libxml2_document
root = wrap_libxml2_document(capsule)

This keeps coupling to a minimum. If you are concerned about people abusing the API, then you can add loud warnings to it, either using the documentation, the Python warnings module, or even adding a boolean parameter to the function, like:

def wrap_libxml2_document(capsule, i_know_what_i_am_doing=False):
    if not i_know_what_i_am_doing:
        raise Exception('...some warning message about using this API...')

Revision history for this message
Kovid Goyal (kovid) wrote :

I'd like a decision, please, so I can move on.

Revision history for this message
scoder (scoder) wrote :

I'm still not happy with the general idea. lxml makes a couple of assumptions about the tree, which may no longer hold for externally provided trees. That suggests that lxml would at least want to make a copy instead of using the tree as it comes in.

Is it really that bad to serialise the tree after building it and to parse it with lxml then? How performance critical can such code be?

What kind of "custom parser" do you actually use? Is that really external to libxml2?

Revision history for this message
Kovid Goyal (kovid) wrote :

We are talking about parsing and serializing HTML files that can, in the worst case, be several tens of MB, on relatively weak hardware (user computers, not servers). On my dev machine (3.7GHz i7), running serialize+parse on this document https://www.w3.org/TR/html5/single-page.html, which is only 5.7MB, takes 166 milliseconds on average.

python -m timeit -s 'import html5lib, lxml; root = html5lib.parse(open("single-page.html").read(), treebuilder="lxml")' 'lxml.etree.fromstring(lxml.etree.tostring(root))'
10 loops, best of 3: 166 msec per loop

The custom parser is an HTML 5 parser implemented in C for much better performance than html5lib. (html5lib's default lxml tree builder is horrible. I implemented a better one here: https://github.com/html5lib/html5lib-python/issues/119 but its performance is still pretty bad.)

If the assumptions that lxml makes about the tree are documented, there is no reason why third-party parsers cannot conform to them. As I said before, you can mark the API as dangerous in one of the many ways I mentioned.

Revision history for this message
scoder (scoder) wrote :

Ah, ok, but your example is using the html5lib parser. For that, 166ms is actually not bad at all. What I was thinking of was to use your parser that apparently generates a libxml2 tree, then let libxml2 serialise it to a UTF-8 string, and then let lxml parse that. That should be quite quick and might still beat html5lib.
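
(As a rough illustration of that parse-serialise-parse route, assuming a hypothetical my_extension.serialise_to_html() that returns the UTF-8 bytes produced by libxml2 on the C side, the Python part would just be a plain parse:)

from lxml import etree
from my_extension import serialise_to_html  # hypothetical C parser + libxml2 serialisation

serialised_html = serialise_to_html(open("single-page.html", "rb").read())
root = etree.fromstring(serialised_html, parser=etree.HTMLParser())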

Anyway, given that you seem to be struggling hard to improve the parsing performance, I guess that losing time in a parse-serialise-parse cycle is not going to make you happy.

Is that custom HTML5 parser available somewhere? Could it be merged into libxml2, for example?

Using a capsule for passing the tree around seems ok. lxml could require a specific name for the capsule, e.g. "libxml2_html_xmlDocPtr", to make it explicit what the content is and how it should be handled. Basically, anything that can execute C code is responsible for its own crashes, so if external native code is able to forge a libxml2 tree, letting it pass that into lxml doesn't subtract anything from the safety level.

Is there a reason why you'd need the "free_function", though? I cannot imagine that there is code out there that creates an xmlDoc with anything but xmlNewDoc(). Or is your intention to apply additional cleanup measures to the document? That is a problem all by itself, though, right? I assume that the capsule needs to own the document while it is being passed around. If so, what if the user receives the capsule and throws it away immediately? I wouldn't want that case to leak the entire document memory.

That means that lxml must *always* make a copy of what it receives, and the capsule must *always* clean up what it holds and tie the document memory lifetime to its own. Unless we do the ownership transfer at the C level, as I suggested...

Revision history for this message
Kovid Goyal (kovid) wrote :

No, the html5lib parsing happens only once, during setup. See the -s flag used with timeit. The 166ms time is really for serialize+parse only. html5lib would take ~15 seconds to parse that document (which is ridiculous, and the reason I want to replace it).

time python -c "import html5lib; html5lib.parse(open('single-page.html').read(), treebuilder='lxml')"
python -c 15.72s user 0.18s system 99% cpu 15.906 total

As for what parser I would use, there are several, the most out-of-the-box ready one would be https://github.com/nostrademons/gumbo-libxml

However, more interesting would be to adapt the HTML 5 parser in Rust from the Servo project.

In either case, my needs are not for a vanilla HTML 5 parser; I would need to modify it in some ways to make it able to parse both invalid XHTML 5 and HTML 5, which is what we have in the ebook world.

I don't need the free function myself; it's just there for the sake of completeness of the API. I'm fine if you leave it out.

The way the capsule works is that my parser module creates the libxml2 document and wraps it in a capsule with a NULL destructor. That means it will not be freed when the capsule is garbage collected. The capsule is then passed to lxml, which takes over ownership of the libxml2 document and becomes responsible for freeing it. So, no, lxml does not need to copy the tree.
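
(For illustration, a minimal sketch of that hand-off done with ctypes in place of a real C extension; doc_ptr is assumed to be the address of the xmlDocPtr produced by the custom parser, and the capsule name used is the one settled on later in this thread:)

import ctypes

PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = (ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p)

# The capsule stores only the raw char* of its name, so keep this bytes object
# alive for as long as any capsule created from it exists.
CAPSULE_NAME = b"libxml2:xmlDoc"

def make_capsule(doc_ptr):
    # NULL destructor: the capsule itself never frees the document; whoever
    # adopts it (lxml) becomes responsible for calling xmlFreeDoc().
    return PyCapsule_New(doc_ptr, CAPSULE_NAME, None)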

Revision history for this message
Kovid Goyal (kovid) wrote :

Just to be clear what I am talking about, the capsule I refer to is this: https://docs.python.org/2/c-api/capsule.html

Revision history for this message
scoder (scoder) wrote :

I'm aware of what a capsule is. And I don't want an API that encourages C level memory leaks at the Python level just by dropping an object reference, e.g. because some recoverable exception got in the way somewhere.

Revision history for this message
Kovid Goyal (kovid) wrote :

If you don't want the capsule to have a NULL destructor, that is also fine. Simply have lxml call PyCapsule_SetDestructor(capsule, NULL) when it takes over ownership of the pointer to the libxml2 document.

Then there is no possibility of memory leaks and lxml does not have to make a copy of the document.

Revision history for this message
scoder (scoder) wrote :

... except when there is more than one reference to the capsule. I thought of that, too, yes.

However, by setting both the destructor and the pointer in the capsule to NULL, lxml can invalidate the capsule and take ownership of the pointer. Further attempts to pass that capsule into lxml would then raise an exception (ValueError, I guess). Not great, but at least safe and probably as intended in almost all cases.

One remaining problem: in order to safely take ownership of the document, lxml must validate whether the destructor is really the xmlFreeDoc function or some (wrapper) function that would need calling. If lxml and the external library use different versions (or different static builds) of libxml2, then the destructor will point to different xmlFreeDoc functions for the two, and lxml cannot determine if it can safely set the destructor to NULL or not. I guess it would be OK to copy the document in that case, but it's actually not entirely unlikely that that happens, given that many people use statically linked wheels these days (manylinux, win32, ...). If you can live with a copy in those cases, I think we're settled on the main details for now.

Revision history for this message
Kovid Goyal (kovid) wrote :

That is fine by me. In my application, I can guarantee that the same version of libxml2 is used both by the parser module and by lxml. In the general case, if people cannot guarantee that, they will have to take the performance hit.

Revision history for this message
scoder (scoder) wrote :

We could also use the capsule's "context" to point to a C string "destructor:xmlFreeDoc" to explicitly indicate that the cleanup uses the default case. That should then avoid the copy in almost all cases.

Revision history for this message
Kovid Goyal (kovid) wrote :

Sounds good. You can specify that the name must be libxml2:xmlDoc, and if the user is sure the libxml2 versions used are compatible, the context should be set to destructor:xmlFreeDoc for best performance. When the context is set correctly, lxml can take ownership; otherwise, it can copy.
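
(Continuing the earlier ctypes sketch, the producer side could mark the default destructor via the capsule context; as with the name, only the raw pointer is stored, so the string buffer has to stay alive:)

import ctypes

PyCapsule_SetContext = ctypes.pythonapi.PyCapsule_SetContext
PyCapsule_SetContext.restype = ctypes.c_int
PyCapsule_SetContext.argtypes = (ctypes.py_object, ctypes.c_void_p)

# Keep this buffer alive for as long as any capsule referencing it exists.
CAPSULE_CONTEXT = ctypes.create_string_buffer(b"destructor:xmlFreeDoc")

def mark_default_destructor(capsule):
    # Tells lxml that plain xmlFreeDoc() is the correct way to free the document,
    # so it can take ownership of the pointer instead of copying the tree.
    PyCapsule_SetContext(capsule, ctypes.cast(CAPSULE_CONTEXT, ctypes.c_void_p))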

Revision history for this message
Kovid Goyal (kovid) wrote :

Status on this?

Revision history for this message
scoder (scoder) wrote :

I didn't see a pull request for this on github yet. :)

Revision history for this message
Kovid Goyal (kovid) wrote :

As I said in my first post:

"I'd be willing to implement this myself if you can give me a few pointers on where to get started in the lxml codebase."

Revision history for this message
scoder (scoder) wrote :

I can provide the implementation if you can come up with a set of suitable tests. It should be possible to use cython.inline() for that (if Cython is installed and Py>=2.7). It's a bit underdocumented; you might have to read the source code. But it allows you to write the code in Cython and get it compiled from a Python string at runtime. The build setup should be available via lxml.get_include(), and you can define the linker flags via distutils arguments.

http://docs.cython.org/en/latest/src/reference/compilation.html#compiling-with-cython-inline

API function: adopt_external_document(capsule) -> ElementTree
Capsule name: (const char*) "libxml2:xmlDoc"
Fast capsule context: (const char*) "destructor:xmlFreeDoc"
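
(With that API in place, usage from Python would look roughly like this, assuming a hypothetical my_extension.parse() that returns a capsule named "libxml2:xmlDoc":)

from lxml import etree
from my_extension import parse  # hypothetical C parser module

capsule = parse(open("single-page.html", "rb").read())
tree = etree.adopt_external_document(capsule)  # ElementTree; lxml now owns the libxml2 document
root = tree.getroot()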

Revision history for this message
Kovid Goyal (kovid) wrote :

OK, I'll look into it in a few days, thanks.

Revision history for this message
Kovid Goyal (kovid) wrote :

Created pull request with tests as requested: https://github.com/scoder/lxml/pull/1

scoder (scoder)
Changed in lxml:
milestone: none → 3.8.0
status: Triaged → Fix Committed
Revision history for this message
Kovid Goyal (kovid) wrote :

Could you please make a release of lxml with this function, presumably 3.8.0. I have implemented a gumbo-based parser using it, but I cannot make my code public until a version of lxml with this feature is released.

Revision history for this message
scoder (scoder) wrote :

done

scoder (scoder)
Changed in lxml:
status: Fix Committed → Fix Released
Revision history for this message
Kovid Goyal (kovid) wrote :

Thanks. My parser is at: https://github.com/kovidgoyal/html5-parser

Still a couple of things left to implement, such as proper namespace support for SVG and MathML, and getting it to run the html5lib tests.
