lxml

Bug #1595781
Comment #8

scoder (scoder) wrote on 2016-09-03:

Ah, ok, but your example is using the html5lib parser. For that, 166ms is actually not bad at all. What I was thinking of was to use your parser that apparently generates a libxml2 tree, then let libxml2 serialise it to a UTF-8 string, and then let lxml parse that. That should be quite quick and might still beat html5lib.

Anyway, given that you seem to be struggling hard to improve the parsing performance, I guess that losing time in a parse-serialise-parse cycle is not going to make you happy.

Is that custom HTML5 parser available somewhere? Could it be merged into libxml2, for example?

Using a capsule for passing the tree around seems ok. lxml could require a specific name for the capsule, e.g. "libxml2_html_xmlDocPtr", to make it explicit what the content is and how it should be handled. Basically, anything that can execute C code is responsible for its own crashes, so if external native code is able to forge a libxml2 tree, letting it pass that into lxml doesn't subtract anything from the safety level.
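To illustrate the named-capsule check, here is a minimal sketch using only ctypes and the CPython capsule API. It is not lxml code: the helper unwrap_document() is hypothetical, the capsule name is simply the one proposed above, and a plain byte buffer stands in for a real xmlDoc, so nothing here touches libxml2.

```python
import ctypes

# Capsule name proposed in the comment above; producer and consumer must agree on it.
CAPSULE_NAME = b"libxml2_html_xmlDocPtr"

capi = ctypes.pythonapi  # CPython C API, called with the GIL held

capi.PyCapsule_New.restype = ctypes.py_object
capi.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
capi.PyCapsule_IsValid.restype = ctypes.c_int
capi.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
capi.PyCapsule_GetPointer.restype = ctypes.c_void_p
capi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]


def unwrap_document(obj):
    """Hypothetical consumer-side check: accept only capsules with the agreed name."""
    if not capi.PyCapsule_IsValid(obj, CAPSULE_NAME):
        raise TypeError("expected a capsule named %r" % CAPSULE_NAME.decode())
    # Returns the raw address stored in the capsule (an int).
    return capi.PyCapsule_GetPointer(obj, CAPSULE_NAME)


# Stand-in for an xmlDoc allocated by external native code (assumption: any
# non-NULL address will do for demonstrating the name check).
fake_doc = ctypes.create_string_buffer(16)
fake_addr = ctypes.addressof(fake_doc)

# No destructor is installed here (NULL), so this capsule does not own anything.
capsule = capi.PyCapsule_New(fake_addr, CAPSULE_NAME, None)
recovered = unwrap_document(capsule)

# A capsule with a different name is rejected. The name bytes are kept in a
# variable because the capsule stores the pointer, not a copy of the string.
OTHER_NAME = b"something_else"
other = capsule_other = capi.PyCapsule_New(fake_addr, OTHER_NAME, None)
try:
    unwrap_document(other)
    rejected = False
except TypeError:
    rejected = True
```

The name acts only as a type tag. It does not validate the pointer itself, which matches the point above: whoever creates the capsule from native code is responsible for its content.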

Is there a reason why you'd need the "free_function", though? I cannot imagine that there is code out there that creates an xmlDoc with anything but xmlNewDoc(). Or is your intention to apply additional cleanup measures to the document? That is a problem all by itself, though, right? I assume that the capsule needs to own the document while it is being passed around. If so, what if the user receives the capsule and throws it away immediately? I wouldn't want that case to leak the entire document memory.

That means that lxml must *always* make a copy of what it receives, and the capsule must *always* clean up what it holds and tie the document memory lifetime to its own. Unless we do the ownership transfer at the C level, as I suggested...