lxml

Bug #1595781
Comment #9

Kovid Goyal (kovid) wrote on 2016-09-03:

No, the html5lib parsing happens only once, during setup; see the -s flag used with timeit. The 166ms figure really covers serialize+parse only. html5lib takes ~15 seconds to parse that document (which is ridiculous, and the reason I want to replace it):

time python -c "import html5lib; html5lib.parse(open('single-page.html').read(), treebuilder='lxml')"
python -c 15.72s user 0.18s system 99% cpu 15.906 total
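
To illustrate the setup/timed split that the -s flag provides, here is a minimal sketch using the timeit module from Python rather than the command line. It is not the benchmark from this bug: the standard-library xml.etree.ElementTree stands in for html5lib and lxml, and the generated document is a made-up placeholder. The point is only that the expensive initial parse runs once in setup and is excluded from the measured time.

```python
import timeit

# Runs once and is NOT included in the measurement (the role of -s).
# ElementTree and the generated markup are stand-ins, not the real benchmark.
setup = """
import xml.etree.ElementTree as ET
root = ET.fromstring('<html><body>' + '<p>text</p>' * 1000 + '</body></html>')
"""

# Only this statement is timed: serialize the tree, then parse it again.
stmt = "ET.fromstring(ET.tostring(root))"

loops = 10
elapsed = timeit.timeit(stmt, setup=setup, number=loops)
print("%.3f ms per loop" % (elapsed / loops * 1000))
```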

As for which parser I would use, there are several. The most out-of-the-box ready one would be https://github.com/nostrademons/gumbo-libxml

However, more interesting would be to adapt the HTML 5 parser in Rust from the Servo project.

In either case, my needs are not for a vanilla HTML 5 parser. I would need to modify it in some ways so that it can parse both invalid XHTML 5 and HTML 5, which is what we have in the ebook world.

I don't need the free function myself; it's just there for the sake of completeness of the API. I'm fine if you leave it out.

The way the capsule works is that my parser module creates the libxml2 document and wraps it in a capsule with a NULL destructor. That means the document will not be freed when the capsule is garbage collected. The capsule is then passed to lxml, which takes over ownership of the libxml2 document and becomes responsible for freeing it. So, no, lxml does not need to copy the tree.
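
The NULL-destructor behaviour can be sketched from Python with ctypes on CPython. This is only an illustration under stated assumptions: a ctypes buffer stands in for the libxml2 document that a real parser module would create in C, and the capsule name "example.document" is hypothetical. The lxml side that adopts the pointer is not shown, since that API is what this bug is discussing.

```python
import ctypes

# CPython capsule C API, reached through ctypes.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Hypothetical capsule name; it must stay alive as long as the capsule does,
# because the capsule stores the name pointer without copying it.
CAPSULE_NAME = b"example.document"

# Stand-in for the xmlDoc* that a real parser module would build in C.
buf = ctypes.create_string_buffer(b"<html/>")
address = ctypes.addressof(buf)

# Passing None as the destructor means NULL: destroying the capsule
# will not free or otherwise touch the wrapped pointer.
capsule = PyCapsule_New(address, CAPSULE_NAME, None)

# The consumer extracts the raw pointer and from here on owns the data.
recovered = PyCapsule_GetPointer(capsule, CAPSULE_NAME)
assert recovered == address

# Dropping the capsule leaves the wrapped memory untouched.
del capsule
assert buf.value == b"<html/>"
```

Because only the pointer changes hands, nothing is copied; whoever extracts the pointer has to free the underlying document exactly once.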