Bug #1247222 “[Enhancement] Add support for Beautiful Soup 4” : Bugs : calibre

Revision history for this message

Kovid Goyal (kovid) wrote on 2013-11-02: Re: calibre bug 1247222

#1

I'm afraid I am not going to add code to the calibre distribution to
support one recipe. Note that most of calibre (apart from the recipe
system, for legacy reasons) has moved to using html5lib and lxml for
html parsing. I suggest you do the same in your recipe. See the builtin
recipes for Caravan magazine or toi or houston chronicle for examples.

In lxml inserting before or after a node is as simple as:

node.getparent.insert(node.getparent().index(node), other_node)
and you can use the full power of xpath to select nodes.

status wontfix

Changed in calibre:
status:	New → Won't Fix

Revision history for this message

TomasHnyk (sup) wrote on 2013-11-02:

#2

Makes sense. I did not know about html5lib and lxml at all. I thought Beautiful Soup was the officialy recommended way. Would you please update the documentation here: http://manual.calibre-ebook.com/news.html to reflect that? I followed that and if I knew this, I would have started developing my recipe with the recommended options right away.

Revision history for this message

Kovid Goyal (kovid) wrote on 2013-11-03:

#3

Sure, I will add a note about using alternate parsers in parse_index.

Revision history for this message

TomasHnyk (sup) wrote on 2013-11-27:

#5

Is/will there be support for passing lxml.html.HtmlElement by preprocess_html and other similar methods that expect soup instances right now?

I can use lxml like this:
def preprocess_html(self,soup):
     raw = u''.join(unicode(a) for a in soup.contents)
     root = html.fromstring(raw)
    # do my stuff
    return BeautifulSoup(etree.tostring(root,encoding=unicode))

But it is ugly and it seems to bring some problems (some closing </divs> get missing, no idea why really). If html5lib and lxml is the way forward, it would be nice if it could be used everywhere in recipes (you were right, it is pretty handy). preprocess_html could take an optional boolean argument "pass_HtmlElement" and it could be used concurently with soup so old recipes would not break.

Revision history for this message

TomasHnyk (sup) wrote on 2013-11-27:

#6

Also, does calibre bundle lxml and html5lib? One (I do not remember which) of those two was no installed on my system (Ubuntu 13.10) - if I am to rely on its functionality, I would like to be sure the other users will have it installed.

I.e. if I could import those:
from lxml import html,etree
from lxml.builder import E
as I load BeautifulSoup from calibre like this:
from calibre.ebooks.BeautifulSoup import BeautifulSoup,Tag
that would be great. Or is it already possible?

Revision history for this message

Kovid Goyal (kovid) wrote on 2013-11-27:

#7

It is bundled. And no, I have no interest in changing the recipe system
to not use beautiful soup, it's too much work for too little gain. Note
that you can use the JavascriptRecipe class instead of BasicNewsRecipe
if you really want to use lxml everywhere -- though then you will be
running a full headless browser to do the downloads in.

Revision history for this message

TomasHnyk (sup) wrote on 2013-11-27:

#8

Good to know. I was not asking to change the system not to use BeautifulSoup, but to add the option to use lxml throughout. I think it would simplify matters for new recipe developers as they would not have to worry about BeautifulSoup at all.

I only do not understand why you recommended in comment #1 to use lxml for my recipe if it in fact is not supported. And if you are not going to (eventually) change the recipe system to use lxml, I do not understand why you do not upgrade BeautifulSoup to current version. Using lxml for feed parsing and BeautifulSoup for preprocessing makes little sense.

Oh I see, it is bundlel. lxml.__path__ gives a path to local calibre installation. I thought I was using my distribution package all along. It seems to me to mame more sense to put all bundled packages to ./calibre/lib/python2.7/site-packages/calibre so that they would be imported by
import calibre.pacakge instead of
import package
But maybe it is just me who was confused.

Revision history for this message

Kovid Goyal (kovid) wrote on 2013-11-27: Re: [Bug 1247222] Re: [Enhancement] Add support for Beautiful Soup 4

#9

On Wed, Nov 27, 2013 at 10:42:21AM -0000, TomasHnyk wrote:
> Good to know. I was not asking to change the system not to use
> BeautifulSoup, but to add the option to use lxml throughout. I think it
> would simplify matters for new recipe developers as they would not have
> to worry about BeautifulSoup at all.

Sure, but to do that, one would have to change the entire system. There
are no free lunches in life. And, as I already pointed out, there is
such an option, use JavascriptRecipe instead of BasicNewsRecipe.

>
> I only do not understand why you recommended in comment #1 to use lxml
> for my recipe if it in fact is not supported. And if you are not going

I recommended using it in parse_index, where is is supported just fine.

> to (eventually) change the recipe system to use lxml, I do not
> understand why you do not upgrade BeautifulSoup to current version.

Because BS 4 is not a superset of BS 3. Changing the HTML parser for a
system that parses thousands of websites is not something to be done lightly.

One of the key principles of writing software successfully is to not
change a working system just for "new and shiny".

> Using lxml for feed parsing and BeautifulSoup for preprocessing makes
> little sense.

Makes perfect sense to me. The amount of effort to re-write over a thousand
working recipes just to accomplish a change that has no actual end-user
effect in order to support the use case of one person that finds it
difficult to write both lxml and BS code is completely unjustifiable.

This whole thread is a completely pointless exercise. calibre has thousands
of recipes contributed by hundreds of people, plenty of evidence that
writing recipes is easy enough for the vast majority of people.

>
> Oh I see, it is bundlel. lxml.__path__ gives a path to local calibre installation. I thought I was using my distribution package all along. It seems to me to mame more sense to put all bundled packages to ./calibre/lib/python2.7/site-packages/calibre so that they would be imported by
> import calibre.pacakge instead of
> import package
> But maybe it is just me who was confused.

Yes, it is just you who are confused. You cannot in general move
arbitrary python packages into non top level namespaces since they could
have absolute imports inside them.

On Wed, Nov 27, 2013 at 10:42:21AM -0000, TomasHnyk wrote:
> Good to know. I was not asking to change the system not to use
> BeautifulSoup, but to add the option to use lxml throughout. I think it
> would simplify matters for new recipe developers as they would not have
> to worry about BeautifulSoup at all.

Sure, but to do that, one would have to change the entire system. There
are no free lunches in life. And, as I already pointed out, there is
such an option, use JavascriptRecipe instead of BasicNewsRecipe.

> 
> I only do not understand why you recommended in comment #1 to use lxml
> for my recipe if it in fact is not supported. And if you are not going

I recommended using it in parse_index, where is is supported just fine.

> to (eventually) change the recipe system to use lxml, I do not
> understand why you do not upgrade BeautifulSoup to current version.

Because BS 4 is not a superset of BS 3. Changing the HTML parser for a
system that parses thousands of websites is not something to be done lightly.

One of the key principles of writing software successfully is to not
change a working system just for "new and shiny".

> Using lxml for feed parsing and BeautifulSoup for preprocessing makes
> little sense.

Makes perfect sense to me. The amount of effort to re-write over a thousand
working recipes just to accomplish a change that has no actual end-user
effect in order to support the use case of one person that finds it
difficult to write both lxml and BS code is completely unjustifiable.

This whole thread is a completely pointless exercise. calibre has thousands
of recipes contributed by hundreds of people, plenty of evidence that
writing recipes is easy enough for the vast majority of people.

> 
> Oh I see, it is bundlel. lxml.__path__ gives a path to local calibre installation. I thought I was using my distribution package all along. It seems to me to mame more sense to put all bundled packages to ./calibre/lib/python2.7/site-packages/calibre so that they would be imported by
> import calibre.pacakge instead of 
> import package
> But maybe it is just me who was confused.

Yes, it is just you who are confused. You cannot in general move
arbitrary python packages into non top level namespaces since they could
have absolute imports inside them.

Revision history for this message

TomasHnyk (sup) wrote on 2013-11-27:

#10

Thanks fro the answer. Sure I can write it somehow and I can live with that (especially now that I learned how to use both lxml and BeautifulSoup:-)). I understand your concerns for backward compatibility, but on the other hand, if you do not upgrade at some point (with some transitional period when both versions can be used), it gets increasingly difficult to use the old versions. I mean, do you intend to use BS 3 five years from now on? By that time, every search on the internet will return examples with BS 4, it will get harder to use BS 3 on the developers machine for testing and so on.

But anyway, is not Beautiful Soup what makes using the recipes so slow? Sure, some time (one to two minutes on my connection) is spent downloading the content, but then it processes it even longer (it happens also with the economist recipe, so it is not only the recipe I am developing). They claim ( like here: http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html and here http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/ ) that it is ten or more times slower then lxml. That would actually be a user visible benefit:-).

calibre

[Enhancement] Add support for Beautiful Soup 4

Bug Description

Other bug subscribers

Remote bug watches