[Enhancement] Add support for Beautiful Soup 4

Bug #1247222 reported by TomasHnyk
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
calibre
Won't Fix
Undecided
Unassigned

Bug Description

Calibre bundles Beautiful Soup 3.
Beautiful Soup 4 has been stable for a year and a half now, according to its website.
I know that I could import v4 manually, but I am writing a recipe that I would like to share with other users and so I do not want to rely on functionality that they might lack.
v4 seems to be an improvement (and both can be used in parallel). I especially like the inser_after and insert_before methods and it can also reliably extract all visible text with the text method (useful for conuting words of an article). These would certainly simplyfy my code a lot.

Revision history for this message
Kovid Goyal (kovid) wrote : Re: calibre bug 1247222

I'm afraid I am not going to add code to the calibre distribution to
support one recipe. Note that most of calibre (apart from the recipe
system, for legacy reasons) has moved to using html5lib and lxml for
html parsing. I suggest you do the same in your recipe. See the builtin
recipes for Caravan magazine or toi or houston chronicle for examples.

In lxml inserting before or after a node is as simple as:

node.getparent.insert(node.getparent().index(node), other_node)
and you can use the full power of xpath to select nodes.

 status wontfix

Changed in calibre:
status: New → Won't Fix
Revision history for this message
TomasHnyk (sup) wrote :

Makes sense. I did not know about html5lib and lxml at all. I thought Beautiful Soup was the officialy recommended way. Would you please update the documentation here: http://manual.calibre-ebook.com/news.html to reflect that? I followed that and if I knew this, I would have started developing my recipe with the recommended options right away.

Revision history for this message
Kovid Goyal (kovid) wrote :

Sure, I will add a note about using alternate parsers in parse_index.

Revision history for this message
TomasHnyk (sup) wrote :

Is/will there be support for passing lxml.html.HtmlElement by preprocess_html and other similar methods that expect soup instances right now?

I can use lxml like this:
 def preprocess_html(self,soup):
     raw = u''.join(unicode(a) for a in soup.contents)
     root = html.fromstring(raw)
    # do my stuff
    return BeautifulSoup(etree.tostring(root,encoding=unicode))

But it is ugly and it seems to bring some problems (some closing </divs> get missing, no idea why really). If html5lib and lxml is the way forward, it would be nice if it could be used everywhere in recipes (you were right, it is pretty handy). preprocess_html could take an optional boolean argument "pass_HtmlElement" and it could be used concurently with soup so old recipes would not break.

Revision history for this message
TomasHnyk (sup) wrote :

Also, does calibre bundle lxml and html5lib? One (I do not remember which) of those two was no installed on my system (Ubuntu 13.10) - if I am to rely on its functionality, I would like to be sure the other users will have it installed.

I.e. if I could import those:
from lxml import html,etree
from lxml.builder import E
as I load BeautifulSoup from calibre like this:
from calibre.ebooks.BeautifulSoup import BeautifulSoup,Tag
that would be great. Or is it already possible?

Revision history for this message
Kovid Goyal (kovid) wrote :

It is bundled. And no, I have no interest in changing the recipe system
to not use beautiful soup, it's too much work for too little gain. Note
that you can use the JavascriptRecipe class instead of BasicNewsRecipe
if you really want to use lxml everywhere -- though then you will be
running a full headless browser to do the downloads in.

Revision history for this message
TomasHnyk (sup) wrote :

Good to know. I was not asking to change the system not to use BeautifulSoup, but to add the option to use lxml throughout. I think it would simplify matters for new recipe developers as they would not have to worry about BeautifulSoup at all.

I only do not understand why you recommended in comment #1 to use lxml for my recipe if it in fact is not supported. And if you are not going to (eventually) change the recipe system to use lxml, I do not understand why you do not upgrade BeautifulSoup to current version. Using lxml for feed parsing and BeautifulSoup for preprocessing makes little sense.

Oh I see, it is bundlel. lxml.__path__ gives a path to local calibre installation. I thought I was using my distribution package all along. It seems to me to mame more sense to put all bundled packages to ./calibre/lib/python2.7/site-packages/calibre so that they would be imported by
import calibre.pacakge instead of
import package
But maybe it is just me who was confused.

Revision history for this message
Kovid Goyal (kovid) wrote : Re: [Bug 1247222] Re: [Enhancement] Add support for Beautiful Soup 4

On Wed, Nov 27, 2013 at 10:42:21AM -0000, TomasHnyk wrote:
> Good to know. I was not asking to change the system not to use
> BeautifulSoup, but to add the option to use lxml throughout. I think it
> would simplify matters for new recipe developers as they would not have
> to worry about BeautifulSoup at all.

Sure, but to do that, one would have to change the entire system. There
are no free lunches in life. And, as I already pointed out, there is
such an option, use JavascriptRecipe instead of BasicNewsRecipe.

>
> I only do not understand why you recommended in comment #1 to use lxml
> for my recipe if it in fact is not supported. And if you are not going

I recommended using it in parse_index, where is is supported just fine.

> to (eventually) change the recipe system to use lxml, I do not
> understand why you do not upgrade BeautifulSoup to current version.

Because BS 4 is not a superset of BS 3. Changing the HTML parser for a
system that parses thousands of websites is not something to be done lightly.

One of the key principles of writing software successfully is to not
change a working system just for "new and shiny".

> Using lxml for feed parsing and BeautifulSoup for preprocessing makes
> little sense.

Makes perfect sense to me. The amount of effort to re-write over a thousand
working recipes just to accomplish a change that has no actual end-user
effect in order to support the use case of one person that finds it
difficult to write both lxml and BS code is completely unjustifiable.

This whole thread is a completely pointless exercise. calibre has thousands
of recipes contributed by hundreds of people, plenty of evidence that
writing recipes is easy enough for the vast majority of people.

>
> Oh I see, it is bundlel. lxml.__path__ gives a path to local calibre installation. I thought I was using my distribution package all along. It seems to me to mame more sense to put all bundled packages to ./calibre/lib/python2.7/site-packages/calibre so that they would be imported by
> import calibre.pacakge instead of
> import package
> But maybe it is just me who was confused.

Yes, it is just you who are confused. You cannot in general move
arbitrary python packages into non top level namespaces since they could
have absolute imports inside them.

Revision history for this message
TomasHnyk (sup) wrote :

Thanks fro the answer. Sure I can write it somehow and I can live with that (especially now that I learned how to use both lxml and BeautifulSoup:-)). I understand your concerns for backward compatibility, but on the other hand, if you do not upgrade at some point (with some transitional period when both versions can be used), it gets increasingly difficult to use the old versions. I mean, do you intend to use BS 3 five years from now on? By that time, every search on the internet will return examples with BS 4, it will get harder to use BS 3 on the developers machine for testing and so on.

But anyway, is not Beautiful Soup what makes using the recipes so slow? Sure, some time (one to two minutes on my connection) is spent downloading the content, but then it processes it even longer (it happens also with the economist recipe, so it is not only the recipe I am developing). They claim ( like here: http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html and here http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/ ) that it is ten or more times slower then lxml. That would actually be a user visible benefit:-).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.