Amazon source picks up ratings from ANY book on the page (suggested etc)

Bug #1245449 reported by danmb
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
calibre   Fix Released   Undecided    Unassigned

Bug Description

The XPaths in lib/calibre/calibre/ebooks/metadata/sources/amazon.py are too loose. They pick up ratings from ANY book on the analyzed page. If the book itself is unrated but one of the recommended books is rated, the parse_rating() function will incorrectly pick up the latter's rating. Consider the following procedure instead (tested by me, explanation below):

    import re

    def parse_rating(self, root, isbn):
        # Find the "N customer reviews" link that points to the current book page
        links = root.xpath(
            "//a[contains(@href, '/{}')]/text()[contains(., ' customer review')]/..".format(isbn))
        if not links:
            return None
        link = links[0]
        # Find the star rating next to that link
        titles = link.xpath(
            "..//*[contains(@title, ' out of ') and contains(@title, ' stars')]/@title")
        nrev = link.xpath("text()")[0].split(" ", 1)[0]  # number of reviews, as text
        if titles:
            m = re.match(r"([0-9.]+) out of ([0-9.]+) stars$", titles[0])
            if m:
                stars = float(m.group(1)) / float(m.group(2))  # fraction of the maximum
                return stars, nrev
        return None

It picks up any link with the text "N customer reviews" *that links to the current book page* (not to some other book). Of course, this code still needs i18n etc.

Kovid Goyal (kovid) wrote : Re: calibre bug 1245449

If you wish to exclude the ratings from the recommended books, a better
approach is to detect and remove that section from root before running
the xpaths. The xpaths are loose for a reason: Amazon's various servers
and software generations across the world serve up very different
markup.

danmb (danmbox) wrote :

Thanks for replying Kovid. I'm not sure I've gotten the problem across.

I think parse_rating() will pull a rating for a recommended book **as if it were a rating for the current book**. Therefore an unrated book B1 could be displayed in Calibre as having 3.6 stars, just because Amazon recommends B2 (on B1's page), and B2 happens to have 3.6 stars. This is misleading (and a bug).

If Amazon starts recommending a different book B3 (on B1's page) later on, Calibre might switch randomly to pulling B3's ratings, and suddenly show B1 as having 2.3 stars.
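To make the failure mode concrete, here is a minimal sketch. It does not use the real XPaths from amazon.py; `page_for_b1` and `loose_rating` are hypothetical helpers, the markup is invented, and the standard library's `xml.etree.ElementTree` stands in for the lxml tree so the snippet runs without extra dependencies. It only shows that a search not tied to the current book returns whatever rating happens to be on the page.

```python
import re
import xml.etree.ElementTree as ET

# Matches titles such as "3.6 out of 5 stars"
STAR_TITLE = re.compile(r"([0-9.]+) out of ([0-9.]+) stars$")

def page_for_b1(recommended_title, recommended_stars):
    """Build an invented page for unrated book B1 that recommends one rated book."""
    return ET.fromstring(
        "<html><body>"
        "<div id='main'><h1>B1</h1></div>"
        "<div id='recs'><span class='rec'>{}</span>"
        "<span title='{} out of 5 stars'/></div>"
        "</body></html>".format(recommended_title, recommended_stars)
    )

def loose_rating(root):
    """Return the first star rating found anywhere on the page, or None."""
    for el in root.iter():
        m = STAR_TITLE.match(el.get("title", ""))
        if m:
            return float(m.group(1))
    return None

# B1 has no rating of its own, yet a rating is reported, and it
# changes when the recommendation changes.
rating_today = loose_rating(page_for_b1("B2", "3.6"))
rating_later = loose_rating(page_for_b1("B3", "2.3"))
print(rating_today, rating_later)
```
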

My xpath is in a sense "looser" than the ones in amazon.py -- it doesn't look for "crAvgStars" or "averageCustomerReviews" or any specific names which might vary. It only looks for a link with the text "N customer reviews" **which points back to the current book**. There's no easy way to fix the ratings XPath, because the rating elements themselves don't link to any book, whereas the "N customer reviews" text always links to a book (and you can check its ASIN / ISBN).

Kovid Goyal (kovid) wrote :

The problem is that there is no way to be sure that link and text are present on all versions of Amazon's servers across the globe. Therefore, by changing the XPath, you run the risk of breaking rating fetching for some people.

What I am suggesting instead is that you remove the entire recommended books section from the page before running the xpaths. That way, they cannot pick up incorrect ratings.
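A rough sketch of that idea, under stated assumptions: the section id `recommendations` and the helpers `strip_sections` and `loose_star_titles` are invented for illustration, and this is not the change that was actually committed. It uses the standard library's `xml.etree.ElementTree` on a well-formed snippet so it runs anywhere. On an lxml tree the removal step would more likely go through `getparent().remove(...)`.

```python
import xml.etree.ElementTree as ET

# Invented markup: B1 is unrated, the recommended book B2 is rated.
PAGE = """
<html><body>
  <div id="product"><h1>Book B1</h1></div>
  <div id="recommendations">
    <div class="item">
      <a href="/dp/B2">Book B2</a>
      <span title="3.6 out of 5 stars"/>
    </div>
  </div>
</body></html>
""".strip()

def strip_sections(root, section_ids):
    """Remove, in place, every element whose id is in section_ids."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in list(root.iter()):
        if el.get("id") in section_ids and el in parent_of:
            parent_of[el].remove(el)

def loose_star_titles(root):
    """Loose search: every title attribute that looks like a star rating."""
    return [el.get("title") for el in root.iter()
            if " out of " in el.get("title", "")
            and el.get("title", "").endswith(" stars")]

root = ET.fromstring(PAGE)
before = loose_star_titles(root)           # sees B2's rating
strip_sections(root, {"recommendations"})  # hypothetical section id
after = loose_star_titles(root)            # nothing left to misattribute
print(before, after)
```

The loose search itself is unchanged, so markup variants it already handles keep working. Only the part of the tree that can contribute a foreign rating is dropped.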

Kovid Goyal (kovid) wrote : Fixed in master

Fixed in branch master. The fix will be in the next release. calibre is usually released every Friday.

 status fixreleased

Changed in calibre:
status: New → Fix Released