Statistics algorithm for sorting ratings looks fishy

Bug #894468 reported by Scott Ritchie
This bug affects 1 person
Affects: software-center (Ubuntu)
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Here's the current code snippet for sorting the Software Center Ratings:

import math

# pnormaldist(qn) is software-center's inverse standard-normal CDF helper
# (the z value at cumulative probability qn).

def wilson_score(pos, n, power=0.2):
    # lower bound of the Wilson score interval for pos positives out of n trials
    if n == 0:
        return 0
    z = pnormaldist(1 - power/2)
    phat = 1.0 * pos / n
    return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

def calc_dr(ratings, power=0.1):
    '''Calculate the dampened rating for an app given its collective ratings'''
    # ratings[i] is the number of (i+1)-star votes
    if not len(ratings) == 5:
        raise AttributeError('ratings argument must be a list of 5 integers')

    tot_ratings = 0
    for i in range(0, 5):
        tot_ratings = ratings[i] + tot_ratings

    sum_scores = 0.0
    for i in range(0, 5):
        ws = wilson_score(ratings[i], tot_ratings, power)
        sum_scores = sum_scores + float((i+1) - 3) * ws

    return sum_scores + 3
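
For reference, here is a quick way to exercise it (a sketch only: pnormaldist is the package's inverse standard-normal CDF helper, stubbed here with the standard library's NormalDist, and the sample ratings lists are made up):

from statistics import NormalDist

def pnormaldist(qn):
    # stand-in for software-center's helper: the z value at cumulative probability qn
    return NormalDist().inv_cdf(qn)

# ratings[i] is the number of (i+1)-star votes
print(calc_dr([0, 0, 0, 0, 1]))    # a single 5-star vote lands well below 5
print(calc_dr([0, 1, 10, 12, 2]))  # a larger sample moves the result toward its raw mean of 3.6
print(calc_dr([5, 0, 0, 6, 0]))    # a majority of 4-star votes can still end up below 3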

This looks very fishy to me, as we are calculating five separate Wilson scores per app (one per star level) and summing them. This is slow, and probably wrong. I'm not 100% sure what the right method is, but I did find this question on Math Overflow:

http://mathoverflow.net/questions/20727/generalizing-the-wilson-score-confidence-interval-to-other-distributions

The current answer there suggests using a standard normal distribution for large samples and a t-distribution for small ones (we do neither).

This website suggests a slightly different Wilson algorithm:
http://www.goproblems.com/test/wilson/wilson.php?v1=0&v2=0&v3=3&v4=2&v5=4

I will go further, and assert that we are making a conceptual error in trying to estimate a mean rating in the first place: ratings are fundamentally ordinal data, so a mean doesn't make much sense, for the same reason that "excellent" plus "terrible" does not balance out to "mediocre". Medians and percentiles, however, are still perfectly valid measures.

I will research this question a bit more, and probably post a question on the beta stats Stack Exchange site for advice. Intuitively, though, I think we may want a ratings algorithm that sorts primarily by median, and then, for the large number of cases where two apps share the same median (since there are only 5 possible rating values), breaks the tie with a Wilson lower bound on the probability that a rater of App A would rate at or above the median rather than below it.
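
To make that concrete, here is a rough sketch of the sort key I have in mind (the names are illustrative, and it reuses the wilson_score lower bound quoted above):

def lower_median(ratings):
    # ratings[i] is the number of (i+1)-star votes; returns the (lower) median star level
    total = sum(ratings)
    seen = 0
    for i, count in enumerate(ratings):
        seen += count
        if seen * 2 >= total:
            return i + 1

def sort_key(ratings, power=0.1):
    total = sum(ratings)
    if total == 0:
        return (0, 0.0)
    med = lower_median(ratings)
    at_or_above = sum(ratings[med - 1:])
    # primary sort on the median; ties broken by the Wilson lower bound of
    # P(a rater votes >= median)
    return (med, wilson_score(at_or_above, total, power))

# e.g. sorted(all_apps, key=lambda app: sort_key(app.ratings), reverse=True)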

Revision history for this message
Matthew Paul Thomas (mpt) wrote :

To reverse a quote from Donald Knuth: "Beware of bugs in the above code; I have only tried it, not proved it correct."

The current formula produces results that behave the way we want: the more ratings there are, the closer they get to their raw mean as opposed to 3.0. <https://wiki.ubuntu.com/SoftwareCenter#top-rated> It's quite possible that it is not exactly correct, and if making it exactly correct does not involve massive computation, certainly we should fix it. So, thanks for posting that question.

I disagree, though, that "ratings are fundamentally ordinal data" necessarily leads to "a mean doesn't make much sense". For example, shoe sizes and off-the-rack dress sizes are ordinal, but it still makes sense to talk about whether the average for either of those is increasing over time. One "Excellent" rating plus one "Terrible" rating does not lead to "Mediocre", but rather "We can't say with any confidence that this is better or worse than mediocre".

Revision history for this message
Scott Ritchie (scottritchie) wrote :

I thought shoe sizes and rack dress sizes were interval data, not ordinal: the difference between a size 10 and an 11 is the same as between a 5 and a 6 (some constant number of centimeters). You're quite right that the zero point is frequently arbitrary, but that still leaves means, variances, and the like as valid measures.

To defend my assertion of ordinality, consider a user who rates an app 1 star if it crashes instantly, and otherwise somewhere between 2 and 5 stars depending on quality, with 4 stars being "good" and 5 stars being "favorite". If a good chunk of users rated in this fashion, it would be very silly to claim that some percentage of users promoting the app from a 1 to a 2 means the same thing as that percentage promoting it from a 4 to a 5.

Put briefly, one star could be the equivalent of a size 0 dress on a size 14 woman, while 2 through 5 stars are sizes 10 through 14.

Revision history for this message
Scott Ritchie (scottritchie) wrote :

This is just a thought, but wouldn't it make sense to do the average rating computation server-side and just display it in the client? We could download the "average" at the same time we currently download the rating set. That would save us essentially all of the statistical computation time (and the need to import stats modules).

Revision history for this message
Matthew Paul Thomas (mpt) wrote :

If there were evidence that many people were rating applications the odd way you suggest, *and* that this was distorting dampened ratings, that would be a problem. But that's just speculation until someone surveys reviewers to ask them.

I have no opinion on whether the DR should be calculated on the server vs. the client.

Revision history for this message
Scott Ritchie (scottritchie) wrote :

In the absence of evidence, it's far safer to assume voters aren't rating consistently with one another; that's what treating the data as ordinal implies. A median also doesn't suffer from the perverse incentive to over- or under-vote to gain extra influence.

Under a mean system, when I can see that the rating is currently a 4 and I think the app deserves a 3, I can strategically vote 1 and get three times the pull on the average that an honest vote would give me. Under a median system this doesn't work: my incentive is always to vote honestly.
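
A toy illustration of the difference, using the raw mean and median rather than the dampened rating (numbers made up):

from statistics import mean, median

existing = [4] * 9            # nine 4-star votes so far
print(mean(existing + [3]))   # honest vote: 3.9 (the mean moves by 0.1)
print(mean(existing + [1]))   # strategic vote: 3.7 (three times the pull)
print(median(existing + [3])) # 4.0 either way: the median gives
print(median(existing + [1])) # no reward for exaggerating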

Some other nice properties of using the median as a primary sort:

1) If a majority of voters give an app a particular rating, the computed rating will always be within 0.5 points of that rating.
2) If a majority of voters give an app at least a particular rating, the computed rating will be no more than 0.5 points below it.

Neither of these holds in the current system: it's possible for a majority of voters to give an app 4 or higher and for that app to be rated 3 or worse (for example, six 4-star votes and five 1-star votes have a raw mean of about 2.6).

You mentioned that the ratings currently look about right, which is justifiably a good reason to be conservative here. I think the median actually is the conservative choice: it's the right measure when you're looking for wisdom-of-crowds effects, it's the winner under most real-life voting systems (http://en.wikipedia.org/wiki/Median_voter_theorem), and in most cases it happens to be about the same as the current system.

I'll parse the data we have a bit and make a few examples of what ratings currently are, what they would be, and under what circumstances they might change.

Revision history for this message
Matthew Paul Thomas (mpt) wrote :

If people aren't rating consistently with one another, that is a good thing; it means they aren't all rating in the odd way you suggested earlier. ;-)

Using the raw median would not have the dampened effect we want: a single rating of 5 stars would shoot an application to the top of the list, when it should not. Perhaps there's a way to dampen the median?

Revision history for this message
Scott Ritchie (scottritchie) wrote :

I had thought I proposed a dampening algorithm above, but maybe it was on the Cross Validated post. Anyway, to be clear:

1) Compute the median, then count three numbers: ratings below the median (x), equal to the median (y), and above the median (z).
2) The score is the median plus the probability a voter rates above the median minus the probability a voter rates below it. For large sample sizes this approaches median + z/(x+y+z) - x/(x+y+z), which equals median + (z-x)/(x+y+z).
3) Since we're dealing with estimates of probabilities, rather than using the raw proportions as estimators we use the bounds of the Wilson score interval: the lower bound for the probability of rating above the median and the upper bound for the probability of rating below it. That gives something analogous to what we do now: new apps with small sample sizes are punished slightly, but as votes accrue they approach the value in 2).

Note that the probabilities in question cannot exceed 0.5 (even when very few people rate exactly at the median), which means we're neither adding nor subtracting more than half a point from the median. This makes the sample median the dominant sorting factor, but the other ratings still matter.
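
A minimal sketch of that calculation, reusing pnormaldist and the wilson_score lower bound from the code quoted in the description (wilson_upper and the exact form here are my own illustration, not settled):

import math

def wilson_upper(pos, n, power=0.2):
    # upper bound of the same Wilson interval as wilson_score above
    if n == 0:
        return 0
    z = pnormaldist(1 - power/2)
    phat = 1.0 * pos / n
    return (phat + z*z/(2*n) + z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

def dampened_median(ratings, power=0.1):
    # ratings[i] is the number of (i+1)-star votes
    total = sum(ratings)
    if total == 0:
        return 0
    # lower median: smallest star level whose cumulative count reaches half the votes
    seen, med = 0, 5
    for i, count in enumerate(ratings):
        seen += count
        if seen * 2 >= total:
            med = i + 1
            break
    below = sum(ratings[:med - 1])   # x: votes below the median
    above = sum(ratings[med:])       # z: votes above the median
    # lower bound on P(vote above median), upper bound on P(vote below median),
    # so small samples are nudged downward, as with the current code
    return med + wilson_score(above, total, power) - wilson_upper(below, total, power)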

Some quick examples of what this would produce:
 - If an app received 70% median votes of 3, 10% below the median, and 20% above, it would rate about 3.1.
 - If an app had just 10 votes in the same proportions, it would receive a slightly lower rating, since the error bars on the Wilson score would be wider.
 - If an app received mostly votes of 4, with roughly equal numbers of 5s and 3s, it would be rated about 4.0, since the positive and negative portions would cancel each other out.
 - If an app received almost entirely votes of 4, it would also receive about a 4.0 rating.
 - If an app received a large but approximately equal number of 3s and 4s, it would be rated about 3.5, either because its median was 3 and there was about a 0.5 chance someone would rate higher, or because its median was 4 and there was about a 0.5 chance someone would rate lower.

Revision history for this message
Scott Ritchie (scottritchie) wrote :

To address the specific example of a single rating of 5: the algorithm would spit out roughly 4.5 (5 is the sample median, but since the sample size is so small there is a very wide error bar on the probability that a new voter would rate below it, and we use the conservative, upper end of that estimate).

tags: added: client-server
Revision history for this message
Michael Vogt (mvo) wrote :

Thanks to both Scott and Matthew for your detailed discussion. I am not concerned about the speed of the algorithm, as it's relatively simple math. The conceptual points made here are worth more investigation, but given the nature of the problem it will require a bit of time to think this through and to run the tweaked algorithm on our sample data. So I am marking this as Triaged. I think it's a fascinating problem and deserves some more investigation, even if the result might be that we just keep our current approach.

Changed in software-center (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Scott Ritchie (scottritchie) wrote :

The Wilson score method seems to have been inspired by this post, "How Not To Sort By Average Rating":

http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

Here is a better method for star ratings from the same author:

http://www.evanmiller.org/ranking-items-with-star-ratings.html

