[HN Gopher] How not to sort by average rating (2009)
___________________________________________________________________

How not to sort by average rating (2009)

Author : soheilpro
Score  : 190 points
Date   : 2021-11-12 15:23 UTC (7 hours ago)

(HTM) web link (www.evanmiller.org)
(TXT) w3m dump (www.evanmiller.org)

| rkuykendall-com wrote:
| This article inspired me so much that I based my shitty undergrad senior thesis on it. My idea was to predict the trend of the ratings using, I think, a trailing weighted average, weighted toward the most recent window. It managed to generate ratings that were more "predictive" of the following 6 months on the Amazon dataset I used, but I doubt it would have held up to much scrutiny. I learned a ton, though!
|
| Edit: Link to the paper, which looks like it actually attempts to use a linear prediction algorithm: https://github.com/rkuykendall/rkuykendall.com/blob/e65147f6...

| kazinator wrote:
| This still has the problem that some item with 12 votes will be ranked higher than some item with 12,000 votes. Oh, and it also has the problem that some item with 12 votes will be ranked lower than some item with 12,000 votes.
|
| I think you simply need separate search categories for this.
|
| Say I want to look for underrated or undiscovered gems:
|
| "Give me the best-ranked items that have 500 votes or less."
|
| It is misleading to throw a 12-vote item into the same list as a 12,000,000-vote item and present them as being ranked relative to each other.

| taormina wrote:
| This is a blast from the past. It's also surprisingly simple to implement his "correct" sort. Seriously, this link should make the rounds every year or so here.

| [deleted]

| hwbehrens wrote:
| While I agree with the author in principle, I think there is an implicit criterion they ignore, which is intuitive correctness from the perspective of the user.
|
| Imagine a user chooses "Sort by rating" and subsequently observes an item with a 4.5 average ranked above an item with a 5.0 average because it has a higher Wilson score. Some portion of users will think "Ah, yes, this makes sense because the 4.5 rating is based on many more reviews, therefore its Wilson score is higher," but the vast, vast majority of users will think "What the heck? This site is rigging the system! How come this one is ranked higher than that one?", eroding confidence in the rankings.
|
| In fact, these kinds of black-box rankings* frequently land sites like Yelp in trouble, because it is natural to assume that the company has a finger on the scale, so to speak, when it is in its financial interest to do so. In particular, entries with a higher Wilson score are likely to be more expensive, because their ostensibly superior quality commands (or depends upon) a higher price, exacerbating this effect through perceived higher margins.
|
| So the next logical step is to present the Wilson score directly, but this merely shifts the confusion elsewhere -- the user may find an item they're interested in buying, find it has one 5-star review, and yet its Wilson score is << 5, producing at least the same perception and possibly a worse one.
|
| Instead, providing the statistically sound score but de-emphasizing or hiding it, such as by making it accessible in the DOM but not visible, allows alternative sorting mechanisms to be built via e.g. browser extensions for the statistically minded, without sacrificing the intuition of the top-line score.
|
| * I assume that most companies would choose not to explain the statistical foundations of their ranking algorithm.
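For reference, the "correct" sort taormina mentions -- and the Wilson score hwbehrens discusses -- is the lower bound of the Wilson score confidence interval on the fraction of positive ratings. A minimal sketch in Python (the function name and the 95% default are mine, not the article's):

    from math import sqrt

    def wilson_lower_bound(pos, n, z=1.96):
        """Lower bound of the Wilson score interval for a Bernoulli
        proportion; z = 1.96 corresponds to ~95% confidence."""
        if n == 0:
            return 0.0
        phat = pos / n  # observed fraction of positive ratings
        centre = phat + z * z / (2 * n)
        spread = z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
        return (centre - spread) / (1 + z * z / n)

Sorting by this bound rather than by the raw average is what keeps small samples in check: one upvote with no downvotes scores about 0.21, while 100 upvotes against 3 downvotes scores about 0.92, so the item with more evidence ranks higher despite its lower average.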
| jkaptur wrote:
| That's a really good point. I wonder if folks would intuitively get it if you provided a little data visualization (visible on hover or whatever). Like:
|
| Result 1: (4.5 )
|
| Result 2: (5.0 )
|
| edit: HN stripped out the unicode characters :(. I was using something like this: https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-spar...

| SerLava wrote:
| You could probably get around this by
|
| A) labelling 1-2-review items with a "needs more reviews" message
|
| Or B) not giving an aggregate review score for low-review items. Actually _replacing_ the review star bar with "needs more reviews". Then when the user goes from the listing page to the detail page, you can show the reviews next to a message saying "this item only has a few reviews, so we can't be sure they're accurate until more people chime in"

| fennecfoxen wrote:
| C) normalizing the display of stars to the score

| nkrisc wrote:
| I worked on an e-commerce site that attempted to solve the issue by simply not giving an average rating to an item until it had a certain number of reviews. We still showed the reviews and their scores, but there was no top-level average until it had enough reviews. We spent a lot of time in user testing and with surveys trying to figure out how to effectively communicate that.

| jahewson wrote:
| I think this can be solved with better UI: instead of stars, show a sparkline of the distribution of the scores. The user can then see the tiny dot representing the single 5-star review and the giant peak representing the many 4-star reviews.

| 1024core wrote:
| This is a UX problem, which can be solved by not showing the exact rating but instead a "rating score", which is the Wilson score.

| alecbz wrote:
| OP addressed that:
|
| > So the next logical step is to present the Wilson score directly, but this merely shifts the confusion elsewhere -- the user may find an item they're interested in buying, find it has one 5-star review, and yet its Wilson score is << 5, producing at least the same perception and possibly a worse one.
|
| Though I'm not convinced how big of a deal this is. Even if you're worried about it, a further optimization may be to simply not display the score until there are enough reviews that it's unlikely anyone will manually compute the average rating.

| dfabulich wrote:
| In another article, the author (Evan Miller) recommends not showing the average unless there are enough ratings. You would say "2 ratings" but not show the average, and just sort the item wherever it falls algorithmically.
|
| https://www.evanmiller.org/ranking-items-with-star-ratings.h...
|
| In that article, he even includes a formula for how many ratings you'd need:
|
| > _If you display average ratings to the nearest half-star, you probably don't want to display an average rating unless the credible interval is a half-star wide or less_
|
| In my experience, the second article is more generally useful, because it's more common to sort by star rating than by thumbs-up/thumbs-down ranking, which is what the currently linked article is about.
|
| And the philosophical "weight on the scale" problem isn't as bad as you'd think when using these approaches. If you see an item with a perfect 5-star average and 10 reviews ranked below an item with a 4.8-star average and 1,000 reviews, and you call the sort ranking "sort by popularity," it's pretty clear that the item with 1,000 reviews is "more popular."
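For the star-rating case dfabulich describes, the same idea applies: score each item by a lower credible bound on its mean rating, computed from the per-star counts. A sketch under a uniform Dirichlet prior, which is how I read the linked star-ratings article (the prior choice and the 95% z-value here are my assumptions):

    from math import sqrt

    def star_lower_bound(counts, z=1.96):
        """Lower credible bound on the mean star rating, where
        counts[k] is the number of (k + 1)-star ratings. The uniform
        Dirichlet prior adds one phantom rating per star level."""
        K = len(counts)
        N = sum(counts)
        mean = sum((k + 1) * (n + 1) for k, n in enumerate(counts)) / (N + K)
        second = sum((k + 1) ** 2 * (n + 1) for k, n in enumerate(counts)) / (N + K)
        # variance of the posterior mean shrinks as ratings accumulate
        return mean - z * sqrt((second - mean ** 2) / (N + K + 1))

On dfabulich's example this behaves as he describes: ten 5-star ratings score roughly 3.7, while 1,000 ratings averaging 4.8 (say, 800 five-star and 200 four-star) score roughly 4.76, so the heavily reviewed item sorts first.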
| sdwr wrote:
| Not having faith in the user is a giant step towards mediocrity. Does a weighted average provide better results? Then use a weighted average! The world isn't split into an elite group of power users and the unwashed masses. There are just people with enough time and attention to fiddle with browser extensions, and everyone else. And all of them want the best result to show up first.
|
| Yelp didn't get dinged because their algorithms were hidden. They lost credibility because they were extorting businesses. Intention matters.

| enlyth wrote:
| I don't think this is an easy problem to solve.
|
| The inherent problem, to me, is that we're trying to condense reviews into the tiny signal of an integer in the range of 1 to 5.
|
| For many things, this simply doesn't cut it.
|
| 2 stars: what does that mean? Was the coffee table not the advertised shade of grey? Does the graphics card overheat on medium load because of a poor cooler design? Was the delivery late (not related to the product, but many people leave these kinds of reviews)? Did you leave a 2-star review because you don't like the price but you didn't actually order the product?
|
| I've seen all of these things in reviews, and I've learned to ignore star ratings: not only can they be gamed, they are essentially useless.
|
| Props to users who take the time to write out detailed reviews of products, which give you an idea of what to expect without having to guess what a star rating means, although sometimes these can be gamed as well, as many sellers on Amazon and such will just give out free products in exchange for favourable reviews.
|
| Being a consumer is not easy these days; you have to be knowledgeable about what you're buying and assume every seller is an adversary.

| strken wrote:
| The problem with having faith in your users is that you have to actually do it. If you're sorting by Wilson score when the user clicks a column that displays a ranking out of five, then you're mixing two scores together in a frustrating way because you think your users are too dumb to understand.
|
| There has to be a way to let users choose between "sort by rating, but put items without many reviews lower" and "sort by rating, even items with only one or two reviews" in a way that gives control back to them.

| sdwr wrote:
| The way I've seen it done is a single column with avg stars + # reviews, which isn't clickable, because why would you want to sort by minimum ranking?

| IggleSniggle wrote:
| If you don't provide a "Sort by rating" option but instead include options like sort by "popularity," "relevance," "confidence," or similar, then it is a more accurate description, more useful to the user, and not as misleading about what is being sorted.
|
| I agree that if I "sort by rating" then an average-rating sort is expected. The solution is to simply not make sorting by rating an option, or to keep the bad sorting mechanism but de-emphasize it in favor of the more useful sort. Your users will quickly catch on that you're giving them a more useful tool than "sort by average rating."

| crooked-v wrote:
| I think you're overemphasizing the confusion that an alternate ranking schema would cause.
| We have Rotten Tomatoes as a very obvious example of one that a lot of people are perfectly happy with, even though it's doing something very different from the usual meaning of X% ratings.
|
| I feel like all that's really needed is a clear indicator that it's some proprietary ranking system (for example, "Tomatometer" branding), plus a plain-language description of what it's doing for people who want to know more.

| tablespoon wrote:
| > Imagine a user chooses "Sort by rating" and subsequently observes an item with a 4.5 average ranked above an item with a 5.0 average because it has a higher Wilson score. Some portion of users will think "Ah, yes, this makes sense because the 4.5 rating is based on many more reviews, therefore its Wilson score is higher," but the vast, vast majority of users will think "What the heck? This site is rigging the system! How come this one is ranked higher than that one?", eroding confidence in the rankings.
|
| It also erodes confidence in ratings when something with one fake 5-star review sorts above something else with 1,000 reviews averaging 4.9.
|
| I think you're mainly focusing on the very start of a learning curve; eventually people get the hang of the new system, especially if it's named correctly (e.g. "sort by review-count-weighted score").

| mandelbrotwurst wrote:
| I'd opt for a simpler and less precise name like "Sort by Rating", but then offer the more precise definition via a tooltip or something, to minimize complexity for the typical user but ensure that accurate information is available for those who are interested.

| nkrisc wrote:
| Better, in my opinion, not to give an item a rating until it has some number of reviews. You can still show the reviews, but treat it as unrated.

| dfabulich wrote:
| I prefer to call it "Sort by Popularity."

| mc32 wrote:
| I don't like that measure because popularity doesn't translate into "good".
|
| What's the most popular office pen? Papermate, Bic? I may be looking for more quality.
|
| What's the most popular hotel in some city? Maybe I'm looking for location or other aspects beyond popularity among college kids.

| dfabulich wrote:
| When you use the OP article's formula, you're sorting by popularity. You may choose not to sort by popularity, but when you use it, you should _call_ it sorting by "popularity."

| alecbz wrote:
| This is a fair point, but it's not as if knowing which items are actually good is something that should only be available to power users. The real goal ought to be making sure your customers get access to actually good things, not merely satisfying what might be some customers' naive intuition that things with higher average ratings are actually better.
|
| I think there are better approaches that can be taken here to address possible confusion. E.g., if the Wilson score rating ever places an item below ones with a higher average rating, put a little tooltip next to that item's rating that says something like "This item has fewer reviews than ones higher up in the list." You don't need to understand the full statistical model to have the intuition that things with only a few ratings aren't as "safe".
| giovannibonetti wrote:
| In order to deal with that, I would offer two sorting options related to the average:
|
| - regular average
|
| - weighted average (recommended, default)
|
| Then the user can pick the regular average if they want, whereas the so-called weighted average (the algorithm described in the article) would be the default choice.

| ChrisArchitect wrote:
| Anything new here?
|
| Some previous discussions:
|
| _4 years ago_ https://news.ycombinator.com/item?id=15131611
|
| _6 years ago_ https://news.ycombinator.com/item?id=9855784
|
| _10 years ago_ https://news.ycombinator.com/item?id=3792627
|
| _13 years ago_ https://news.ycombinator.com/item?id=478632
|
| Reminder: you can enjoy the article without upvoting it

| dang wrote:
| Thanks! Macroexpanded:
|
| _How Not to Sort by Average Rating (2009)_ - https://news.ycombinator.com/item?id=15131611 - Aug 2017 (156 comments)
|
| _How Not to Sort by Average Rating (2009)_ - https://news.ycombinator.com/item?id=9855784 - July 2015 (59 comments)
|
| _How Not To Sort By Average Rating_ - https://news.ycombinator.com/item?id=3792627 - April 2012 (153 comments)
|
| _How Not To Sort By Average Rating_ - https://news.ycombinator.com/item?id=1218951 - March 2010 (31 comments)
|
| _How Not To Sort By Average Rating_ - https://news.ycombinator.com/item?id=478632 - Feb 2009 (56 comments)

| oehpr wrote:
| Maybe what we need here is an extension where you can filter out articles?
|
| It adds a click event to each link for the article, and then, after a day has passed, starts filtering that link out of HN results. I give it a gap of a day because maybe you'd want to return and leave a comment.
|
| I might try my hand at a greasemonkey script if you're interested.
|
| Though, personally, I have no great issue seeing high-quality posts again occasionally.

| rdlw wrote:
| This is a genuine question: is there an HN guideline that says not to upvote reposts?
|
| I don't know if I knew about HN four years ago; if I did, I almost certainly missed that post, and if I didn't, I certainly don't remember the interesting discussion in the comments.
|
| I enjoyed the article and I'm not sure I see a reason not to upvote it.

| svnpenn wrote:
| 4 years? I think that's fine for a repost.

| chias wrote:
| The new thing is another cohort of people getting to be today's lucky 10,000.

| edude03 wrote:
| I'm one of them and I appreciate the repost

| iyn wrote:
| https://xkcd.com/1053/

| monkeybutton wrote:
| I was just looking for some of his old blog posts about A/B testing the other day. Since I first read them, I'd lost my bookmarks and forgotten his name. Do you know how bad the Google search results for A/B testing are now? They're atrocious! SEO services and low-content Medium posts as far as the eye can see! I was only able to rediscover his blog after finding links to it in the readme of a random R project on GitHub.

| mbauman wrote:
| I'd love to see an update here that:
|
| * Included a graph of the resulting ordering of the two-dimensional plane, with some examples
|
| * Included consideration of 5- or 10-star scales

| abetusk wrote:
| They have an article about K-star rating systems [0] which uses Bayesian approximation [1] [2] (something I know little to nothing about; I'm just regurgitating the article).
|
| There's a whole section on their website with different statistics for programmers, including rating systems [3].
| [0] https://www.evanmiller.org/ranking-items-with-star-ratings.h...
|
| [1] https://en.wikipedia.org/wiki/Approximate_Bayesian_computati...
|
| [2] https://www.evanmiller.org/bayesian-average-ratings.html
|
| [3] https://www.evanmiller.org/ ("Mathematics of user ratings" section)

| ScoutOrgo wrote:
| The formula still works for scales of 5 or 10; you just have to divide by the max rating first and then multiply by it again at the end.
|
| For example, a 3/5-star rating turns into a 0.6 positive and 0.4 negative observation. Following the formula from there will give a lower-bound estimate between 0 and 1, so you just multiply by 5 again to get it between 0 and 5.
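A sketch of ScoutOrgo's rescaling trick, reusing the wilson_lower_bound function sketched earlier on this page. One caveat worth hedging: the Wilson interval formally assumes yes/no observations, so feeding it fractional pseudo-votes is a heuristic rather than a proper confidence bound:

    def star_wilson(ratings, max_stars=5, z=1.96):
        """Wilson lower bound on a star-rating scale: a k-star
        rating counts as k/max_stars of a positive observation,
        and the [0, 1] bound is scaled back up to the star scale."""
        n = len(ratings)
        if n == 0:
            return 0.0
        pos = sum(r / max_stars for r in ratings)  # fractional positives
        return wilson_lower_bound(pos, n, z) * max_stars

So a single 3-star rating on a 5-star scale contributes pos = 0.6 of one observation, exactly as described above.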
| WalterGR wrote:
| (2009)

| karaterobot wrote:
| Is there a better solution now?

| WalterGR wrote:
| No idea. It's customary to include the year in HN submission titles if it was published before the current year. When I made my comment, the title didn't include the year.

| driscoll42 wrote:
| One alternative is SteamDB's solution: https://steamdb.info/blog/steamdb-rating/

| 1970-01-01 wrote:
| My anecdotally accurate advice (AAA) is to always read 2-star reviews before purchase.

| actually_a_dog wrote:
| Why 2-star? I get the whole "forget about the 5-star reviews, because they're not going to tell you any of the downsides of the product," and "forget the 1-star reviews, because they're often unrelated complaints about shipping or delivery, and generally don't tell you much about the product." But why not 3-star reviews?
|
| I generally pay the most attention to 3-star reviews, because they tend to be pretty balanced and actually tell you the plusses and minuses of the product. It seems like 2-star reviews would be somewhat like that, but leaning toward the negative/critical side. Is the negative/critical feedback what you're after?

| 1970-01-01 wrote:
| Because therein I find the best explanations for product failures. 3-star reviews tend to contain fewer failures and more "this could have been much better if they ___". Again, it's anecdotal. I have no data to back my words.

| gowld wrote:
| "3 stars" means "meh, it's fine. I don't want to commit to rating, but I'm not a sucker who gives 5 to everything."
|
| "2 stars" means "I really don't like it, but I can control my emotions and explain myself."

| jedberg wrote:
| Fun fact: this article inspired the sysadmin at XKCD to submit a patch to open-source reddit to implement this sort on comments. It still lives today as the "best" sort.
|
| The blog post that explained it: https://web.archive.org/web/20091210120206/http://blog.reddi...

| bradbeattie wrote:
| There are a number of approaches to this, with increasing complexity:
|
| - Sum of votes divided by total votes
|
| - More advanced statistical algorithms that take confidence into account (as this article suggests)
|
| - Recommendation engines that provide a rating based on your taste profile
|
| But I'm pretty sure you could take this further depending on what data you're looking to feed in and what the end users' expectations of the system are.

| voldemort1968 wrote:
| Similarly, the problem of calculating "Trending": https://smosa.com/adam/code-and-technology

| chias wrote:
| I've been using this at work for the last year or so to great success.
|
| For example, we have an internal phishing simulation/assessment program and want to track metrics like improvement and general uncertainty. Since implementing this about a year ago, we've been able to make great improvements, such as:
|
| * for a given person, identify the Wilson lower bound on the probability that they would _not_ get phished if they were targeted
|
| * for the employee population as a whole, determine the 95% uncertainty on whether a sampled employee would get phished if targeted
|
| It lets us make much more intelligent inferences about things and much more accurate risk assessments, and also lets us improve the program pretty significantly (e.g. your probability of being targeted is weighted by a combination of your Wilson lower bound and your Wilson uncertainty).
|
| There are SO MANY opportunities to improve things by using this method. Obviously it isn't applicable everywhere, but I'd suggest you look at any metrics you have that use an average and just take a moment to ask yourself if a Wilson bound would be more appropriate, or might enable you to make marked improvements.

| user5994461 wrote:
| Sounds like people who don't read their emails would get the best score because they don't get phished.

| chias wrote:
| Pretty much, yep :) They're also less likely to get phished in general.
|
| Though this property may be suboptimal for other reasons.

| anthony_r wrote:
| This is cool. But what I usually do is replace x/y with x/(y+5), and hope for the best :). The 5 can be replaced by 3 or 50, depending on what I'm dealing with.
|
| (In less important areas than sorting things by ratings to directly rank things for users; I've mentally bookmarked this idea for the next time I need something better, as this clearly looks better.)

| mattb314 wrote:
| Heads up: this weights all your scores towards 0. If you want to avoid this, an equally simple approach is to use (x+3)/(y+5) to weight towards 3/5, or any (x+a)/(y+b) to weight towards a/b. It turns out that this seemingly simple method has some (sorta) basis in mathematical rigor: you can model x and y as the successes and total attempts from repeated Bernoulli trials, a and b-a as the parameters of a Beta prior distribution, and the final score as the mean of the updated posterior distribution: https://en.wikipedia.org/wiki/Beta_distribution#Bayesian_inf...
|
| (I first saw this covered in Murphy's Machine Learning: A Probabilistic Perspective, which I'd recommend if you're interested in this stuff)
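A sketch of the pseudo-count smoothing mattb314 describes (the function name and the a = 3, b = 5 defaults are just illustrative):

    def smoothed_score(x, y, a=3.0, b=5.0):
        """mattb314's (x + a) / (y + b): the posterior mean of a
        Beta(a, b - a) prior (proper when 0 < a < b) updated with
        x successes out of y trials. An unrated item starts at a/b
        (0.6 here), and accumulating votes pull the score toward x/y."""
        return (x + a) / (y + b)

anthony_r's x/(y+5) is the a = 0 case, which is why it pulls everything toward zero; with a = 3 and b = 5, a new item starts at 0.6, and ten upvotes out of ten raise it to (10+3)/(10+5), about 0.87.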
| zzzeek wrote:
| If you don't have PostgreSQL, it might be hard to create an index on that function. You can use a trigger that updates a fixed field on the row each time positive/negative changes, or otherwise run the calc and include it in your UPDATE statement when those numbers change.

| Waterluvian wrote:
| You can't rate 0 stars, so the entire range is shifted by 1 star. This makes any star rating system fatally flawed to begin with.
|
| Humans will see 3 stars and not perceive that as 50%.

| feoren wrote:
| Is that really a _fatal_ flaw? It's humans reading the ratings, and humans doing the ratings, so our human factors might balance out a bit. I don't think people come in expecting the rating system to be perfectly linear, because we have a mental model of how other humans rate things -- 1 star and 5 stars are very common, even when there are obviously ways the thing could be worse/better. So even though 3 stars sounds like more than 50%, most people would consider 3.0 stars a very poor rating.

| Waterluvian wrote:
| I think you make a good point. But I don't think it completely defeats the bias. Especially given that the star system that existed before the Web had 0 and half stars.
|
| It seems like it's purely a result of widget design deficiency: how do you turn a null into a 0 with a star widget? (You could add an extra button, but naturally designers will poo-poo that.)

| Macha wrote:
| Percentage systems aren't immune to this: various pieces of games media were often accused of using a 70-100% rating scale. Anything below 70 was perceived as a terrible game, and they didn't want to harm their relationship with publishers. So 70 became "you might like it if there are some specifics that appeal to you" and 80 was a pretty average game.

| WithinReason wrote:
| IIRC, a simple approximation of that horrendous formula is:
|
| (positive)/(positive+negative+1)
|
| It rewards items with more ratings. Basically, you initialize the number of negative ratings to 1 instead of 0.

| akamoonknight wrote:
| Very interesting; your recollection looks correct to me.
|
| x / (x+y+1) :: https://www.wolframalpha.com/input/?i=plot+x+%2F%28x+%2B+y+%...
|
| horrendous formula :: https://www.wolframalpha.com/input/?i=plot+%28%28x%2F%28x%2B...
|
| Much less prone to typos.

| gowld wrote:
| The main flaw in this formula is that when positive=0, the negative votes have no weight.

| rdlw wrote:
| A heuristic I use when looking at products with low numbers of reviews is to add one positive and one negative review, so
|
| (positive+1)/(positive+negative+2).
|
| This basically makes the 'default' rating 50% or 3 stars or whatever, and votes move the rating away from that default.

| raldi wrote:
| This is a decent approximation. It handles all the common hazard cases:
|
| +10/-0 should rank higher than +1/-0
|
| +10/-5 should rank higher than +10/-7
|
| +100/-3 should rank higher than +3/-0
|
| +10/-1 should rank higher than +900/-200

| DangerousPie wrote:
| One of my sites has been using a ranking algorithm based on this article for over 10 years now. Nobody ever complained, so it must be pretty good.

| truculent wrote:
| A simpler solution:
|
| Weighted score = (positive + alpha) / (total + beta)
|
| In which alpha and beta are the mean number of positive and total votes, respectively. You may wish to estimate optimal values of alpha and beta subject to some definition of optimal, but I find the mean tends to work well enough for most situations.
___________________________________________________________________
(page generated 2021-11-12 23:00 UTC)