
Communications of the ACM


What Is a Good Recommendation Algorithm?

Geeky Ventures Founder Greg Linden

Netflix is offering one million dollars for a better recommendation engine.  Better recommendations clearly are worth a lot. 

But what are better recommendations?  What do we mean by better?

In the Netflix Prize, the meaning of better is quite specific.  It is the root mean squared error (RMSE) between the actual ratings Netflix customers gave the movies and the predictions of the algorithm.
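As a minimal sketch (my own illustration, not anything from the contest itself), the metric can be computed like this:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual ratings and predictions."""
    assert len(actual) == len(predicted)
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))

# Toy example: three movies, ratings on Netflix's 1-5 star scale.
print(rmse([4.0, 2.0, 5.0], [3.5, 2.5, 4.0]))  # ≈ 0.707
```

Winning the Prize meant driving this single number down 10% across the whole test set, regardless of which movies contributed the error.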

Let's say we build a recommender that wins the contest.  We reduce the error between our predictions and what people actually will rate by 10% over what Netflix used to be able to do.  Is that good?

Depending on what we want, it might be very good.  If what we want to do is show people how much they might like a movie, it would be good to be as accurate as possible on every possible movie.

However, this might not be what we want.  Even in a feature that shows people how much they might like any particular movie, people care a lot more about misses at the extremes.  For example, it could be much worse to say that you will be lukewarm (a prediction of 3 1/2 stars) on a movie you love (an actual of 4 1/2 stars) than to say you will be slightly less lukewarm (a prediction of 2 1/2 stars) on a movie you are lukewarm about (an actual of 3 1/2 stars).
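To make the point concrete, here is a toy comparison (entirely my own illustration; the weighting scheme is hypothetical, not part of the Prize) between plain squared error and a variant that scales the penalty by how much the viewer actually liked the movie:

```python
def squared_error(actual, predicted):
    return (actual - predicted) ** 2

def weighted_error(actual, predicted):
    # Hypothetical: scale the penalty by the true rating, so misses
    # on movies the viewer loves cost more than misses on lukewarm ones.
    return actual * (actual - predicted) ** 2

# Case 1: predict lukewarm (3.5) on a movie the viewer loves (4.5).
# Case 2: predict slightly less lukewarm (2.5) on a lukewarm movie (3.5).
print(squared_error(4.5, 3.5), squared_error(3.5, 2.5))  # 1.0 1.0 -- RMSE sees no difference
print(weighted_error(4.5, 3.5), weighted_error(3.5, 2.5))  # 4.5 3.5 -- the loved movie hurts more
```

Both misses are exactly one star, so RMSE treats them identically; any metric that cares about the extremes has to break that symmetry somehow.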

Moreover, what we often want is not to make a prediction for any movie, but to find the best movies.  In TopN recommendations, a recommender is trying to pick the best 10 or so items for someone.  It does not matter whether you can predict what people will hate, or distinguish shades of lukewarm.  The only thing that matters is picking 10 items someone will love.

A recommender that does a good job predicting across all movies might not do the best job picking the TopN movies.  RMSE penalizes errors on movies you do not care about seeing just as heavily as errors on great movies, but perhaps what we really care about is minimizing the error when predicting great movies.
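One common way to score only the top of the list (a sketch under my own assumptions, not something defined in the post) is precision at N:

```python
def precision_at_n(recommended, loved, n=10):
    """Fraction of the top-n recommendations the user actually loved.

    `recommended` is ranked best-first; `loved` is the set of items
    the user would rate highly (e.g. 4+ stars).
    """
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in loved)
    return hits / len(top_n)

# The recommender only gets credit for what lands in the top slots;
# its accuracy on movies outside the top n never enters the score.
loved = {"A", "B", "C"}
print(precision_at_n(["A", "X", "B", "Y"], loved, n=4))  # 0.5
```

Unlike RMSE, this metric is completely indifferent to how wrong the predictions are for everything below the cutoff.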

There are parallels here with web search.  Web search engines primarily care about precision (relevant results in the top 10 or top 3).  They only care about recall when someone would notice something they need missing from the results they are likely to see.  Search engines do not care about errors scoring arbitrary documents, just their ability to find the top N documents.

Aggravating matters further, in both recommender systems and web search, people's perception of quality is easily influenced by factors other than the items shown.  People hate slow websites and perceive slowly appearing results to be worse than fast appearing results.  Differences in the information provided about each item (especially missing data or misspellings) can influence perceived quality.  Presentation issues, even color of the links, can change how people focus their attention and which recommendations they see.  People trust recommendations more when the engine can explain why it made them.  People like recommendations that update immediately when new information is available.  Diversity is valued; near duplicates disliked.  New items attract attention, but people tend to judge unfamiliar or unrecognized recommendations harshly. 

In the end, what we want is happy, satisfied users.  Will a recommendation engine that minimizes RMSE make people happy?


Andrei Lopatenko

I believe that the main point of this post is correct: the best RMSE is not equal to the best user satisfaction, but I am not sure that TopN is the only relevant metric for a movie recommendation system. For example, TopN says nothing about diversity (if I LOVE French comedies with Pierre Richard, that does not mean I want to watch only them this week; I want more suggestions across different genres), novelty, etc.
I would expect a good movie recommendation system to be a good 'exploration' interactive system, one that could tell me why I may like a movie and why it is similar to or different from movies I like or dislike.

Eric Schwarzkopf

The movie rating example reminds me of utility theory - I really have to brush up on that but there might be some fitting models of utility that could be used to derive an improved quality measure of recommendations in certain domains.
I think the domain or user-need specificity of the quality measure is key here. I've got different requirements on a news filtering system than a movie recommendation system.
The former should keep me informed while consuming a minimum of my time, and I don't really need an explanation of why something was recommended to me -- except when it's so far off that I have to figure out what corrective action to take.
The latter should assist me in figuring out in which movie to invest time and money, and I'm willing to invest some time up front to make a good decision. Here, diversity in the set of recommended movies and an explanation of the reasons for recommending a movie are welcome.

The account that made this comment no longer exists.

What makes a recommendation system great? In my mind the answer is simple. The best recommendation systems are the ones that engage the user and drive customer loyalty.

Things like RMSE over a test data set given a training set are at best crude proxies for this, and at worst completely miss the mark. Even metrics like click through rate, order size and conversion rate that just consider session-level behavior can be misleading. In my experience they tend to drive you towards recommendations that are not globally optimal in the long term.

The delicate balance is to be reactive to short-term trends in the market, but to do so with an eye towards driving long-term value via deep relationships with your customers.

I have this conversation with richrelevance's customers all the time, and I'm pleased that they share my commitment to building long-lasting relationships with their customers.

Ian Soboroff

Beyond how you interpret RMSE (or whatever metric you decide on), you really do have to consider the user's task and the cost of a bad recommendation.

For a Netflix user, the cost of a bad recommendation is not so great. The risk of that bad recommendation (how bad does the recommendation have to be such that you still rent the movie and it still ruins your evening?) is also not so great.

I have long thought this is a perennial barrier for recommender research -- beyond how commercializable it might or might not be, there's only so far you can get trying to recommend movies. Recommenders are in use in lots of other domains, not all in product or media recommendation, but no research is being done there. Well, not a lot.

Jeremy Pickens

While I agree that users generally want more 5-star movies and fewer 1-star movies, I disagree that this means recommendation is similar to TopN web search. Web search assumes very little interactivity, and once the user has found the one item/link he is looking for, he is done with the search activity.

With recommendations, on the other hand, people are more exploratory- and recall-oriented. I'll bet people don't just have 3 or 10 items in their Netflix queue. We would have to ask Netflix what the average queue length is, but anecdotal evidence places that number in the dozens-to-hundreds range. That's much more recall-oriented than top3 or top10 web search.

Another example is music recommendation, à la Pandora. You seed Pandora with a few songs or artists that you like, and it then sets up a personalized, recommendation-oriented radio station for you, and streams the music to you at a rate of approximately 20 songs per hour. A couple of hours, over a couple of days, puts the number of recommendations in the hundreds. After a few weeks or months of using Pandora, this number moves into the thousands.

So unlike web search, where people want to find the one answer and be done, Pandora's music recommendation is a longer-term, recall-oriented process. And I'll bet people are even more willing to put up with some bad, and even more lukewarm, songs in the mix -- because they're more interested in getting as many good, different, interesting songs (dozens? hundreds?) as possible. Picking the 10 items that someone will love is not the only thing that matters to them. Recall trumps precision.

eric chaves

I think that the 5-star recommendation system is fundamentally flawed as a preference rating system. The five-star system was meant to be a democratic rating system, and should have been used to measure individual preference. Netflix should have posed the challenge to develop a better rating system, not a better algorithm.


Scott Wheeler

Another thing that seems to be often overlooked is how you get users to trust recommendations. When I first started playing with recommendation algorithms I was trying to produce novel results -- things that the user didn't know about and would be interesting to them, rather than using some of the more basic counting algorithms that are used e.g. for Amazon's related products. What I realized pretty quickly is that even I didn't trust the recommendations. They seemed disconnected, even if upon clicking on them I'd realize they were, in fact, interesting and related.

What I came to from that was that in a set of recommendations you usually want to scale them such that you slip in a couple of obvious results to establish trust -- things the user almost certainly knows of, and probably won't click on, but they establish, "Ok, yeah, these are my taste." Then you apply a second ranking scheme and jump to things they don't know about. Once you've established trust of the recommendations they're much more likely to follow up on the more novel ones.
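This two-stage mixing could be sketched roughly like this (entirely my own illustration; the item names, scores, and function are made up for the example):

```python
def mix_for_trust(scored_items, known_items, n_obvious=2, n_total=10):
    """Blend a couple of obvious, recognizable items with novel ones.

    scored_items: list of (item, score) pairs, ranked best-first.
    known_items: items the user almost certainly already knows.
    """
    obvious = [i for i, _ in scored_items if i in known_items][:n_obvious]
    novel = [i for i, _ in scored_items if i not in known_items]
    return obvious + novel[: n_total - len(obvious)]

ranked = [("Star Wars", 0.9), ("Obscure Gem", 0.8),
          ("The Matrix", 0.7), ("Deep Cut", 0.6)]
print(mix_for_trust(ranked, known_items={"Star Wars", "The Matrix"}, n_total=3))
# ['Star Wars', 'The Matrix', 'Obscure Gem']
```

The familiar items lead the list to establish trust, even though the user is unlikely to click them; the novel items carry the actual discovery value.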

This differs somewhat from search where the catch phrase is "authoritative sources" (stemming back to Kleinberg's seminal paper on graph-based search) -- you want to hit the right mixes of novelty and identity, rather than just finding high degrees of correlation.

Phoebe Spanier

Perhaps for the best of both worlds, focusing on improving both search and recommendations (precision and recall) to offer people the two options for discovering media is the way to go.

Aleks Jakulin

I've posted on this topic at

RMSE doesn't reward a system that's aware of its own uncertainty, and distinguishing between mediocrity and controversy does require a model of uncertainty.
