As you probably know, at the Neos conference there's a Best Talk Award: the best talk of the conference gets a prize. At the end, the speaker is asked to come on stage and receives a nice trophy.
But how can we determine the best talk? Is it even possible? What is good, what is bad, and what is best lies in the eye of each and every one of us. So if a task is impossible to solve, just change the task.
So let's rephrase: what is the talk that was most enjoyable for the audience? Well, it's easier to answer but still very hard because I cannot read minds and so I don't know what everybody thinks.
Okay, let's change the task again: which talk was rated best? So at the Neos conference every attendee can vote for a talk with one to five stars, indicating how much she/he enjoyed listening to it. This should be easy, right? The talk with the most stars on average wins, right?
Turns out it's not that easy. If one talk has a single five-star vote, another has two five-star votes, and a third has one hundred five-star votes, then the average is the same for all of them. But, from a gut feeling, I would say one hundred five-star votes mean something different than a single five-star vote.
Okay, so let's rephrase the task again: based on the votes of the audience, which talk is probably the most enjoyable? Given the evidence we have, which talk is most likely to be the one the audience enjoyed most?
Welcome to statistics!
Sebastian during his talk at the Neos Conference
The main idea
We determine the winning talk by computing an individual score for every single talk; the talk with the highest score wins. So far so easy. The algorithm we use for computing this score is quite nice, because it takes into account both the quality and the quantity of the votes.
A short disclaimer: I'm a computer scientist. I learned a fair share of statistics, but I'm not an expert by any means. I implemented an existing algorithm and, out of curiosity, tried to make sense of it later.
The names for the algorithm I found most often were Additive Smoothing and Bayesian Estimate. What do they mean?
In a nutshell, we start with no knowledge about the talk we want to score. So if we know nothing about the talks, we just have to assume they have all been of equal enjoyability, if that's a word, of equal score. The more votes we have for a specific talk, the more confidently we can say: well, this talk is different, this talk stands out. We have a lot of evidence to believe that the score of this talk is different. This is the main idea.
So the algorithm works as follows: we start by assuming that every talk is like any other. We just grant each talk the average stars for the overall conference. The more votes a particular talk gets, the more we take those individual votes into account.
Let's say a talk has 10 votes; then those 10 votes are merged with the conference average. But they count less than for a talk which has 100 votes, because 100 votes make us more confident in saying: okay, those votes mean something. So 100 votes deserve more attention than 10 votes.
A different way to imagine the algorithm: every talk starts at the baseline, the average (mean) number of stars for the entire conference. For each vote, we take a small step in the corresponding direction. The more votes the talk has, the more steps we take and the further we can move away from the baseline.
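To make this "small steps" picture concrete, here is a minimal Python sketch of an incremental update that starts at the baseline and nudges the score towards each incoming vote. The names and numbers (a baseline of 4.0 stars and a weight of 20) are made up for illustration; the constant m is explained with the formula in the next section, and this step-by-step version produces the same result as that formula.

```python
def score_by_steps(votes, conference_mean, m):
    # Every talk starts at the conference baseline.
    score = conference_mean
    for n, vote in enumerate(votes, start=1):
        # Each vote pulls the score a little towards itself;
        # the step size 1 / (n + m) shrinks as more votes arrive.
        score += (vote - score) / (n + m)
    return score

print(score_by_steps([5] * 10, conference_mean=4.0, m=20))   # ~4.33 – ten 5-star votes
print(score_by_steps([5] * 100, conference_mean=4.0, m=20))  # ~4.83 – one hundred 5-star votes
```

A talk with more votes takes more steps and ends up further from the baseline, which is exactly the behaviour described above.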
The Formula
So let's have a look at the formula.
score = (v / (v+m)) * R + (m / (v+m)) * C
where:
R = average stars for the talk (mean)
v = number of votes for the talk
m = minimum votes required
C = the mean stars across all talks
You see, the overall score is composed of two parts. C is the baseline, the average stars over all talks for the entire conference, and R is the average for the current talk. Each gets a weight attached, and those two weights add up to 1. So the overall score, like the votes, lies between 1 and 5.
And as said, as the number of individual votes increases, the coefficient of R gets larger and the one of C gets smaller, which is exactly what I tried to explain before. So now we can play around with those numbers.
We have four variables we can adjust: We can adjust C, the conference average. We can adjust R, the mean stars for the current talk. We can adjust v, the number of votes for the talk. And we can adjust m, the number of votes we want to see before saying: okay, now the individual votes of the talk become relevant.
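Here is a minimal sketch of the formula in Python, so you can experiment with the four variables yourself. The concrete numbers (C = 4.0, m = 20) are assumptions for illustration, not the actual values used at the conference.

```python
def weighted_score(R, v, C, m):
    """score = (v / (v + m)) * R + (m / (v + m)) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Three talks, all rated 5 stars on average, but with different numbers of votes:
print(weighted_score(R=5.0, v=1, C=4.0, m=20))    # ~4.05 – a single vote barely moves the score
print(weighted_score(R=5.0, v=10, C=4.0, m=20))   # ~4.33
print(weighted_score(R=5.0, v=100, C=4.0, m=20))  # ~4.83 – many votes pull the score close to R
```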
Visualization
(Interactive visualization of the scoring formula)
On the y-axis you see the total score of a talk. The x-axis shows the number of votes v for a talk. There is one line each for a talk with an average R of one, two, three, four, and five stars.
You can see and move the conference average C as a pink point on the y-axis. The movable pink point on the x-axis is m, the weight of the conference average relative to the individual votes.
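If you want to reproduce the plot yourself, a static version can be sketched with matplotlib. The values for C and m below are assumptions; in the interactive version you would drag the two pink points instead.

```python
import numpy as np
import matplotlib.pyplot as plt

C, m = 4.0, 20            # conference average and its weight (the two movable pink points)
v = np.arange(0, 201)     # number of votes for the talk

for R in range(1, 6):     # one line per talk average of 1 to 5 stars
    score = (v / (v + m)) * R + (m / (v + m)) * C
    plt.plot(v, score, label=f"R = {R} stars")

plt.axhline(C, linestyle="--", color="grey")   # the baseline C
plt.axvline(m, linestyle=":", color="grey")    # at v = m both weights are equal
plt.xlabel("number of votes v")
plt.ylabel("score")
plt.legend()
plt.show()
```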
Conclusion
I personally find this formula quite fascinating, not because of its complexity or simplicity, but because I was really stunned when I evaluated it. The most important thing about this scoring is that it must be fair. But what is fairness? How do you compute fairness? It's a gut feeling.
This is why we rejected the plain average: we had the gut feeling that there's more to it. And when we re-implemented this formula for the last Neos conference, we again evaluated whether those scores feel fair or unfair. And they felt right for a variety of test data. It's a very, very elegant way of introducing the notion that one hundred 5-star votes mean more than a single 5-star vote does.
Where the formula gets more controversial, where everybody seems to have a different opinion, is the parameter m. What about 4-star votes? Are five 4-star votes better than one 5-star vote? What about six? Or twenty?
Internally, we have discussed it a lot. And most importantly, we decided on m up front, before the first vote for the first talk was cast. To protect against biases of, for example, myself, we decided against fine-tuning the algorithm after the voting. Instead, m is determined by the average number of votes per talk.
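Here is a small sketch of that rule with completely hypothetical vote data: m is taken as the average number of votes per talk, and C as the average over all votes cast (my reading of "the mean stars across all talks").

```python
votes_per_talk = {
    "Talk A": [5],                      # one enthusiastic voter
    "Talk B": [5, 5, 4, 5, 4, 5, 5],    # a handful of good votes
    "Talk C": [4] * 40 + [5] * 30,      # many votes, slightly lower average
}

total_votes = sum(len(v) for v in votes_per_talk.values())
C = sum(sum(v) for v in votes_per_talk.values()) / total_votes  # conference average
m = total_votes / len(votes_per_talk)                            # average number of votes per talk

def weighted_score(votes):
    R, v = sum(votes) / len(votes), len(votes)
    return (v / (v + m)) * R + (m / (v + m)) * C

# Rank the talks by their weighted score.
for talk in sorted(votes_per_talk, key=lambda t: weighted_score(votes_per_talk[t]), reverse=True):
    print(f"{talk}: {weighted_score(votes_per_talk[talk]):.2f}")
```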
Bayes' theorem
Starting with what we know and adjusting our knowledge according to the available evidence: this reminds me strongly of Bayes' theorem. I am not sure if and how Additive Smoothing and Bayes' theorem relate to each other. If you know, please give me a hint. The following video by 3blue1brown provides an awesome explanation of Bayes' theorem.
Further Reading
While implementing and understanding the algorithm, the following resources helped me:
Thanks for reading. If you run any star-based ranking system, keep this approach in mind.
Feel free to play around with the parameters in the visualization, and as always: if there are any comments, feedback, questions, feel free to contact us.