Top 100 Rikishi Ever - Sumo Elo - A Preliminary High Level Look and Discussion
Breaking Down Some Data from 1958-Present
I’ll start this off with a question: who is the greatest sumo wrestler ever? I’m guessing you answered Hakuho. No matter your preferred metric, he’s likely at the top. If you want lots of championships then that’s him; he has the most. If you care the most about peak performance, then again, with all his Zensho Yushos (going 15-0 in a tournament and winning it) the answer remains Hakuho. As a boxing fan, I think another test of greatness is how you match up against past greats. Call it the Muhammad Ali test (or your preferred weight class's greats). Again, on the head-to-head metric, you probably chose Hakuho - with his size and power and technique, it’s hard to imagine anyone consistently dominating him. He did great against the Yokozuna he faced in his day too, so it’s not idle speculation.
Over the past few months, a project I’ve been working on is generating Elo scores for all wrestlers in the modern era (1958-present). If you’ve heard of Elo scores, it’s likely through their use in chess. It’s a formula pioneered and implemented by Arpad Elo (yes! It’s in fact a name, not an acronym). It’s all summed up here and I leveraged the formula included to implement my version.
So that question up above, figuring out who the greatest is, is actually pretty easy. Even if you didn’t name Hakuho, you likely had someone in mind. Here’s a different question: who is the 481st best wrestler since 1958? It might take you a bit longer. The nice thing about Elo is it lets us answer that. Every match, two wrestlers go against each other. Whoever wins will have their Elo score go up some, and whoever loses will have their Elo score go down the same amount. This is nice, because it lets us take every match and calculate the Elo score for each wrestler, every day of every tournament of every year. I did just that. I built up a dataset and it ended up being 1,515,742 individual Elo scores for 6,849 unique wrestlers. It might not be perfect, but spot checking it, and sense checking it, I’m pretty happy with these preliminary results. Hakuho was at the top, of course. 481st was Daiseizan, a Chinese wrestler at Makushita 13 E as of Haru ‘24 if you use my calculations.[1] It’s more for fun and I’ll look to improve it, but I thought the prelim results would be neat to share and it’s a nice way to think about posing a question analytically.
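A minimal sketch of the zero-sum update described above, assuming the standard Elo expected-score formula plus the starting rating of 1000 and K factor of 20 mentioned in the appendix. The function names and the three made-up bouts are illustrative only; this is not necessarily the exact code behind the numbers in this post.

```python
from collections import defaultdict

K = 20          # K factor mentioned in the appendix
START = 1000.0  # starting rating mentioned in the appendix


def expected_score(rating, opponent):
    """Standard Elo win probability for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def run_elo(matches):
    """Apply the Elo update to (winner, loser) pairs in chronological order."""
    ratings = defaultdict(lambda: START)
    for winner, loser in matches:
        gain = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += gain  # winner goes up ...
        ratings[loser] -= gain   # ... and the loser goes down by the same amount
    return dict(ratings)


# Made-up three-bout example, not real results.
bouts = [("A", "B"), ("A", "C"), ("B", "C")]
ratings = run_elo(bouts)
print(ratings)
```

Because every gain is matched by an equal loss, the total of all ratings stays fixed at 1000 per wrestler no matter how many bouts are processed.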
In fact, an Elo score actually has a lot of interesting properties and assumptions baked in but for now the important part to keep in mind is this: it should provide a rough ranking, and that ranking can be used to predict the outcome of future matches. I’ll quote two key points from the above link:
“We assume that every competitor plays at some average level, with some deviations - sometimes they play brilliantly, while sometimes they are below their expectations. In general, we can describe a player's overall performance with a bell curve (or normal distribution) centered at his/her current rating. This graph shows that an individual's performance in a game corresponds to a random variable from this distribution. For example, a player with an 1800 rating "plays like a 1500" on a bad day, while on a good day, he can perform as if he was 2200. Usually, he plays around his typical 1800-level.”
What this means is that you’re not a fixed skill level; just as you can have a good day or a bad day at work, a sumo wrestler can have a good day on the clay or a bad one. So when you see that a wrestler has a rating of, for instance, 1500, just think: that’s what he should be on an average day, but he could wrestle better or worse on any given day. That’s the takeaway.
“Let's begin with the mysterious 400 in the Elo rating change formula. The scale is built so that a player with a 400 point advantage over their opponent is ten times more likely to win than to lose. A player with a higher rating of 800 points is 100 times more likely to win, and so on.”
The important thing here is less about quantifying the advantage or the exact numbers and more that they should imply an advantage for the higher ranked opponent. Like I said, a good future post would be to test whether this is true or not, but at least the Elo ranking system is built with that in mind.
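To make the 400-point scale concrete, here is a small sketch that turns a rating gap into win-to-loss odds and a win probability. It assumes the standard base-10 Elo scale from the quote; the function names are made up for illustration.

```python
def odds_for_gap(gap):
    """Win:loss odds implied by a rating advantage of `gap` points."""
    return 10 ** (gap / 400)


def win_probability(gap):
    """Convert those odds into the probability of winning a single bout."""
    odds = odds_for_gap(gap)
    return odds / (1.0 + odds)


for gap in (0, 100, 400, 800):
    print(gap, round(odds_for_gap(gap), 2), round(win_probability(gap), 3))
```

Note that "ten times more likely to win than to lose" is odds of 10 to 1, which works out to a win probability of 10/11, not 100%.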
I included an Appendix at the bottom that gets a bit more into the weeds and talks about more statistics related questions and methodology.
Without further ado, let’s get the Top 100 Wrestlers by Peak Elo Ranks!
Because this is calculated off every single match since 1958 I can get as granular as the day they achieved their highest ranking, hence the Year-Month-Day column. It’s exactly what it sounds like, with Day being the ith day of the tournament that month. I.e. Hakuho was at his greatest Elo score on the 15th day of the July 2021 tournament. He was ranked Yokozuna 1 East at that time.
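As a rough sketch of how a peak list like this can be pulled out of per-day ratings: keep the single highest-rated row for each wrestler, then sort those rows. The `DailyRating` layout and the sample values are hypothetical; the real dataset's columns may look different.

```python
from typing import NamedTuple


class DailyRating(NamedTuple):
    wrestler: str
    year: int
    month: int
    day: int    # day of the tournament, 1-15
    elo: float


def peak_ratings(rows):
    """Return each wrestler's highest-rated row, best peak first."""
    best = {}
    for row in rows:
        current = best.get(row.wrestler)
        if current is None or row.elo > current.elo:
            best[row.wrestler] = row
    return sorted(best.values(), key=lambda r: r.elo, reverse=True)


# Illustrative rows with invented ratings, not values from the dataset.
rows = [
    DailyRating("X", 2020, 7, 14, 1510.0),
    DailyRating("X", 2020, 7, 15, 1522.5),
    DailyRating("Y", 2021, 1, 3, 1498.0),
    DailyRating("Y", 2021, 3, 9, 1475.0),
]
for rank, row in enumerate(peak_ratings(rows), start=1):
    print(rank, row.wrestler, f"{row.year}-{row.month:02d}-{row.day:02d}", row.elo)
```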
I can’t find the exact quote but I believe Bill James, the baseball statistician, once said something along the lines of: “a statistic should be 80% what we expected, and 20% stuff we didn’t know before”. The basic idea is that a statistic or ranking or whatever should pass the smell test - if Hakuho wasn’t at the top or at least top 3, then that means I’m probably doing something wrong. But also, it should ideally be giving us some new info.
I do think that as of now one issue is that this becomes a bit of a counting stat. It seems fairly obvious that Hakuho wasn’t at his peak right before he retired, but basically, unless you’re competing while injured and racking up losses, then you’ll be able to preserve your career Elo pretty well. I also need to investigate some inflation at the top end but since it’s Yokozuna at the top and this is preliminary (I now have the datasets saved locally so will be able to re-run calculations much easier) I’m just caveating it for now, and will return if I find anything systemic. Check out the appendix if you want more discussion of this nature.
As for verifying that my rankings are ‘correct,’ I’ll be completely honest, I looked around for “top sumo wrestler ever” and similar queries on Google and was unable to find any good lists I could use as a comparison point. I did ask ChatGPT and it actually returned 8/10 of the wrestlers in my top 10, with the differences being Futabayama (prior to my dataset, or out of sample) and Wakanohana. Mine had Kitanoumi and Wajima instead. They have 24 and 14 Yusho respectively so I’m pretty happy with those two being in my list. And that 8/10 is pure coincidence after I had already typed the Bill James bit above, but it’s a nice coincidence that hopefully makes this look more rigorous.
Other statistics of interest:
Average Elo: 1079
Standard Deviation: 498
And a graph of all Elo (cutoff at 5000 as it’s just the top 10 Yokozuna and days they were above 5k) in a histogram (not just peak)!
Appendix
If you’re reading this far, hopefully you enjoyed! Here I’ll have a little more nuanced discussion. I don’t think anything above is wrong per se, but I also think if I was doing an academic lecture on this, I’d probably use a lot more hedging words to make clear that what we’re doing is naturally constrained.
How did I deal with injuries? Good question! If there wasn’t a loss recorded, then there was no match to calculate Elo for, and so the rating wouldn’t go down. I actually realized after I ran all this that I didn’t exclude matches with a fusen win (a win by default when the opponent withdraws). That kind of stinks, because it penalizes a wrestler for pulling out of the tournament at the wrong time. Luckily it only affected 3,643 out of 814,696 matches, or about .0045 (in finance speak that’s roughly 45 bps). Not too worried about it for preliminary results at least.
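A quick sketch of the fusen arithmetic, plus one way such bouts could be dropped before running Elo. The counts are the ones quoted above; the `kimarite` field and its `"fusen"` value are assumptions about how a scraped match record might be labeled, not the actual schema.

```python
FUSEN_MATCHES = 3643
TOTAL_MATCHES = 814_696

share = FUSEN_MATCHES / TOTAL_MATCHES
print(f"{share:.4%} of matches, about {share * 10_000:.0f} bps")


def drop_fusen(matches):
    """Keep only bouts that were actually fought.

    Assumes each match is a dict with a hypothetical "kimarite" field that
    reads "fusen" for a default win; the real scraped column may differ.
    """
    return [m for m in matches if m["kimarite"] != "fusen"]


# Invented sample records for illustration only.
sample = [
    {"winner": "A", "loser": "B", "kimarite": "yorikiri"},
    {"winner": "C", "loser": "D", "kimarite": "fusen"},
]
print(drop_fusen(sample))
```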
One hypothetical: what if you for some reason wanted to maximize your Elo. You’re a good wrestler but not good enough to be an Ozeki or Yokozuna. Could you just work your way all the way up from Jonokuchi (lower division) to Makuuchi (top division), hit the Sanyaku/Joi and then just sit out so that you’ll be demoted and do it all over again? That way you would never lose points (from losing) and just keep gaining them (by beating everyone except the Yokozuna and Ozeki on your way to the top)?
Not really. If you’re in Makuuchi and go down to Jonokuchi because of injury or suspension, then the way the formula works, you don’t gain many points for beating wrestlers rated far below you.
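A small sketch of why that is, assuming the standard Elo update with K = 20. The 2000 rating and the opponent ratings are hypothetical numbers picked to show the trend, not values from the dataset.

```python
K = 20  # K factor used in this post


def gain_for_win(own, opponent, k=K):
    """Points the winner gains under the standard Elo update."""
    expected = 1.0 / (1.0 + 10 ** ((opponent - own) / 400))
    return k * (1.0 - expected)


# Hypothetical: a returning top-division wrestler rated 2000
# meeting opponents further and further below him.
for opponent in (2000, 1800, 1600, 1200, 1000):
    print(opponent, round(gain_for_win(2000, opponent), 2))
```

The gain shrinks quickly as the gap widens: a win over an equal is worth half of K, while a win over someone 1000 points lower is worth a small fraction of a point.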
In practice? Eh it actually does work out pretty well. Abi and Asanoyama are two guys that benefitted from it with their covid suspensions. Back to the hypothetical, let’s say you’re Asanoyama and for some reason decide maximizing your career Elo was your dream instead of Yokozuna, then I think he could game the system by sitting out every other tournament and staying in the mid-Maegashira and racking up wins.
What I’m getting at is, I do think it became a bit of a rating of longevity as much as a rank of peaks, certainly on the higher end. So I actually have some preliminary thoughts and the start of a model about how else I might quantify peaks when we’re talking about Yokozuna or top end wrestlers, but I’ll talk about that another time.
For the Elo rating discussion, I would recommend checking out the Wikipedia article if stuff like this interests you, but if not I’ll sum up. For instance, there is some suggestion that performance in chess does not follow a normal distribution. I ran the Shapiro-Wilk test to see if these Elo ratings above were normally distributed and they are not.
I initiated everyone at a rating of 1000. It’s honestly relatively arbitrary. I’ll probably look to see if there are any transformations that make sense to scale everyone down, but that’s another post.
I used a K factor of 20. Again, it’s arbitrary. It basically acts as the brake on how much your rating can move in a single match. The most a match can change your Elo is thus 20.
In practice I believe chess federations will give you a higher K factor like 40 for younger or newer players, and at the higher levels of chess, your grandmasters, the K factor is 10.
I don’t think the above two make a big difference.
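To make the K factor concrete, here is a sketch of the same hypothetical upset scored under K = 10, 20 and 40, assuming the standard Elo update. The ratings of 1000 and 1400 are invented for illustration.

```python
def rating_change(own, opponent, won, k):
    """Elo change for one bout: k * (actual score - expected score)."""
    expected = 1.0 / (1.0 + 10 ** ((opponent - own) / 400))
    return k * ((1.0 if won else 0.0) - expected)


# Hypothetical upset: a 1000-rated wrestler beats a 1400-rated one.
for k in (10, 20, 40):
    change = rating_change(1000, 1400, won=True, k=k)
    print(k, round(change, 2))
    assert abs(change) < k  # one bout never moves a rating by as much as K
```

K only scales the size of each step, so changing it stretches or compresses how fast ratings move rather than changing who gains and who loses.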
The web scraping I did to get this was pretty finicky, which means if anything I’m more confident in the accuracy of the underlying data - minor details could break it - so the fact it worked itself is evidence of the data being accurate from sumoDB. Furthermore, I looked and there didn’t seem to be any weird variables wherever I checked. In that sense I feel pretty confident in it.
The only other thing I’m worried about is that runaway Elo rating for some of the top Yokozuna. It looked like, around 3,000, some of the ratings started getting inflated. I’m still doing some data sleuthing on the calculations but I’m pretty confident Hakuho will still be the top dog. More seriously, if I have any updates I’ll correct this post and update. It’s more a fun exercise and way to think about some analysis anyways. Hopefully you enjoyed.
[1] I’m being a little glib here. I mean that is correct, caveated that it’s with a dataset that’s up to Hatsu ‘24 - so no Haru ‘24 where he’d likely be higher but I haven’t run the numbers yet. But yeah that is the calculated 481st peak - it was on day 12 and he was at 1,407. I’ll look to see if I can improve on the calculations and host it somewhere if there’s interest in that.
This is interesting analysis to me, as I've done some similar work to assist my "Fantasy Sumo League" play. I used a slightly smaller dataset (from 2000 to the present, which was enough to cover all current top-division men's full careers) and the Glicko 2 algorithm (which is inspired by Elo's ratings, but it's a bit more fancy since it doesn't need to be calculated by hand on paper the way Elo ratings did when they were first introduced for chess).
One thing Glicko 2 does that's better than Elo is that it can take absences from play into account, not just assuming that your base skill level remains constant forever if you're not competing. It does that by having an "uncertainty" rating for each player's score, which grows as you take more time off. A conservative rating then subtracts (some multiple of) the uncertainty from the base score before comparing players. This means that somebody like Asanoyama or Abi who falls way down the divisions during a suspension gets a slightly lower conservative rating after their absence, and the confidence of their rating only stabilizes after they've done a fair amount of climbing back up the ranks.
I'll see if I can get a larger set of sumo data and run my algorithm over the whole modern 6 basho era and get lifetime best scores to compare to yours. I suspect they'll be very similar! I know I verified that Hakuho had the best ever score in my dataset around the time he retired. That was, as you said, a smell check to make sure my code's output made some degree of sense. I'll need to write a bit of extra reporting code to extract a similar list of the top scoring wrestlers of all time, but it shouldn't be very hard, once I have gathered the data.