Since May 2016 I’ve been working on what’s become an International Ranking System for the highest level of TF2 6v6 teams. I’ll say straight away that this is not meant to be the be-all and end-all of TF2 ranking systems, and some of its judgements show this. It’s entirely logs-based, and you don’t need me to tell you that there’s an awful lot more to being good at TF2 than just producing big numbers on logs.tf. However, I do still feel that the data maintains a good level of accuracy that at least makes for interesting reading, and it did correctly predict the finishing order of both i58 and ESA Rewind.

How it works

I’ve been keeping a private record of top-level TF2 matches for almost a year. Each entry logs which players were playing each role (including pocket and flank scout), what their overall DPM was (except for medics), and what their overall KA:D was. For medics, I log Healing per Minute rather than DPM. Why HPM and not ubers? Because of the Vaccinator. I also log other stuff, but it’s these stats that matter for the Rankings. Because of the international scope of this system, any data I do track must, of course, be available from both logs.tf and the ESEA logs.

If a player, the Pocket for example, has a higher DPM and higher KA:D than his counterpart on the other team, he gets ‘gilded’. Whenever a player makes an appearance in one of these record entries, their presence and whether or not they were gilded are recorded. Each player in the rankings therefore has two numbers next to their name: their number of appearances, and the number of times they’ve been gilded. To keep the Rankings current, these two numbers are based on only the last 300 matches in the Records. This covers a 7-ish month period, enough to span roughly two seasons for each region. To give an example, in the past 300 matches (at the time of writing) Saam has 25 entries and was gilded 4 times, also written as 4/25.
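If you prefer to see the rule as code, here’s a minimal sketch in Python (my own illustration; the real thing is a formatting formula in a spreadsheet, and the numbers below are invented):

```python
def is_gilded(dpm: float, kad: float, counterpart_dpm: float, counterpart_kad: float) -> bool:
    """A player gets gilded when they beat their counterpart on the other team
    in BOTH stats: DPM (with HPM standing in for medics) and KA:D."""
    return dpm > counterpart_dpm and kad > counterpart_kad

# Invented numbers: a pocket out-damaging and out-KA:D-ing his opposite number.
print(is_gilded(dpm=285.0, kad=1.9, counterpart_dpm=240.0, counterpart_kad=1.4))  # True
```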

These two figures per player in the Rankings then go through a bit of processing to produce two more figures for each player. These are what I call the ‘hit-rate’ and the ‘mileage’. The hit-rate is very simple – the percentage of appearances in which the player was gilded. For Saam, this is 16%. The mileage is the number of entries and number of gildings added together, which for Saam is 29.
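In sketch form, with Saam’s 4/25 as the worked example (again just an illustration of the arithmetic, not the spreadsheet itself):

```python
def hit_rate(gildings: int, appearances: int) -> float:
    """Percentage of appearances in which the player was gilded."""
    return 100.0 * gildings / appearances if appearances else 0.0

def mileage(gildings: int, appearances: int) -> int:
    """Number of entries and number of gildings added together."""
    return gildings + appearances

print(hit_rate(4, 25))  # 16.0 (%)
print(mileage(4, 25))   # 29
```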

Each player’s hit-rate and mileage are now compared to everyone else’s hit-rate and mileage to determine their score – the metric by which they are all ranked. The score is two numbers added together: it basically counts how many players in the Rankings have a worse hit-rate than you, and adds that to the number of people who have a worse mileage than you. If you were on a list of 100 people and you had the 10th-best hit-rate and the 50th-best mileage, your score would be 140.

This applies to every player in the Rankings, except for one small divergence regarding people who are 0/1 (one appearance, never gilded). Because these Rankings only cover the last 300 matches on record, there are a bunch of (unlisted) players at the bottom who are now 0/0, and rightly have a score of 0. However, these people aren’t included when determining people’s scores. This means that people who are 0/1 would normally also get a score of 0 because there’s ‘nobody’ worse than them. 0/1 is obviously better than 0/0, so people who are 0/1 have a special score that’s halfway between 0 and whatever 0/2’s score is.
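Pulling the whole scoring rule together, a rough Python sketch might look like the following. It treats each player as a (gildings, appearances) pair; the real spreadsheet may handle ties slightly differently, so take this as an approximation:

```python
def hit_rate(g, a): return 100.0 * g / a if a else 0.0  # % of appearances gilded
def mileage(g, a): return g + a                          # entries plus gildings

def score(player, listed):
    """player and every entry of 'listed' are (gildings, appearances) pairs.
    'listed' excludes the unlisted 0/0 players, who never count towards scores."""
    g, a = player
    worse_hit = sum(1 for p in listed if hit_rate(*p) < hit_rate(g, a))
    worse_mileage = sum(1 for p in listed if mileage(*p) < mileage(g, a))
    return worse_hit + worse_mileage  # e.g. 90 + 50 = 140 in the example above

def final_score(player, listed):
    # The special case: a 0/1 player gets a score halfway between 0
    # and whatever a 0/2 player would score.
    if player == (0, 1):
        return score((0, 2), listed) / 2
    return score(player, listed)
```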

You can probably imagine how these player rankings get turned into team rankings – the six team members have their scores averaged out to determine the team score. Some of the Asian teams and one or two of the Aussie teams seem to turn up to every match with a different roster, and in these cases I’ve generally determined their six players to be the ones most often seen on each role. Only top-level (i.e. Prem/Invite/equivalent) teams are included in the team rankings simply because it would be a daunting task to keep track of the rosters of the many lower-level teams.
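In other words, the team score is nothing fancier than this (the six player scores here are invented):

```python
def team_score(player_scores):
    """Average the six players' individual scores to get the team score."""
    return sum(player_scores) / len(player_scores)

print(team_score([140, 120, 95, 180, 60, 150]))  # 124.17 or so
```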

Interplay between regions

This system ranks players from all four regions together in one single list. You might be wondering how I factor in the skill differential between the four different regions. Early on I tried a variety of techniques to do this, but in the end I realised that this system actually polices itself in this regard to a good extent. Most would say that of the four, AsiaFortress is the scene with the lowest skill standard. It also just so happens to be the region that features the fewest matches. The Asian scene doesn’t really appear to feature LANs or secondary tournaments, and its top division featured only 7 teams in Season 11 and 4 in Season 10, rather than 8. This basically puts a natural cap on Asian player mileage, restraining them in the Rankings even if their hit-rate is north of 80%.

Ozfortress, meanwhile, has a healthier Prem division of 8 teams, but unlike Europe there aren’t many secondary tournaments or LANs, again restricting the heights Aussies can reach by limiting their potential mileage.

I think many would say that North America and Europe are at least somewhat equal in quality, and it just so happens that this, too, is reflected in the level of activity in each region. ETF2L has pre-season Premiership playoffs and secondary tournaments, and European LANs come around every once in a while. North America is generally a bit less rich in secondary activity; however, this is accounted for by all the ESEA-I teams playing each other more than once in the group stage, unlike in ETF2L. This leaves these two regions with a higher mileage ceiling than the other two.

With all that in mind, I’ve not imposed any artificial restraints on the heights that AsiaFortress or Ozfortress players can reach. All the numbers are pure and unmodified.

What matches count towards the rankings?

I started out making records for every top-level ETF2L, ESEA, Ozfortress, and AsiaFortress official, plus basically anything else that got a TFTV stream. Nowadays, I use a much more mature system of guidelines to determine whether or not a match should have a record entry. These are the matches that get recorded:

  • Any official ETF2L Prem, ESEA Invite, OzFortress Prem, or AsiaFortress Div-1 match, be they in the group stage or playoffs.
  • The grand final of the second division down in these same leagues (i.e. ETF2L High, ESEA-IM, etc).
  • Any ETF2L Pre-Season Premiership Playoffs match that results in a team being promoted to Prem.
  • Every match, group stage and playoffs, of the invite tournament of a major international LAN like Insomnia or ESA Rewind, featuring only top-level teams.
  • The playoff matches, but not the group stage, of smaller-scale LANs featuring more than two top-level teams (such as Dreamhack Winter Battle for the North).
  • The grand final of LANs featuring perhaps just one or two top-level teams, possibly including the upcoming Gamers Assembly LAN.
  • The playoff matches, but not the group stage, of secondary tournaments that include many top-level players (such as the ETF2L 6v6 Nations Cup or TFTV New Map Cup).
  • The grand final of any other secondary tournaments that only feature a handful of top-level teams/players (such as some of the FACEIT tourneys that happened a little while ago).
  • Any showmatch featuring top-level teams.

Matches that meet these criteria may still be excluded from the records if they’re plainly unrepresentative (e.g. a 5v6, half the players are pyros and spies the whole time, etc).

I’ll reiterate that this is not a perfect system. For example, Stark, widely regarded as one of the greatest players ever to grace TF2, peaked at a mere 14th-best in the world by these metrics at the time of writing.

With all that said, the current standings can be viewed in full via Dropbox here.

The blog

I plan on updating this blog regularly, though on no fixed schedule. If something comes up that I want to talk about, then I’ll talk about it. I hope not to go more than a week without a post, and there may be times when updates come out quite rapidly. At the very least I should provide a weekly update with commentary on any significant changes in the rankings, at least while the scene is active. Other than that, I’m also likely to post:

  • Special reports about how specific matches have influenced the rankings.
  • Analysis of the top-ranked players within their region/class.
  • General discussion on the accuracy of certain players’ or teams’ placements within the rankings.
  • Analysis of how and why individual players or teams are ranked the way they are, and how their position has changed over time.
  • Speculation about how newly-formed teams will perform based on these rankings.
  • Toying about by creating completely fictitious fantasy teams.
  • Comparisons between similarly ranked teams.
  • Comparisons between actual match results and those predicted by the system.
  • Explanations for the inevitable outliers in the system.

Why this system works the way it does

The purest and most reliable way to rank TF2 teams is by their match results. No metric paints a better picture of a team’s relative potency than their record within their regional league. In ETF2L Season 26, Lemmings scored more points in the group stage than Nunya because they won more matches. It’s plain that Lemmings were the superior team. Yet, in the rankings used here, Lemmings were ranked below Nunya despite the unarguable fact that it was the former that performed better in the season. This is a symptom of the imperfections of the TF2Metrics system, in part because at its core it was designed to rank individual players first and foremost.

It’s also clear that some of its player rankings are misguided. Dr. Phil, one of Europe’s top demomen of recent seasons, barely broke into the world’s top 80 by this metric. A big part of this is just a limitation of stats from logs. One of the first things I established above is that there’s an awful lot more to being good at TF2 than being a producer of big numbers. Stark, Freestate, and Dr. Phil are examples of this – the numbers can’t tell the whole story for individuals like these. That’s why the eye of an expert will always be deserving of more trust than a broad-brush number-cruncher like this. For players like these, match results do them better justice – Stark and Freestate both merit(ed) a position on their region’s greatest team, and Dr. Phil’s team, LEGO, was for a good while one of Crowns’ toughest playmates.

But the truths of pure match results have limitations of their own. Uubers and Muuki are on the same team and have therefore shared the same results as of late. The same applied to Sorex and Cold Heart on Lemmings. But does an identical results sheet indicate parity in raw skill? Sometimes it can, but often not. Most would agree that Uubers is a more potent TF2 player than Muuki is, and that Sorex is similarly above Cold Heart, at least at the moment. That’s the purpose of the TF2Metrics system – to provide a means by which players can stand out not only from their opponents but from their teammates as well.

I’ll say it again: just because this system sees someone as superior to somebody else, because their numbers were a bit bigger, doesn’t necessarily mean it’s true. There have been times when, due to a fortunate string of good logs stats, certain players have been placed higher in the rankings than most of us would believe to be correct. On the other hand, it takes time for players, even really good ones, to climb up to the sharp end of the rankings. When Kaidus started playing again with SE7EN, it took him until well into the second half of the regular season before he amassed enough gildings to even break into the world’s top 100 players.

By and large, though, the right players tend to stick out. Despite being on the same team and winning and losing all the same matches together, Uubers is currently ranked considerably ahead of Muuki. A system based only on match results would see these two as being equals.

It’s similar with Sorex and Cold Heart. I think most would agree that Sorex is a world-class scout and he holds, probably quite deservedly, the rank of 21st-best in the world by this metric at the time of writing – in the company of Yomps, Puoskari, Yui, and Elmo. Cold Heart, despite sharing all the same match results as him this season, actually never got gilded (which is probably rather harsh) and so is ranked low. Again, looking at the season’s match results alone would expose no skill disparity within their team.

This can of course be alleviated to a degree by expanding the scope within which the data is gathered. Cover five or six seasons rather than one or two and suddenly Sorex has miles more Prem wins than Cold Heart does. But there are drawbacks here, too. LEGO stuck together with few changes for ages, and if a team does this for long enough the same problem arises: the team’s more capable players end up with no way of separating themselves from their peers.

Another issue is when great new players explode onto the scene, like Puoskari and Maros. A total wins system with a large scope would cause inaccuracies with players like these because of season-long delays before they finally accumulate enough wins to get the status they deserve as being among Europe’s top players.

What about a percentage wins system, then? Again there are drawbacks. As things currently are in Europe, there’s no way Sorex could match Puoskari in win percentage despite them being players of probably similarly great skill. And what if a player spends two seasons underperforming on a low-level Prem team and then suddenly something clicks and he finds himself as one of the following season’s best players and winning many matches? His win percentage across those three seasons could still only be perhaps 30%. Given time the figure would correct itself, of course, but at the end of that third season he’d be ranked behind average players on average teams with ~50% win rates. Shrinking the scope to only include one or two seasons at a time helps to solve this but now we find ourselves back at the beginning again, where Cold Heart or Muuki can draw level with Sorex or Uubers.

And that’s why the TF2Metrics ranking system works the way it does. It makes plenty of mistakes of its own and it will always undervalue the more logs-modest and disciplined players like Freestate and Stark. But at the same time, it strikes a good balance between reflecting the competitive TF2 scene as it is in this particular moment, rewarding players who win matches, and rewarding players who don’t win so many matches but still excel on their team.

Nevertheless, this system will never hold a great deal of authority as to the true order of things. It will never be anywhere near as good as the ultra-informed judgement of an expert, and it’s from experts that the best and truest insight will always come.

They say the stopwatch don’t lie, but numbers in TF2 are different. They can be suggestive of things, but never fully indicative. That will always be an intrinsic weakness of a stats-based ranking system such as this.

The projection machine

As a side-project alongside the Rankings, I’ve made another little gizmo that makes predictions on the overall scorelines of upcoming matches. It looks at each team of six I give it and produces a rough scoring ratio for the two teams that’s directly based on their overall scores in the rankings. It’s only a rough guess, and the only thing it offers is a general scoring trend across all maps in whatever match-up you give it. It’s certainly not always right, but usually its predictions end up being somewhere between vaguely right and spot on. However, it’s nowhere near clever enough to account for finer details like what maps are being played.

Since it’s reliant only on the scores of a team’s individual players, the projection machine can be used for any combination of twelve players, as long as those players are familiar to it. When working with fantasy teams, especially intercontinental ones, you can rightly say its conclusions have little grounding in reality. It’s basically one step short of plucking numbers out of thin air in such situations. Still fun, though.
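To give a feel for the principle (though not the actual formula in the spreadsheet), a sketch along these lines captures the idea of splitting an assumed round count between the two line-ups in proportion to their ranking scores; the player scores and the rounds-per-map figure below are invented:

```python
def project_score_balance(blu_scores, red_scores, rounds_per_map=5.0):
    """Very loose sketch: share an assumed per-map round count between the
    two teams in proportion to their average ranking scores."""
    blu = sum(blu_scores) / len(blu_scores)
    red = sum(red_scores) / len(red_scores)
    total = blu + red
    return rounds_per_map * blu / total, rounds_per_map * red / total

# Two invented six-man line-ups, expressed as their players' ranking scores:
blu = [140, 120, 95, 180, 60, 150]
red = [200, 175, 160, 190, 140, 185]
print(project_score_balance(blu, red))  # roughly (2.1, 2.9), in Red's favour
```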

How the records work

A player’s position in the rankings is ultimately based on two numbers – the amount of times he’s made an appearance in the last 300 matches on record, and the number of times he’s been gilded in the same time frame. This section will shed more light on what exactly the records are and what they look like.

[Image: a Records entry titled ‘AFC11 - GF - Burger Apocalypse v bb Tommy’]
An example of a Records entry from the Grand Final of AsiaFortress Season 11. It’s one of over 500 to have been made so far.

This project didn’t start out as a ranking system. Originally I only wanted to log some basic match data in an Excel spreadsheet so that in the future I could look back and see how far players have come, how much team rosters have changed, and so forth. To do this I made space to log every player and their classes in the match, as well as their team names. To expand on just the results, I also decided to include a few simple stats that are available from both logs.tf and ESEA logs – these were DPM (or Heals per Minute for Medics) and KA:D. I kept track of the maps played, and what the scores were on each one. I left space for short notes where I could point out any mercs and leave my own observations on the match. I added automated visual aids to all this to help it look nicer. Taken together, these features create what is essentially a very small and simplistic infographic for each match.
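If the contents of one entry were written out as a data structure rather than a spreadsheet row, it would look roughly like this – the field names are my own invention for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PlayerLine:
    name: str
    role: str        # demoman, pocket/flank scout, pocket/roaming soldier, medic, etc.
    output: float    # DPM, or HPM for medics
    kad: float       # KA:D
    gilded: bool = False

@dataclass
class MatchRecord:
    date: str
    league: str                          # e.g. "AFC11 - GF"
    teams: Tuple[str, str]               # (Blu, Red)
    maps: List[str]
    map_scores: List[Tuple[int, int]]    # rounds won per map, (Blu, Red)
    players: List[PlayerLine]            # twelve lines, six per team
    mercs: List[str] = field(default_factory=list)
    notes: str = ""
```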

In these, I differentiate between the two scout roles – pocket and flank. The detectable differences between the two can often be more nuanced than they are with the similar soldier differentiation (especially for a layman like me), not least when you’re going based only on stats, which is often a necessity in un-casted matches featuring unfamiliar teams. When in doubt, I designate the scout who received less healing as the flank scout.

Once I’d done all this for a few matches, I started paying more attention to the DPM/HPM/KA:D numbers, as these sometimes shed light on which players excelled in matches that weren’t casted, independently of the match scores. To help with detecting this, I added some automated formatting that recognised whenever a player exceeded his counterpart on the other team in these stats. The colour scheme I’d been using was blue and red, representing the two teams, so I wanted a nice neutral colour to highlight these people. I settled on a light shade of gold, which is why I call it ‘gilding’. It was a while after I added this feature that I realised gildings could be a base upon which a ranking system could be built.

Over time I added extra gizmos to each of these entries. First I created the projection machine which simulates the match based on the players it sees. Every records entry features the machine’s prediction on what the score balance was going to be. The number on the left is the expected score of the Blu team and the right number is that of the Red team, so a prediction of 1.5-3.5 indicates an average round count of 1.5 for Blu and 3.5 for Red. Beneath this is the true score balance of the match – the average number of round wins for each team across both maps. At the bottom is an accuracy indicator which shows how closely the prediction matched reality.
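In sketch form, the ‘true score balance’ is just the per-team average of round wins across the maps played, and the accuracy indicator is some measure of how far the prediction landed from it. The actual indicator in the sheet uses its own formula; the simple distance below is only a stand-in:

```python
def true_score_balance(map_scores):
    """Average round wins per team across the maps played.
    map_scores might be [(5, 2), (1, 5)] for a two-map match."""
    maps = len(map_scores)
    blu = sum(s[0] for s in map_scores) / maps
    red = sum(s[1] for s in map_scores) / maps
    return blu, red

def prediction_error(predicted, actual):
    """Stand-in accuracy measure: total distance between prediction and reality."""
    return abs(predicted[0] - actual[0]) + abs(predicted[1] - actual[1])

actual = true_score_balance([(5, 2), (1, 5)])  # (3.0, 3.5)
print(prediction_error((1.5, 3.5), actual))    # 1.5 rounds off in total
```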

As well as this I added a section where, based on a mathematical formula, four players from the match are spotlighted as standing out from the rest. This is based on DPM/HPM and, with the former, it not only compares how each person did against his counterpart, but also against the highest value from everyone. It’s a fundamentally more subjective approach than the gilding system, which is extremely plain, and that’s why it doesn’t affect the Rankings and is instead simply another visual aid to suggest which players may have excelled the most. It lists them in descending order of perceived excellence.
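The general shape of that idea – weighing each player against both his counterpart and the match’s best output, then taking the top four – could be sketched like this. It’s not the actual formula, and it glosses over the HPM side entirely:

```python
def standouts(players, top_n=4):
    """Illustrative only: 'players' is a list of dicts with 'name', 'dpm'
    and 'counterpart_dpm' keys. Rate each player on how far he beat his
    counterpart and how close he came to the match's best DPM."""
    best_dpm = max(p["dpm"] for p in players)
    rated = []
    for p in players:
        vs_counterpart = p["dpm"] / p["counterpart_dpm"]  # >1 means he out-damaged his opposite number
        vs_best = p["dpm"] / best_dpm                      # 1.0 for the match's top damage dealer
        rated.append((vs_counterpart + vs_best, p["name"]))
    return [name for _, name in sorted(rated, reverse=True)[:top_n]]
```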

Similar to this is a gizmo that makes an estimation of how exciting the match was. The original purpose of this was partly to highlight matches that might be worth watching the VoD of if I’d missed them live, and partly because I simply wanted to see whether a match quality indicator is even a feasible thing to make based on these limited stats. The figure is based primarily on KA:Ds and is seen in the top-left corner of each match entry in varying shades of green.

Originally it was based only on how close the teams’ overall KA:Ds were to each other, and also on the outright magnitude of those numbers. This was all well and good until i58 came along and Full Tilt faced Froyotech in the group stage. In that match, Muma had a KA:D of 7.7 and Kos’s was 5.3. That means both teams had a very high average KA:D, and it produced a match quality number of something like 12 when the usual range was 0.4-4. I mended this by making it deflate the number if there’s too much variance among all the players’ KA:D numbers, in proportion to the magnitude of that variance. If this variance-based safeguard kicks in, the number in the box turns grey. That match’s score, once amended, came down to 3.43.

The figure is modified further based on how many gildings the players involved have on record. It’s boosted up quite significantly if a match features lots of heavily-gilded players, and barely changes if there are few gildings between them. Despite its complexity it often misjudges matches – seeing excitement in matches that were actually quite dull and not seeing it in matches that were thrilling. That said, the highest-scoring matches I’ve seen really were all of tremendous quality. The current record-holder is a fundraiser showmatch between Crowns and Full Tilt that happened just before i58. It sees this as the best match of the past 13 months. In a close second place is the ESEA Season 22 Grand Final between Ronin and Froyotech.
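To give a feel for the moving parts – closeness, magnitude, the variance deflator, and the gilding boost – here’s a very loose sketch. The weights, thresholds, and the variance limit below are all invented; the actual spreadsheet formula is different and rather fiddlier:

```python
from statistics import mean, pvariance

def match_quality(blu_kads, red_kads, gildings_on_record, variance_limit=2.0):
    """Loose sketch of the excitement estimator: reward matches where both
    teams' overall KA:Ds are high and close together, deflate when the
    individual KA:Ds vary wildly (the i58 problem), and boost for
    heavily-gilded line-ups."""
    blu, red = mean(blu_kads), mean(red_kads)
    closeness = 1.0 / (1.0 + abs(blu - red))    # 1.0 when the teams are dead even
    magnitude = (blu + red) / 2.0               # outright size of the numbers
    quality = closeness * magnitude

    spread = pvariance(blu_kads + red_kads)
    if spread > variance_limit:                 # the variance-based safeguard
        quality *= variance_limit / spread      # deflate in proportion to the excess

    quality *= 1.0 + 0.05 * gildings_on_record  # small boost per gilding on record
    return quality
```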

Filling in one of these match entries really doesn’t take long. All I have to do is fill the form in with the essentials such as the date, the league, the teams, the scores, etc. Filling in the performance data is easy, too, as I just copy it from logs.tf. When I paste a new form in, the DPM/HPM/KA:D cells come with an Averaging function built in, so all I need to do is add the numbers from each map and the averaging is done automatically. It takes no more than five minutes per match in total. ESEA matches take a bit longer, as I have to manually calculate the KA:Ds and HPMs because their logs don’t feature those stats. Thankfully they still include the components you need to calculate those figures yourself.
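For reference, those two derived figures are simple to work out from the raw components that are available (the numbers in the example are invented):

```python
def kad(kills: int, assists: int, deaths: int) -> float:
    """KA:D as used throughout the Records: kills plus assists, per death."""
    return (kills + assists) / max(deaths, 1)  # avoid dividing by zero on a deathless map

def hpm(healing_done: int, match_minutes: float) -> float:
    """Heals per Minute for medics: total healing divided by match length."""
    return healing_done / match_minutes

print(kad(22, 14, 12))    # 3.0
print(hpm(24600, 29.5))   # ~833.9
```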

The formatting, the projection machine, the standouts, etc, are all done automatically. It recognises the names of the players I type in and produces their gildings/entries figures next to their names. When a form is filled in, the actual Rankings spreadsheet also responds automatically to the new data. It’s not a time-consuming effort to keep things up to date. Since I started this in May last year, I’ve now got a total of over 500 completed entries, of which the last 300 influence the Rankings.