I’ve been keeping track of certain stats from top-level TF2 matches since May 2016, and I’ve gradually built a ranking system around them. The way this system works has gone through a number of iterations, with each exhibiting more accuracy than the last. The current version, which I first put into use for i61, is called the distinction system.

Here’s how it all works. I have a spreadsheet that I use to log top-level TF2 matches. What qualifies as a top-level TF2 match is as follows:

  • All officials from the top division of ESEA, ETF2L, Ozfortress, and AsiaFortress.
  • The grand final of the next division down in each of the same leagues (e.g. ETF2L High).
  • Any ETF2L pre-season playoffs match that results in a team getting promoted to Prem.
  • Every Invite-level match from big LANs like Insomnia or ESA Rewind.
  • Playoff matches at smaller LANs (e.g. Dreamhack Battle for the North) or secondary tournaments (Essentials.TF One Night Cups). More and more matches, counting back from the grand final, will be put ‘on-record’ depending on the overall quality of the tournament entries.
  • The grand final of small secondary tournaments or LANs with a broadly top-High/low-Prem level of competition.
  • Any showmatch between top-level teams.

Matches that meet the above criteria will still not be counted if one team was clearly not trying at all (e.g. they ran two spies and no medic the whole time), or if it was a 5v6, and so forth.

OZF - GAR eSports v Noesky

Each match that meets all these criteria has a small chart made for it, like the one above. Some basic stats are logged: DPM for the combat classes and Heals per Minute for medics, plus KA/D.

The green box you’ll see in the top-left contains a number produced by a long and convoluted formula that draws on a variety of stats. The goal is for it to estimate the general quality of the match, but it’s not always accurate. Generally this comes out between 0.5 and 5, with 0.5 suggesting a very dull match and 5 signifying a modern classic.

On the opposite side of each chart is a box titled MVPs. This list is automatically generated and it, too, is based on stats from the match. It compares how high each player’s numbers were in relation to everyone else, and also in relation to his counterpart on the other team, and uses those comparisons to decide which four players stood out the most. How strongly each of the four players stood out is shown as a percentage (which doesn’t really have a specific meaning), while the number in brackets signifies how many times that same player has been MVP’d in the past 500 matches. The list is quite subjective and tends to undervalue certain roles, so the MVP box has no effect on the rankings.

What does affect the rankings are the DPM numbers, KA/D numbers, and match results. In the TF2Metrics system, players are only ever compared to the player on the other team who played the same role as them (e.g. pocket soldier, flank scout, etc). If a player exceeds his counterpart in either DPM (or HPM for medics) or KA/D, he receives a silver background to his name. This is the first of the three distinctions available. If he exceeds his counterpart in both of these stats, he secures the second type of distinction, which gives his name a gold background. Anyone who gets this second distinction must, by definition, have got the first as well. The third type of distinction is awarded to every player on the winning team. If the match was a tie (e.g. one map each), nobody receives this distinction.

This means each player can secure a maximum of three distinctions for each match he participates in, and each team of six can therefore secure a maximum total of eighteen. The share of distinctions between the two teams is shown in the two boxes at the top of each chart near the team names. The number of distinctions a player has received, in comparison to how many matches he has participated in, is what decides his place in this ranking system.
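As a rough illustration, the distinction rules described above can be sketched in Python. The `PlayerLine` class, the `distinctions` function and the sample numbers are all made up for this example; the real charts live in a spreadsheet, and the text doesn't say how exact stat ties are handled, so the sketch assumes "exceeds" means strictly greater.

```python
from dataclasses import dataclass


@dataclass
class PlayerLine:
    """One player's row on a match chart (hypothetical structure)."""
    name: str
    output_per_min: float  # DPM for combat classes, HPM for medics
    kad: float             # KA/D


def distinctions(player: PlayerLine, counterpart: PlayerLine, team_won: bool) -> int:
    """Count the distinctions (0 to 3) a player earns in one match.

    team_won must be False for both teams when the match was a tie.
    """
    beats_output = player.output_per_min > counterpart.output_per_min
    beats_kad = player.kad > counterpart.kad
    silver = beats_output or beats_kad   # ahead in at least one stat
    gold = beats_output and beats_kad    # ahead in both stats
    return int(silver) + int(gold) + int(team_won)


scout_a = PlayerLine("Scout A", output_per_min=280.0, kad=1.9)
scout_b = PlayerLine("Scout B", output_per_min=250.0, kad=2.1)
print(distinctions(scout_a, scout_b, team_won=True))   # 2: silver plus the win
print(distinctions(scout_b, scout_a, team_won=False))  # 1: silver only
```

Because gold requires beating the counterpart in both stats, it can never be awarded without silver, which is why the per-player maximum is three.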

Next to each player name in the chart is a pair of numbers, one on a background that’s some shade of gold and one on grey. The first of these tells us how many distinctions this player has received in total from the past 500 matches (which covers about a year), and the second is the number of matches this player took part in over that same time-frame. The bold numbers in the same columns next to each team name give an average of all their players’ numbers.

The reason I limit the stat pool to the last 500 matches is to help the rankings stay representative of the current scene. It means retired players gradually drop unceremoniously off the bottom of the list and current players aren’t held back or over-complimented by matches more than a year old.
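One simple way to picture the 500-match window is a fixed-length queue, where logging a new match pushes the oldest one out. The sketch below uses Python's `collections.deque` for that; the `MatchLog` class and its method names are invented for illustration and are not how the spreadsheet is actually built.

```python
from collections import Counter, deque

WINDOW = 500  # number of most recent matches that count towards the rankings


class MatchLog:
    """Keeps only the most recent `window` on-record matches (hypothetical)."""

    def __init__(self, window: int = WINDOW):
        # A deque with maxlen silently drops the oldest entry when full.
        self.matches = deque(maxlen=window)

    def record(self, awarded: dict) -> None:
        """awarded maps player name -> distinctions earned (0 to 3) in one match."""
        self.matches.append(dict(awarded))

    def totals(self):
        """Return (distinctions per player, matches per player) over the window."""
        earned_total = Counter()
        appearances = Counter()
        for match in self.matches:
            for player, earned in match.items():
                earned_total[player] += earned
                appearances[player] += 1
        return earned_total, appearances
```

A player whose matches have all fallen out of the window simply stops appearing in the totals, which matches the way retired players drop off the list.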

The actual rankings list contains every player to have made an appearance in the last 500 matches. If they’re on a top-level team, this will be shown next to their name. Further along is their region, and their peak (the highest rank they’ve had since this system began over a year ago). The next two columns show how many matches they’ve participated in out of the last 500 (in grey), and their total number of distinctions in the same time-frame (in gold).

It’s the following two columns that decide their final score, though. Green shows their hit-rate – the percentage of possible distinctions they successfully secured in all their on-record matches. The purpose of this number is to ensure that players who don’t have many matches on-record, but were very successful in them, are still ranked quite highly. The pink column (which I call the mileage) takes the opposite stance – it divides a player’s total distinctions by three and adds the result to the number of matches they have on-record. The purpose of this is to ensure that experienced players, even if they’ve not been heavily rewarded, will still be ranked reasonably highly. This also has the effect of putting a soft cap on players from the slightly less active regions of Asia and Australia, which tend to host fewer matches all in all than Europe and North America.

The final score looks at each of these two numbers in turn and compares them to the rest of the column. It counts how many players in the list have a lower hit-rate than you, and adds that to the number of players who have a worse mileage than you. That number is your score.
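The hit-rate, mileage and final score can be sketched as below. The function names and the three sample players are invented for the example, and the sketch leaves out the hidden hit-rate boost for players with few matches. It also assumes that players with an equal hit-rate or mileage don't count as "lower", which the description doesn't spell out.

```python
def hit_rate(distinctions: int, matches: int) -> float:
    """Percentage of possible distinctions secured (three are possible per match)."""
    return 100.0 * distinctions / (3 * matches)


def mileage(distinctions: int, matches: int) -> float:
    """A third of the total distinctions, plus the number of on-record matches."""
    return distinctions / 3 + matches


def scores(records: dict) -> dict:
    """records maps player name -> (distinctions, matches), with matches >= 1.

    A player's score is the number of players with a lower hit-rate
    plus the number of players with a lower mileage.
    """
    rates = {name: hit_rate(d, m) for name, (d, m) in records.items()}
    miles = {name: mileage(d, m) for name, (d, m) in records.items()}
    result = {}
    for name in records:
        lower_rate = sum(1 for other in rates.values() if other < rates[name])
        lower_miles = sum(1 for other in miles.values() if other < miles[name])
        result[name] = lower_rate + lower_miles
    return result


records = {
    "veteran": (150, 100),  # hit-rate 50%, mileage 150
    "newcomer": (27, 10),   # hit-rate 90%, mileage 19
    "sub": (3, 4),          # hit-rate 25%, mileage 5
}
print(scores(records))  # {'veteran': 3, 'newcomer': 3, 'sub': 0}
```

In this toy list the veteran and the newcomer end up level: one gets there mostly through mileage, the other mostly through hit-rate, which is the balance the two columns are meant to strike.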

Top-level teams are also ranked in this system, and their scores are based on the average scores of every member of the team. Sometimes, though (principally during the pre-season), it has to assess teams featuring players who are currently unranked. Until quite recently, it filled any empty space with just a zero, but this always ended up being way too harsh a judgement for unranked players.

Now, unranked players are assumed to have a score of roughly 400, but the exact figure changes slightly as the list grows and shrinks. The way this works is that all players, including those who don’t even have an entry on the list, are given roughly one match’s worth of boost to their hit-rate. This is invisible and doesn’t show up in the list, and it has barely any effect on people who have lots of matches on-record. It does mean, though, that people who haven’t even got an on-record match yet start out with a de facto hit-rate of 50%, which is enough to give them a more reasonable starting score. If their first actual on-record match goes well, they’ll probably see that number increase, but if their opening match goes badly, they’ll drop down from this initial placement.
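The exact form of this boost isn't given here, so the following is only one plausible reading: every player is credited with a single phantom match in which they secured half of the three possible distinctions. That assumption reproduces the 50% starting hit-rate and fades out as real matches pile up. The function name and both parameters are hypothetical.

```python
def boosted_hit_rate(distinctions: int, matches: int,
                     phantom_matches: float = 1.0,
                     phantom_rate: float = 0.5) -> float:
    """Hit-rate (as a percentage) with an invisible phantom match mixed in.

    Assumption: the boost is one extra match at a 50% hit-rate, i.e. 1.5 of
    3 possible distinctions. The real adjustment may be calculated differently.
    """
    possible = 3 * (matches + phantom_matches)
    earned = distinctions + 3 * phantom_matches * phantom_rate
    return 100.0 * earned / possible


print(boosted_hit_rate(0, 0))  # 50.0: no on-record matches yet
print(boosted_hit_rate(3, 1))  # 75.0: a perfect first match pulls it up
print(boosted_hit_rate(0, 1))  # 25.0: a blank first match pulls it down
```

For a player with 100 matches and a raw hit-rate of 60%, the same adjustment moves the figure by only about a tenth of a percentage point.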

This ‘base score’ is also used by the projection machine whenever it needs to account for a player who isn’t currently ranked. The way the projection machine works is it looks at the six players on each team and takes an average of all their scores. Each team’s average is then compared to that of the other and goes through some processes to produce a score prediction.
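The averaging step, with the base score standing in for unranked players, might look like the sketch below. The names `team_score` and `BASE_SCORE` and the sample roster are invented, the base score is fixed at 400 even though the real figure drifts, and the later processing that turns two averages into a score prediction is not described here, so it is left out.

```python
from statistics import mean

BASE_SCORE = 400  # "roughly 400"; the real figure shifts as the list changes


def team_score(roster, ranked_scores, base_score=BASE_SCORE):
    """Average the scores of a roster, using base_score for unranked players."""
    return mean(ranked_scores.get(player, base_score) for player in roster)


ranked_scores = {"a": 700, "b": 650, "c": 600, "d": 550, "e": 400}
roster = ["a", "b", "c", "d", "e", "unranked_sub"]
print(team_score(roster, ranked_scores))  # 550, with the sub counted as 400
```

Filling the gap with zero instead would drag the same roster's average down to about 483, which shows how harsh the old approach was.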

This score projection is supposed to represent an average across all maps played. Over time, as I’ve taken note of themes and disparities between the machine’s predictions and real outcomes, I’ve fine-tuned the projection machine to be more accurate. For example, it expects matches to be closer and closer the lower the average score is between the two competing teams, while matches between absolutely world-class teams are usually expected to be reasonably decisive unless their scores are extremely close. For ESEA matches, the projection is inflated so that the winning team is always expected to win exactly five rounds.

The projection machine makes a prediction before each on-record match and this is listed on each match chart on the left. This projection is then compared to the true outcome and its accuracy level is determined. The inaccuracy figure is determined by how well the machine guessed the proportionality of round wins between the teams, and also by how closely it predicted the actual score figure for each team. If it predicted a 5-0 and the actual result was a 0-5 in the other team’s favour, that would be 100% inaccuracy. Usually the machine’s predictions are broadly accurate (5-20% inaccurate). Sometimes it’s spot-on, but equally often it’s quite far off (40%+ inaccuracy).

I update the rankings list weekly and make a post to accompany this, complete with charts for the previous week’s matches and analysis about what’s changed since the last update, plus score projections for upcoming matches.

Overall this is a simplistic system whose only requirement is the availability of certain stats. Its judgements will never be as accurate as those that come from proper analysts with an expert eye who can point out real factors that decide what makes a team or player more capable than others. This system is just a broad brush and it makes more than its fair share of errors. Stats can only tell you so much in the quest to find out who’s really the best of the best.