This page was last updated 19/01/2018 and is considered up-to-date.

I’ve been keeping track of certain stats from top-ish level TF2 matches since May 2016, and I’ve gradually built a ranking system around them. The way this system works has gone through a number of iterations, each more accurate than the last.

I have a spreadsheet that I use to log data from certain TF2 matches. There are guidelines I use to decide which matches should and shouldn’t go ‘on-record’. I used to include only a handful of second-tier matches (ESEA-IM, ETF2L High, etc.), but the recent addition of match weighting means the scope of what can go on-record has greatly expanded. Below are the sorts of matches that count towards the rankings:

  • All officials from the top division of ESEA, ETF2L, Ozfortress, and AsiaFortress.
  • All playoff matches of the next division down in each of the same leagues (e.g. ESEA-IM, ETF2L High, etc).
  • The grand final of the third tier down (e.g. ESEA-Open or ETF2L Mid).

The above logic also applies to secondary cups outside of each region’s regular season, e.g. the Ozfortress Midsummer Night’s Cups. Furthermore:

  • Every Invite-tier match from big LANs like Insomnia or ESA Rewind (including the group stage and matches involving promoted Open teams who have joined the Invite teams in the playoffs).
  • Playoff matches at smaller LANs (e.g. Dreamhack Battle for the North) or secondary tournaments (Essentials.TF One Night Cups).
  • The grand final of small secondary tournaments or LANs with a broadly low-High/top-Mid level of competition.
  • Any showmatch between top-level teams.

There’s an important distinction to be made between two types of tournaments – those that are split into tiers and those that aren’t. In the former, the guidelines mentioned above (all tier-1 matches, tier-2 playoffs, tier-3 finals) apply. Note that tier-1 refers to a set Invite/Prem skill level, so if the top tier in the actual tournament only features IM/High-level teams, tier-2 logic is applied. In tournaments that aren’t split into tiers but still feature a wide spread of skill (like in the Essentials.TF Monthlies), generally all playoff matches go on record as long as there’s a top-level presence in the tournament.

Matches that meet the above criteria will still not be counted if one team was clearly not trying at all (e.g. they ran two spies and no medic the whole time), or if it was a 5v6, and so forth.

[Match chart: OZF - TeaTime v Jazz Men]

Each match that meets all these criteria has a small chart made for it, like the one above. Some basic stats are logged: DPM for the combat classes and Heals per Minute for medics, plus KA/D.

The green box you’ll see in the top-left contains a number based on the general notoriety of the talent involved. This is used for match weighting (hence the anvil), which I’ll explain later.

On the opposite side of each chart is a box titled MVPs. This list is automatically generated and it’s based on stats from the match. It compares how high each player’s numbers were in relation to everyone else, and also in relation to his counterpart on the other team, and uses those comparisons to decide which four players appear the most prominent. How well each of the four players stood out is shown as a percentage (which doesn’t really have a specific meaning), while the number in brackets signifies how many times that same player has been MVP’d in the past 500 matches. It’s quite subjective and tends to undervalue certain roles, so the MVP box has no effect on the rankings.

What does affect the rankings are the DPM (or HPM) numbers, KA/D numbers, and match results. In the TF2Metrics system, players are only ever compared to the player on the other team who played the same role as them (e.g. pocket soldier, flank scout, etc.). If a player exceeds his counterpart in either DPM (or HPM) or KA/D, he receives what’s called a distinction. This is just the first of the three distinctions he can get. If he exceeds his counterpart in both of these stats, he secures the second type of distinction, so anyone who gets the second must by definition have got the first as well. The third type of distinction is awarded to every player on the winning team. If the match was a tie (e.g. one map each), nobody receives this distinction. The coloured background behind each player’s name signifies how many distinctions he got in this match: white for none, bronze for one, silver for two, and gold for all three.

This means each player can secure a maximum of three distinctions for each match he participates in, and each team of six can therefore secure a maximum total of eighteen. The share of distinctions between the two teams is shown in the two boxes at the top of each chart near the team names.
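The distinction rules above can be sketched in a few lines of Python. This is only an illustration, not the actual spreadsheet logic: the names `PlayerLine`, `count_distinctions` and `BACKGROUND` are made up for the example, and it assumes that ‘exceeds’ means strictly greater, so an exact tie on a stat earns nothing.

```python
from dataclasses import dataclass


@dataclass
class PlayerLine:
    """One player's logged stats for a match (hypothetical structure)."""
    name: str
    output_per_min: float  # DPM for combat classes, HPM for medics
    kad: float             # (kills + assists) / deaths


# Chart background colour for each distinction count, as described in the text.
BACKGROUND = {0: "white", 1: "bronze", 2: "silver", 3: "gold"}


def count_distinctions(player: PlayerLine, counterpart: PlayerLine, team_won: bool) -> int:
    """Return 0-3 distinctions for a player against his same-role counterpart.

    Assumption: a stat must be strictly higher to count as exceeding.
    `team_won` must be False for a tied match, since nobody gets the
    winner's distinction in that case.
    """
    beats_output = player.output_per_min > counterpart.output_per_min
    beats_kad = player.kad > counterpart.kad

    count = 0
    if beats_output or beats_kad:   # first distinction: ahead in either stat
        count += 1
    if beats_output and beats_kad:  # second distinction: ahead in both stats
        count += 1
    if team_won:                    # third distinction: on the winning team
        count += 1
    return count


# Example: a pocket soldier who out-damaged his counterpart but had a worse KA/D.
pocket = PlayerLine("pocket_a", output_per_min=300.0, kad=1.5)
rival = PlayerLine("pocket_b", output_per_min=250.0, kad=2.0)
print(BACKGROUND[count_distinctions(pocket, rival, team_won=True)])  # silver
```

Summing `count_distinctions` over the six players on a team gives that team’s share of the eighteen available.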

Not all matches are of equal value with regards to the rankings, however, and each match gets this value from its weighting, as shown in green in the top-left corner of each match chart. As stated earlier, match weights are determined by the overall notoriety of the players involved, so low-level matches will usually have low weights (perhaps up to 30%) while matches between the world’s best teams can stray up into the 130% range. The minimum weight for a match is 5%.

The base value of each distinction is 1, but this is modified by the match weighting. Each match also gives every player involved an additional base value of 1 in ‘experience’, a value separate from distinctions, and this value is also modified by the weighting in the same way. A 10%-weighted match would give a player a total of 0.1 ‘experience’ to his name and, if he got all three distinctions, 0.3 added to his distinctions count, which is better understood as a ‘success’ stat.
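As a rough sketch of that arithmetic, the function below turns a match weighting and a distinction count into the ‘experience’ and ‘success’ a player would take from one match. The name `match_contribution` is hypothetical, and applying the 5% minimum as a simple clamp is my shorthand for the floor mentioned earlier.

```python
MIN_WEIGHT_PCT = 5.0  # the minimum match weight stated in the text


def match_contribution(weight_pct: float, distinctions: int) -> tuple[float, float]:
    """Return (experience, success) earned by one player from one match.

    weight_pct   -- match weighting as a percentage, e.g. 10 or 130
    distinctions -- number of distinctions the player secured (0 to 3)
    """
    if not 0 <= distinctions <= 3:
        raise ValueError("a player can hold between 0 and 3 distinctions")

    weight = max(weight_pct, MIN_WEIGHT_PCT) / 100.0
    experience = 1.0 * weight            # base value of 1, scaled by the weight
    success = distinctions * 1.0 * weight  # each distinction has base value 1
    return experience, success


# The 10%-weighted example from the text: roughly 0.1 experience and 0.3 success.
experience, success = match_contribution(10, 3)
print(round(experience, 2), round(success, 2))
```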

Next to each player name in the match charts is a pair of numbers, one on a background that’s some shade of gold and one on grey. The first of these tells us how much ‘success’ a player has had in total in the preceding 500 matches (which covers perhaps 10 months), and the second is his accumulated ‘experience’ from that same time-frame. These are the stats that are used to calculate a match’s weighting. The bold numbers in the same columns next to each team name give an average of all their players’ numbers.

The reason I limit the stat pool to the last 500 matches is to help the rankings stay representative of the current scene. It means retired players gradually drop unceremoniously off the bottom of the list and current players aren’t held back or over-complimented by matches more than about 10 months old. Matches outside the most recent 500 are no longer considered ‘on-record’ since they no longer affect the rankings.
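One way to picture the 500-match window is as a fixed-length queue: each new match pushes the oldest one off the record. The sketch below uses Python’s `collections.deque` for this. The class name `MatchLog` and the shape of a match entry (a mapping from player name to his experience and success for that match) are assumptions for the example, not how the spreadsheet is actually laid out.

```python
from collections import deque


class MatchLog:
    """Keeps only the most recent `window` matches on-record."""

    def __init__(self, window: int = 500):
        # A deque with maxlen silently discards the oldest entry when full.
        self.matches = deque(maxlen=window)

    def add(self, match: dict[str, tuple[float, float]]) -> None:
        """Log a match given as {player: (experience, success)}."""
        self.matches.append(match)

    def totals(self) -> dict[str, tuple[float, float]]:
        """Sum experience and success per player over the on-record matches."""
        totals: dict[str, tuple[float, float]] = {}
        for match in self.matches:
            for player, (exp, suc) in match.items():
                old_exp, old_suc = totals.get(player, (0.0, 0.0))
                totals[player] = (old_exp + exp, old_suc + suc)
        return totals
```

A player who stops appearing in new matches simply vanishes from `totals()` once his last match leaves the window, which is the ‘dropping off the bottom of the list’ effect described above.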

The actual rankings list contains every player to have made an appearance in the last 500 matches. If they’re on a top-ish level team, this will be shown next to their name. During off-seasons, players are generally assigned to whatever team they played on last season until the new season is about to begin. Further along is their region, and their peak (the highest rank they’ve had since this system began close to two years ago). The next two columns show their total experience and success.

It’s the following two columns that decide their final score, though. Green shows their hit-rate – the percentage of all possible success points they successfully secured via distinctions in all their on-record matches. The purpose of this number is to ensure that players who don’t have many matches on-record, but were very successful in those few, are still ranked reasonably highly. The pink column (which I call the mileage) takes the opposite stance – it divides a player’s total success by three and adds it to their experience. The purpose of this is to ensure that experienced players, even if they’ve not been especially heavily rewarded, will still be ranked reasonably highly. This also has the effect of putting a soft cap on players from the slightly less active regions of Asia and Australia, who tend to host fewer matches all in all than Europe and North America.
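The two columns can be sketched as below. The mileage follows the description directly. For the hit-rate I am assuming that ‘all possible success points’ equals three times a player’s experience, because each match offers at most three distinctions at the same weight that it awards experience. The function names are made up for the example, and the small hidden boost applied to every player is left out.

```python
def hit_rate(success: float, experience: float) -> float:
    """Percentage of the available success points a player actually secured.

    Assumption: the maximum success on offer is 3 x experience.
    """
    if experience <= 0:
        return 0.0
    return 100.0 * success / (3.0 * experience)


def mileage(success: float, experience: float) -> float:
    """A third of the player's success added to his experience."""
    return success / 3.0 + experience


# A player with 10 experience and 15 success secured half of what was on offer.
print(hit_rate(15.0, 10.0))  # 50.0
print(mileage(15.0, 10.0))   # 15.0
```

The pairing is deliberate: a newcomer with two excellent matches gets a high hit-rate but low mileage, while a veteran with middling results gets the reverse.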

The final score looks at each of these two numbers in turn and compares them to the rest of the column. It counts how many players in the list have a lower hit-rate than you, and adds that to the number of players who have a worse mileage than you. The highest possible score is always 1000, so the result of the calculation is usually going to be scaled down a bit as long as there are more than 500 players in the list (currently there are close to 800), since the two counts added together could otherwise exceed 1000.
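Here is a minimal sketch of that counting step. The counting itself follows the description, but the scaling is an assumption: I have used a straight linear scale so that a player who beats everyone else in both columns lands on exactly 1000. The function name `final_scores` and the input shape are hypothetical.

```python
def final_scores(players: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Score each player from {name: (hit_rate, mileage)}.

    Raw score = (players with a lower hit-rate) + (players with a lower mileage).
    Assumption: the raw score is scaled linearly so the best possible
    result, 2 * (n - 1), maps to 1000.
    """
    n = len(players)
    max_raw = 2 * (n - 1)
    scores: dict[str, float] = {}
    for name, (rate, miles) in players.items():
        lower_rate = sum(1 for other, (r, _) in players.items()
                         if other != name and r < rate)
        lower_miles = sum(1 for other, (_, m) in players.items()
                          if other != name and m < miles)
        raw = lower_rate + lower_miles
        scores[name] = 1000.0 * raw / max_raw if max_raw > 0 else 0.0
    return scores


# Three-player toy list: (hit-rate %, mileage)
example = {"a": (60.0, 20.0), "b": (50.0, 30.0), "c": (40.0, 10.0)}
print(final_scores(example))  # {'a': 750.0, 'b': 750.0, 'c': 0.0}
```

In the toy list, `a` has the best hit-rate and `b` the best mileage, so they tie on 750, which shows how the two columns balance each other.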

Teams are also ranked in this system, and their scores are based on the average scores of every member of the team. Sometimes, though (principally during the pre-season), it has to assess teams featuring players who are currently unranked.

Unranked players are assumed to have a score of roughly 250, but the exact figure changes slightly as the list grows and shrinks. The way this all works is that all players, including those who don’t even have an entry on the list, are given a tiny behind-the-scenes boost to their hit-rate and mileage. This is invisible and doesn’t show up in the list, and it has barely any effect on people who have lots of matches on-record. It does mean, though, that people who haven’t even got an on-record match yet start out with some default stats, which are enough to give them a more reasonable starting score. If their first actual on-record match goes well, they’ll probably see that number increase, but if their opening match goes badly, they’ll likely drop down from this initial placement.

This ‘default score’ is also used by the projection machine (which spits out match result predictions based only on who’s playing) whenever it needs to account for a player who isn’t currently ranked. The way the projection machine works is it looks at the six players on each team and takes an average of all their scores. Each team’s average is then compared to that of the other and then it goes through some convoluted processes to produce a score projection.
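The averaging step, including the fallback for unranked players, might look something like this. It covers only the first stage of the projection: the later processes that turn two averages into a score projection aren’t described here, so they are not attempted. `DEFAULT_SCORE` is fixed at 250 for simplicity even though the real figure drifts slightly, and `team_average` is a made-up name.

```python
DEFAULT_SCORE = 250.0  # "roughly 250"; the real figure shifts as the list changes


def team_average(roster: list[str], scores: dict[str, float],
                 default: float = DEFAULT_SCORE) -> float:
    """Average the scores of a roster, using the default for unranked players."""
    if not roster:
        raise ValueError("roster must contain at least one player")
    return sum(scores.get(player, default) for player in roster) / len(roster)


# Hypothetical roster: five ranked players and one newcomer with no entry.
ranked = {"p1": 900.0, "p2": 850.0, "p3": 800.0, "p4": 750.0, "p5": 700.0}
roster = ["p1", "p2", "p3", "p4", "p5", "newcomer"]
print(team_average(roster, ranked))
```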

This score projection is supposed to represent an average across all maps played. Over time, as I’ve taken note of themes and disparities between the machine’s predictions and real outcomes, I’ve fine-tuned the projection machine to be more accurate. For example, it expects matches to be closer and closer the lower the average score is between the two competing teams, while matches between world-class teams are usually expected to be reasonably decisive unless their scores are extremely close. It’s also able to work with different types of score caps (like those seen on KOTH maps and in ESEA matches) and mercy rules.

The projection machine makes a prediction before each on-record match and this is listed on each match chart on the left. This projection is then compared to the true outcome and its accuracy level is determined. The inaccuracy figure is determined by how well the machine guessed the proportionality of round wins between the teams, and also by how closely it predicted the actual score figure for each team. If it predicted a 5-0 and the actual result was a 0-5 in the other team’s favour, that would be 100% inaccuracy. Usually the machine’s predictions are broadly accurate (5-25% inaccurate). Sometimes it’s spot-on, but equally often it’s quite far off (40%+ inaccuracy).

I update the rankings list weekly, usually on Tuesdays, and make a post to accompany this, complete with charts for the previous week’s matches and analysis about what’s changed regarding the pecking order since the last update.

Overall this is a simplistic system whose only requirement is the availability of certain stats. Its judgements will never be as accurate as those that come from proper analysts with an expert eye who can point out real factors that decide what makes a team or player more capable than others. This system is just a broad brush and it makes more than its fair share of errors. Stats can only tell you so much in the quest to find out who’s really the best of the best.