This page was last updated on 04/02/2018 and needs a few very minor additions, but nothing below is untrue and it can be read without being led astray.
The most important thing to understand about TF2Metrics is that it’s just a bit of fun – an experiment in playing with statistics. To put it in GCSE terms, the purpose of TF2Metrics is more to entertain than it is to inform. The second-most important thing to know is that none of this stops me from taking it seriously and trying hard to make it as accurate as possible.
In the dichotomy of inform and entertain, the rankings themselves fall very predominantly into the latter category. None of the conclusions it spits out are supposed to be taken as gospel. This ranking system was designed from day 1 to be, above pretty much all else, versatile. By being based around only a few simple statistics, players who I’ve never even seen play before can be ranked with the same precision as someone I’ve known about for years. Centring the rankings around players rather than teams also means it can make predictions about how good a newly-formed team might be, and it can respond quickly to roster changes. However, limited stats can only ever tell you so much. If, say, you really wanted to get an idea of who the world’s best roamers are and why, you’re far, far better off having a chat with someone like Nuze than skimming through TF2Metrics.
The rankings’ most useful feature, from a purely informational perspective, might be how I keep it updated with the rosters of the world’s more prominent teams. If, for example, you want to catch up with what the current Ozfortress teams are, and get a rough idea of how they compare to each other, you can do this easily using the rankings.
Also on the informative side of things lies the accompanying blog. The meat of this is made up of the weekly Update posts which provide an overview of how the rankings have shifted over the course of the preceding week. However, there’s more on offer here. If you want to quickly catch up on the previous week’s matches, you can do that without much trouble using the Update posts. Each one features charts for all the matches from the week that influenced the rankings (details on that later). Each Update post can therefore be treated as a weekly TF2 newsletter with a basic rundown of all that happened in ESEA, ETF2L, Ozfortress, and so on.
This project originally came to be almost two years ago – I’ve been keeping track of certain stats from top-ish level TF2 matches since May 2016, and I’ve gradually built a ranking system around it. The way this system works has gone through a number of iterations, with each exhibiting more accuracy than the last.
I have a spreadsheet that I use to log data from certain TF2 matches. There are guidelines I use to decide what matches should and shouldn’t go ‘on-record’. I used to only include a handful of second tier (ESEA-IM, ETF2L High, etc) level matches, but the recent addition of match weighting means the scope of what can go on-record has greatly expanded. Below are the sorts of matches that count towards the rankings:
- All officials from the top division of ESEA, ETF2L, Ozfortress, AsiaFortress, and Chapelaria.
- All playoff matches of the next division down in each of the same leagues (e.g. ESEA-IM, ETF2L High, etc).
- The grand final of the third tier down (e.g. ESEA-Open or ETF2L Mid).
The above logic also applies to secondary cups outside of each region’s regular season, e.g. the Ozfortress Midsummer Night’s Cups. Furthermore:
- Every Invite-tier match from big LANs like Insomnia or ESA Rewind (including the group stage and matches involving promoted Open teams who have joined the Invite teams in the playoffs).
- Playoff matches at smaller LANs (e.g. Dreamhack Battle for the North) or secondary tournaments (Essentials.TF One Night Cups).
- The grand final of small secondary tournaments or LANs with a broadly low-High/top-Mid level of competition.
- Any showmatch between top-level teams.
There’s an important distinction to be made between two types of tournaments – those that are split into tiers and those that aren’t. In the former, the guidelines mentioned above (all tier-1 matches, tier-2 playoffs, tier-3 finals) apply. Note that tier-1 refers to a set Invite/Prem skill level, so if the top tier in the actual tournament only features IM/High-level teams, tier-2 logic is applied. In tournaments that aren’t split into tiers but still feature a wide spread of skill (like in the Essentials.TF Monthlies), generally all playoff matches go on record as long as there’s a top-level presence in the tournament.
Matches that meet the above criteria will still not be counted if one team was clearly not trying at all (e.g. they ran two spies and no medic the whole time), or if it was a 5v6, and so forth.
Each match that meets all these criteria has a small chart made for it, like the one above. Some basic stats are logged: DPM for the combat classes and Heals per Minute for medics, plus KA/D.
The green box you’ll see in the top-left contains a number based on the general notoriety of the talent involved. This is used for match weighting (hence the anvil), which I’ll explain later.
On the opposite side of each chart is a box titled MVPs. This list is automatically generated and it’s based on stats from the match. It compares how high each player’s numbers were in relation to everyone else, and also in relation to his counterpart on the other team, and uses those comparisons to decide which four players appear the most prominent. How well each of the four players stood out is shown as a percentage (which doesn’t really have a specific meaning), while the number in brackets signifies how many times that same player has been MVP’d in the past 500 matches. It’s quite subjective and tends to undervalue certain roles, so the MVP box has no effect on the rankings.
What does affect the rankings are the DPM (or HPM) numbers, KA/D numbers, and match results. In the TF2Metrics system, players are only ever compared to the player on the other team who played the same role as them (e.g. pocket soldier, flank scout, etc). If a player exceeds his counterpart in either DPM (or HPM) or KA/D, he receives what’s called a distinction. This is just the first of the three distinctions you can get. If he exceeds his counterpart in both of these stats, he secures the second type of distinction. If you manage to get this type of distinction, you also by definition must have got the first as well. The third type of distinction is awarded to every player on the winning team. If the match was a tie (e.g. one map each), nobody receives this distinction. The coloured background behind each player’s name signifies how many distinctions they got in this match: white for none, bronze for one, silver for two, and gold for all three.
This means each player in the match can secure a maximum of three distinctions for each match he participates in, and each team of six can therefore secure a maximum total of eighteen. The share of distinctions between all the teams is shown in the two boxes at the top of each chart near the team names.
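The distinction rules above can be sketched as a small function. This is only my reading of the description – the function name, signature, and example numbers are invented for illustration, not taken from the actual spreadsheet:

```python
def distinctions(player_stat, opp_stat, player_kad, opp_kad, team_won, tied):
    """Count distinctions (0-3) for one player against his counterpart.

    player_stat / opp_stat: DPM for combat classes, HPM for medics.
    """
    count = 0
    beat_stat = player_stat > opp_stat
    beat_kad = player_kad > opp_kad
    if beat_stat or beat_kad:    # first distinction: exceed counterpart in either stat
        count += 1
    if beat_stat and beat_kad:   # second distinction: exceed counterpart in both
        count += 1
    if team_won and not tied:    # third distinction: be on the winning team
        count += 1
    return count

# A player who out-damaged and out-KA/D'd his counterpart on a winning team:
print(distinctions(280, 240, 7.5, 5.1, team_won=True, tied=False))  # 3 -> gold
```

Note the second distinction can only ever arrive together with the first, matching the "by definition" point above.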
Not all matches are of equal value with regards to the rankings, however, and each match gets this value from its weighting, as shown in green in the top-left corner of each match chart. As stated earlier, match weights are determined by the overall notoriety of the players involved, so low-level matches will usually have low weights (perhaps up to 30%) while matches between the world’s best teams can stray up into the 130% range. The minimum weight for a match is 5%.
The base value of each distinction is 1, but this is modified by the match weighting. Each match also gives every player involved an additional base value of 1 in ‘experience’, a value separate from distinctions, and this value is also modified by the weighting in the same way. A 10%-weighted match would give a player a total of 0.1 ‘experience’ to his name and, if he got all three distinctions, 0.3 added to his distinctions count, which is better understood as a ‘success’ stat.
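The weighting arithmetic above is simple enough to sketch directly (the function name is mine):

```python
def apply_weight(distinction_count, weight):
    """Turn one match's raw numbers into weighted 'success' and 'experience'.

    weight is the match weighting as a fraction (0.10 for a 10% match).
    Every participant banks `weight` experience; each distinction earned
    is worth `weight` success.
    """
    experience = 1 * weight
    success = distinction_count * weight
    return success, experience

# The worked example above: a 10%-weighted match with all three distinctions
success, experience = apply_weight(3, 0.10)
print(round(success, 2), round(experience, 2))  # 0.3 0.1
```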
Next to each player’s name in the match charts is a pair of numbers, one on a background that’s some shade of gold and one on grey. The first of these tells us how much ‘success’ a player has had in total in the preceding 500 matches (which covers perhaps 10 months), and the second is his accumulated ‘experience’ from that same time-frame. These are the stats that are used to calculate a match’s weighting. The bold numbers in the same columns next to each team name give an average of all their players’ numbers.
The reason I limit the stat pool to the last 500 matches is to help the rankings stay representative of the current scene. It means retired players gradually drop unceremoniously off the bottom of the list and current players aren’t held back or over-complimented by matches more than about 10 months old. Matches outside the most recent 500 are no longer considered ‘on-record’ since they no longer affect the rankings.
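In code terms, the on-record pool behaves like a fixed-length queue – a sketch of the mechanism, not the actual spreadsheet logic:

```python
from collections import deque

# Keep only the most recent 500 matches 'on-record'. Logging match 501
# silently drops match 1, so retired players' stats age out on their own.
on_record = deque(maxlen=500)

def log_match(match):
    on_record.append(match)  # the oldest match is discarded once the deque is full
```

After logging 600 matches, only the latest 500 remain in the pool and contribute to anyone's totals.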
The actual rankings list contains every player to have made an appearance in the last 500 matches. If they’re on a top-ish level team, this will be shown next to their name. During off-seasons, players are generally assigned to whatever team they played on last season until the new season is about to begin. Further along is their region, and their peak (the highest rank they’ve had since this system began close to two years ago). The next two columns show their total experience and success.
It’s the following two columns that decide their final score, though. Green shows their hit-rate – the percentage of all possible success points they successfully secured via distinctions in all their on-record matches. The purpose of this number is to ensure that players who don’t have many matches on-record, but were very successful in those few, are still ranked reasonably highly. The pink column (which I call the mileage) takes the opposite stance – it divides a player’s total success by three and adds it to their experience. The purpose of this is to ensure that experienced players, even if they’ve not been especially heavily rewarded, will still be ranked reasonably highly. This also has the effect of putting a soft cap on players from the slightly less active regions of Asia and Australia, who tend to host fewer matches all in all than Europe and North America.
The final score looks at each of these two numbers in turn and compares them to the rest of the column. It counts how many players in the list have a lower hit-rate than you, and adds that to the number of players who have a worse mileage than you. The highest possible score is always 1000, so the result of the calculation is usually going to be scaled down a bit as long as there are more than 500 players in the list (currently there are close to 800).
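Putting the green column, the pink column, and the final-score comparison together gives something like the sketch below. This follows my reading of the description above; the exact normalisation to 1000 is a guess, and players are represented here as simple (success, experience) pairs:

```python
def hit_rate(success, experience):
    """Green column: share of all possible success actually secured.
    Each match offers 3x its weight in success, and weight equals the
    experience earned, so the ceiling is 3 * experience."""
    return success / (3 * experience) if experience else 0.0

def mileage(success, experience):
    """Pink column: experience plus a third of total success."""
    return experience + success / 3

def final_score(player, everyone):
    """Count how many players you beat on each column, then scale so that
    beating everyone on both comparisons scores 1000."""
    hr, mi = hit_rate(*player), mileage(*player)
    beaten = sum(hr > hit_rate(*p) for p in everyone if p is not player)
    beaten += sum(mi > mileage(*p) for p in everyone if p is not player)
    return round(1000 * beaten / (2 * (len(everyone) - 1)))

players = [(9.0, 5.0), (3.0, 4.0), (1.0, 2.0)]  # (success, experience) pairs
print(final_score(players[0], players))  # 1000 -- beats both players on both columns
```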
These final numbers are then modified one more time to give each player’s proper score. Each region represented in the rankings has an automatically-calculated modifier that’s based on how much experience players from that region currently have on record all in all. For example, at the time of writing, Australia’s modifier comes out at about 60%. However, this modifier doesn’t apply equally to all players from the region – instead, players only take a certain proportion of it, the size of which is decided by how far down the rankings they would have been without it. For example, an Australian who would’ve been ranked halfway down the list would have his score modified at 80% – in other words, he’d take half of his regional modifier. Another Australian otherwise ranked three quarters of the way up would face a 90% modifier. Doing it this way means it’s still possible for Australian and even Asian players to hold very prominent ranks.
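The proportional modifier works out as below – a sketch that reproduces the two worked examples above, with parameter names of my own choosing:

```python
def regional_score(raw_score, region_modifier, rank_fraction_down):
    """Apply a proportional share of the regional modifier.

    region_modifier: e.g. 0.60 for Australia (a 40% penalty at full strength).
    rank_fraction_down: how far down the pre-modifier list the player sits,
    0.0 = top, 1.0 = bottom. A player halfway down takes half the penalty.
    """
    penalty = (1 - region_modifier) * rank_fraction_down
    return raw_score * (1 - penalty)

# The worked examples: a 60% modifier felt at half strength is an 80% modifier,
print(regional_score(100, 0.60, 0.5))   # 80.0
# and at quarter strength (three quarters of the way up) it is 90%.
print(regional_score(100, 0.60, 0.25))  # 90.0
```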
Teams are also ranked in this system, and their scores are based on the average scores of every member of the team. Sometimes, though (principally during the pre-season), it has to assess teams featuring players who are currently unranked.
Unranked players are assumed to have a base score of roughly 250 (this figure is also subject to regional modifiers, so an unknown Australian may have a base score closer to something like 150), but the exact figure changes slightly as the list grows and shrinks. The way this all works is that all players, including those who don’t even have an entry on the list, are given a tiny behind-the-scenes boost to their hit-rate and mileage. This is invisible and doesn’t show up in the list, and it has barely any effect on people who have lots of matches on-record. It does mean, though, that people who haven’t even got an on-record match yet start out with some default stats, which are enough to give them a more reasonable starting score. If their first actual on-record match goes well, they’ll probably see that number increase, but if their opening match goes badly, they’ll likely drop down from this initial placement.
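One plausible way to implement that invisible boost is a Laplace-style smoothing prior added to everyone's totals before the columns are computed. This is a guess at the mechanism, and the prior sizes below are invented purely for illustration:

```python
def smoothed_columns(success, experience, prior_success=0.5, prior_experience=1.0):
    """Add a small invisible prior to a player's totals before computing the
    hit-rate and mileage columns (prior sizes here are made up, not the real
    values). A player with no on-record matches gets pure prior stats, which
    is what produces the default starting score; a veteran barely notices."""
    s = success + prior_success
    e = experience + prior_experience
    hit_rate = s / (3 * e)      # green column, as defined above
    mileage = e + s / 3         # pink column, as defined above
    return hit_rate, mileage

# A player with no on-record matches still gets non-zero default columns:
print(smoothed_columns(0.0, 0.0))
# A veteran's columns barely move:
print(smoothed_columns(300.0, 400.0))
```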
This ‘default score’ is also used by the projection machine (which spits out match result predictions based only on who’s playing) whenever it needs to account for a player who isn’t currently ranked. The way the projection machine works is it looks at the six players on each team and takes an average of all their scores. Each team’s average is then compared to that of the other and then it goes through some convoluted processes to produce a score projection.
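The first step of the projection machine – averaging team scores with a default for unranked players – might look like the following. The 250 figure comes from above; the roster scores are invented:

```python
DEFAULT_SCORE = 250  # the unranked fallback described above, pre regional modifiers

def team_average(scores):
    """Average a roster's six scores, substituting the default base score
    for anyone who isn't currently ranked (represented here as None)."""
    return sum(DEFAULT_SCORE if s is None else s for s in scores) / len(scores)

# A roster carrying one unknown player:
print(team_average([612, 580, 544, 501, 476, None]))  # roughly 493.8
```

The two team averages would then be compared to produce the score projection and win percentages.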
This score projection is supposed to represent an average across all maps played. It’s also capable of producing percentage win chances for the teams involved. Over time, as I’ve taken note of themes and disparities between the machine’s predictions and real outcomes, I’ve fine-tuned the projection machine to be more accurate. For example, it expects matches to be closer and closer the lower the average score is between the two competing teams, while matches between world-class teams are usually expected to be reasonably decisive unless their scores are extremely close. It’s also able to work with different types of score caps (like those seen on KOTH maps and in ESEA matches) and mercy rules.
The projection machine makes a prediction before each on-record match and this is listed on each match chart on the left. This projection is then compared to the true outcome and its accuracy level is determined. The inaccuracy figure is determined by how well the machine guessed the proportionality of round wins between the teams, and also by how closely it predicted the actual score figure for each team. If it predicted a 5-0 and the actual result was a 0-5 in the other team’s favour, that would be 100% inaccuracy. Usually the machine’s predictions are broadly accurate (5-25% inaccurate). Sometimes it’s spot-on, but equally often it’s quite far off (40%+ inaccuracy).
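A plausible reconstruction of the inaccuracy figure is below. The real blend of round-share error and raw-score error isn't published, so this is only a guess that matches the stated endpoints (0% for a perfect call, 100% for a predicted 5-0 that ended 0-5):

```python
def inaccuracy(pred_a, pred_b, real_a, real_b):
    """Guessed reconstruction: average the error in the predicted round share
    with the error in the raw score figures. 0.0 = perfect prediction,
    1.0 = total reversal."""
    pred_total = pred_a + pred_b
    real_total = real_a + real_b
    pred_share = pred_a / pred_total if pred_total else 0.5
    real_share = real_a / real_total if real_total else 0.5
    share_error = abs(pred_share - real_share)
    score_error = (abs(pred_a - real_a) + abs(pred_b - real_b)) / (2 * max(pred_total, real_total, 1))
    return (share_error + score_error) / 2

print(inaccuracy(5, 0, 0, 5))  # 1.0 -- a complete reversal
print(inaccuracy(5, 2, 5, 3))  # a small figure, in the 'broadly accurate' band
```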
I update the rankings list weekly, usually on Tuesdays, and make a post to accompany this, complete with charts for the previous week’s matches and analysis about what’s changed regarding the pecking order since the last update.
Overall this is a simplistic system whose only requirement is the availability of certain stats. Its judgements will never be as accurate as those that come from proper analysts with an expert eye who can point out real factors that decide what makes a team or player more capable than others. This system is just a broad brush and it makes more than its fair share of errors. Stats can only tell you so much in the quest to find out who’s really the best of the best.