Why this system works the way it does

The purest and most reliable way to rank TF2 teams is by their match results. No metric paints a better picture of a team’s relative potency than its record within its regional league. In ETF2L, Lemmings scored more points in the group stage than Nunya because they won more matches. It’s plain that this is the superior team. Yet, in the rankings used here, Lemmings are ranked below Nunya despite the unarguable fact that it was the former that performed better in the season. This is a symptom of the imperfections of the TF2 Metrics system, which stem in part from the fact that, at its core, it was designed to rank individual players first and foremost.

It’s also clear that some of its player rankings are misguided. Dr. Phil, one of Europe’s top demomen of recent seasons, barely broke into the world’s top 80 by this metric. A big part of this is simply a limitation of stats from logs. One of the first things I established in yesterday’s post was that there’s an awful lot more to being good at TF2 than producing big numbers. Stark, Freestate, and Dr. Phil are examples of this: the numbers can’t tell the whole story for individuals like these. That’s why the eye of an expert will always deserve more trust than a broad-brush number-cruncher like this. Match results do players like these better justice: Stark and Freestate both merit(ed) a position on their region’s greatest team, and Dr. Phil’s team, LEGO, was for a good while one of Crowns’ toughest playmates.

But the truths of pure match results have limitations of their own. Uubers and Muuki are on the same team and have therefore shared the same results as of late. The same applies to Sorex and Cold Heart on Lemmings. But does an identical results sheet indicate parity in raw skill? Sometimes it can, but often not. Most would agree that Uubers is a more potent TF2 player than Muuki is, and that Sorex is similarly above Cold Heart, at least for now. That is the purpose of the TF2 Metrics system: to provide a means by which players can stand out not only among their opponents but also among their teammates.

I’ll reiterate that just because this system ranks someone above somebody else because their numbers were a bit bigger, that doesn’t necessarily mean they’re the better player. It’s currently so excited about Adamracek that it believes he’s now at a level that Smirre, Adysky, and Dr. Phil haven’t reached in recent times. A level-headed analyst, however, would probably stop well short of saying something like that.

By and large, though, the right players tend to stick out. Despite being on the same team and winning and losing all the same matches together, Uubers is currently well established on Page 1 while Muuki is down in the 200s on Page 5. The latter’s situation is probably also skewed by the nature of this system, though, because it rewards DPM (damage per minute). We often see Muuki on Engie or Spy, classes hardly ideal for that. Regardless, a system based only on match results would see these two as equals.

It’s similar with Sorex and Cold Heart. I think most would agree that Sorex is a world-class scout and he holds, probably quite deservedly, the rank of 21st-best in the world by this metric, a short distance behind Yomps and Puoskari and a short way ahead of Yui and Elmo. Cold Heart, despite sharing all the same match results as him this season, actually never got gilded (which is probably rather harsh) and so is ranked low. Again, looking at this season’s match results alone would expose no skill disparity within their team.

This can of course be alleviated to a degree by expanding the scope within which the data is gathered. Cover five or six seasons rather than one or two and suddenly Sorex has miles more Prem wins than Cold Heart does. But there are drawbacks here, too. LEGO stuck together with few changes for ages, and if a team does this for long enough the same problem arises: the team’s more capable players end up with no way of separating themselves from their peers.

Another issue arises when great new players explode onto the scene, like Puoskari and Maros. A total-wins system with a large scope would rank players like these inaccurately, with delays of a season or more before they finally accumulate enough wins to get the status they deserve as being among Europe’s top players.
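To put rough numbers on that delay, here is a minimal sketch of a total-wins count. Everything in it is made up for illustration: the `total_wins` helper, the two win histories, and the season lengths are assumptions, not real data and not part of the TF2 Metrics system.

```python
def total_wins(wins_per_season, scope):
    """Sum wins over the most recent `scope` seasons (scope must be >= 1)."""
    return sum(wins_per_season[-scope:])


# Hypothetical win counts per season, oldest first.
veteran = [6, 6, 6, 6, 6, 6]     # steady mid-table player, six seasons in Prem
newcomer = [0, 0, 0, 0, 0, 12]   # zeros = not yet playing at this level

# Large scope: the veteran's backlog of wins dwarfs one brilliant season.
print(total_wins(veteran, 6), total_wins(newcomer, 6))

# Small scope: the newcomer's form shows up immediately.
print(total_wins(veteran, 1), total_wins(newcomer, 1))
```

With a six-season scope the newcomer would need several more seasons like that just to draw level, which is the kind of delay I mean.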

What about a percentage wins system, then? Again there are drawbacks. As things currently are in Europe, there’s no way Sorex could match Puoskari in win percentage despite them being players of probably similarly great skill. And what if a player spends two seasons underperforming on a low-level Prem team and then suddenly something clicks and he finds himself as one of the following season’s best players and winning many matches? His win percentage across those three seasons could still only be perhaps 30%. Given time the figure would correct itself, of course, but at the end of that third season he’d be ranked behind average players on average teams with ~50% win rates. Shrinking the scope to only include one or two seasons at a time helps to solve this but now we find ourselves back at the beginning again, where Cold Heart or Muuki can draw level with Sorex or Uubers.
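The late-bloomer case is easy to check with some arithmetic. The sketch below is only an illustration: the `win_rate` helper, the 14-match seasons, and both players’ records are hypothetical figures I’ve picked to land near the 30% mentioned above, not real results.

```python
def win_rate(seasons, window=None):
    """Win rate over the most recent `window` seasons.

    `seasons` is a list of (wins, matches_played) tuples, oldest first.
    `window=None` uses every season; otherwise window must be >= 1.
    """
    recent = seasons if window is None else seasons[-window:]
    wins = sum(w for w, _ in recent)
    played = sum(p for _, p in recent)
    return wins / played if played else 0.0


# Hypothetical records: (wins, matches played) per season, oldest first.
late_bloomer = [(2, 14), (2, 14), (9, 14)]    # two poor seasons, then a great one
average_player = [(7, 14), (7, 14), (7, 14)]  # steady 50% on an average team

# Three-season scope: the late bloomer still trails the average player.
print(round(win_rate(late_bloomer), 2), win_rate(average_player))

# One-season scope: current form wins out.
print(round(win_rate(late_bloomer, window=1), 2), win_rate(average_player, window=1))
```

Over all three seasons the late bloomer has 13 wins from 42 matches, so he sits below the steady 50% player even though he was clearly the better of the two in the latest season. Narrowing the window fixes that, at the cost described above.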

And that’s why the TF2 Metrics ranking system works the way it does. It makes plenty of mistakes of its own and it will always undervalue the more logs-modest and disciplined players like Freestate and Stark. But at the same time, it strikes a good balance between reflecting the competitive TF2 scene as it is in that particular moment, rewarding players who win matches, and rewarding players who don’t win so many matches but still excel on their team.

Nevertheless, this system will never hold a great deal of authority as to the true order of things. It will never be anywhere near as good as the ultra-informed judgement of an expert, and it’s from experts that the best and truest insight will always come.

They say the stopwatch don’t lie, but numbers in TF2 are different. They can be suggestive of things, but never fully indicative. That will always be an intrinsic weakness of a stats-based ranking system such as this.

