Let’s face it: one of the most time-consuming tasks in (fantasy) baseball is ranking any category of players, be it pitchers (starters or relievers), batters, prospects, mascots, or whatever else you need or want to rank.
Furthermore, it is usually a very subjective task, which makes the process even more difficult; you probably would not rank the same set of players and data the same way on two different occasions.
One way to build rankings is to pick one stat and simply sort the list from best to worst according to it. Then you can set fixed intervals along that stat’s range and use them to separate the players into tiers.
Let’s see an example of how this works for a list of 25 qualified pitchers and their SwStr% (percentage of pitches that end in a swing and a miss) this season through 5/2:
I’ve highlighted Ian Anderson because his SwStr% of 12.70% is very close to the league average so far in 2021. Knowing this, every pitcher with a higher SwStr% than Anderson is “above” average and every pitcher with a lower one is “below”, so we could divide the scale into equal intervals from Anderson up and from Anderson down. The ranked list, with this simple method, would now look like this:
I’ve color-coded each tier for easier viewing and placed Anderson in the tier just above the average (the 4th), although he could just as well go in the one just below.
On the surface, this might look like a good way to rank these pitchers in tiers according to their SwStr%, but there is a big issue with it: the number of divisions was selected arbitrarily. In this case the scale was divided into four upper and four lower tiers with a bigger one between them, but we could just as easily have used 5, 6, or any other number of upper and lower tiers, or simply chosen entirely different ranges.
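To see how much the result depends on that arbitrary choice, here is a minimal Python sketch of the fixed-interval approach. The pitcher names and SwStr% values are made up for illustration, and the function name `fixed_interval_tiers` and its tier numbering are assumptions of this sketch; only the 12.7% center comes from the example above.

```python
import math

def fixed_interval_tiers(stats, center, width):
    """Assign each player a tier using fixed-width intervals around `center`.

    Tier 0 covers [center, center + width), tier 1 is the next interval up,
    tier -1 is the interval just below the center, and so on.
    """
    return {
        name: math.floor((value - center) / width)
        for name, value in stats.items()
    }

# Hypothetical SwStr% values (in percent), for illustration only.
swstr = {
    "Pitcher A": 17.9,
    "Pitcher B": 14.6,
    "Pitcher C": 12.7,
    "Pitcher D": 11.8,
    "Pitcher E": 9.1,
}

# Same data, two equally defensible interval widths.
narrow = fixed_interval_tiers(swstr, center=12.7, width=1.0)
wide = fixed_interval_tiers(swstr, center=12.7, width=2.0)

for name in swstr:
    print(f"{name}: tier {narrow[name]} (1-point tiers), tier {wide[name]} (2-point tiers)")
```

With 1-point intervals, Pitcher B sits one tier above Pitcher C; with 2-point intervals they share a tier. Nothing in the data says which width is the right one, and that is exactly the problem.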
The appropriate way of doing this would be to analyze the effect of SwStr% on the dependent variable that we define as the success indicator (SIERA, FIP, etc.), then test its influence by percentile and segment accordingly. But that takes a lot of time, which we sometimes can’t afford.
So, how can we strike a balance here? Let’s talk about the normal distribution.
I’ll leave the math-y, deeper explanations to others but in a nutshell, the normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, the normal distribution will appear as a bell curve.
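To put rough numbers on “near the mean” versus “far from the mean”, the short Python sketch below uses the standard library’s `statistics.NormalDist` to compute how much of a perfectly normal population falls within one, two, and three standard deviations of the mean. The helper name `share_within` is an assumption of this sketch, and real baseball stats will only approximate these figures.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

Roughly 68% of the values land within one standard deviation, about 95% within two, and about 99.7% within three, which is why the tails of the bell curve are so thin.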
In simple terms, a lot of the quantifiable things in the world fall close to the average, and the extremes (“better” or “worse” than said average) are fewer and farther from it. For example, for every Josh Hader (40.2% CSW%, called strikes plus whiffs) or Shelby Miller (14.1%), there are way more pitchers like Liam Hendriks (27.8%); way, way more. The CSW% distribution for all pitchers through 05/04 looks like this:
Unsurprisingly, this graph resembles the bell-shaped curve of the normal distribution.
Every bar groups (“bins”) the pitchers whose CSW% falls within the same range, with each range one standard deviation wide. As an example, there are currently 239 pitchers with a CSW% between 26% and 31.2%, but at the high and low ends of the spectrum (higher than 41.6% or lower than 10.4%) there are very few.
This is the kind of distribution we can use for an initial ranking.
Using this method, for example, we can rank the most valuable pitchers in the league in tiers, according to fWAR (qualified players):
So instead of starting with a full list that we have to dissect into tiers by hand, we can have it divided by a logical criterion and then tweak it to our liking.
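For readers who prefer code, the idea can be sketched in a few lines of Python. This is only one way to do it: the player names and fWAR values are made up, the function name `sd_tiers`, the use of the sample standard deviation, and the tier numbering (tier 0 starts at the mean) are all assumptions of this sketch, and the `step` parameter simply lets you choose tiers one or one-half standard deviation wide.

```python
import math
from statistics import mean, stdev

def sd_tiers(stats, step=1.0):
    """Sort players by stat and tag each with a standard-deviation tier.

    `step` is the tier width in standard deviations (e.g. 1.0 or 0.5).
    Tier 0 starts at the mean; positive tiers are above it, negative below.
    """
    values = list(stats.values())
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (an assumption)
    ranked = sorted(stats.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, value, math.floor((value - avg) / (sd * step)))
        for name, value in ranked
    ]

# Hypothetical fWAR values, for illustration only.
fwar = {
    "Player A": 2.6,
    "Player B": 2.1,
    "Player C": 1.4,
    "Player D": 1.2,
    "Player E": 1.0,
    "Player F": 0.9,
    "Player G": 0.3,
}

full = sd_tiers(fwar, step=1.0)   # tiers one standard deviation wide
half = sd_tiers(fwar, step=0.5)   # tiers half a standard deviation wide

for name, value, tier in full:
    print(f"{name}: {value} fWAR, tier {tier}")
```

The tier boundaries now come from the spread of the data itself instead of a hand-picked interval, and switching `step` to 0.5 gives a finer split of the same list. From there, the tiers can still be adjusted by hand.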
To make things easier, I have created a simple Google Spreadsheet where you can paste any list of names or items and their associated stat, and it will rank them in tiers of one or one-half standard deviation, as you wish.
Paste the list in the designated area in the “Input” tab and get the result in the “Summary” tab; that’s it. This is the link: http://bit.ly/sd-x-ranker. You can also download it and use it locally on your computer.
This tool does not create a perfect normal distribution but provides a good and fast approximation.
So, who is more valuable, Cole or deGrom? That’s up to you to decide now.
EE, Data geek, Baseball fan. Twitter: @camarcano