SoxProspects News
ericmvan and jmei debate player projection
ericmvan
Veteran
Supposed to be working on something more important
Posts: 9,027
Post by ericmvan on Nov 15, 2013 2:41:53 GMT -5
Quote: "I agree that a player can't just turn it on during a contract year... I think Salty has progressed every year he has been on the team. That being said, his BABIP was very, very high and something he won't be able to replicate next year. With that coming back down to earth around the league average, he should hit around .240 with 15-20 HR and 60-70 RBI. That is very solid production for a switch-hitting catcher who has developed great chemistry and is willing to take a hometown discount on a deal. Give him a 3-year, $30 million deal."

Well, I just did a regression analysis on all 1241 player seasons (minimum 400 PA) from 2008-2013, the era of rising strikeout rates.

The analysis is based on the idea (backed up by a wealth of evidence, which I'll summarize in a bit) that the fundamental hitting stat, after K% and UBB%, is extra-base hits per contact. XBH are seldom lucky. XBH/C, or XBC for short, is the purest measure of how hard a guy has hit the ball. Hence when we say of a prospect like Cecchini that "we expect some of those doubles to start turning into homers," we are talking about a stylistic change within an established and constant talent level.

Once you know a guy's K%, UBB%, and XBC, you can predict his HR per XBH. Flyball hitters will consistently exceed that prediction by a bit and groundball hitters will consistently fall a bit short, but the first three rates do set the general expected HR/XBH (or HRX for short) level, because guys with more power as indicated by XBC will also tend to have a higher percentage of those XBH leave the yard (and ditto for guys with better strike zone command).

And once you know those four numbers, you can predict a measure that is completely independent of HRX (which is one of the advantages of starting with XBH): the percentage of remaining balls in play that are singles, i.e. singles / (singles + outs in play), which I call 1B% for simplicity.
This is the expected number of singles given your strike zone command, how hard you hit the ball, and where you hit it (since it uses the actual HRX, which is modified by your flyball / groundball tendencies; all things being equal, as HRX goes up, 1B% goes down, because flyball hitters have lower BABIPs). Fast guys will exceed their expected 1B% and slow guys will fall short. (A later version of this will incorporate metrics for speed.)

So. Saltalamacchia. Based on his 2013 K%, UBB%, XBC, and HRX, you'd expect a 1B% of .261. He actually had .265. That's exactly one more hit than expected. Since he's slow, it's actually probably 2 or 3 lucky hits.

First conclusion (to be superseded, though): if his XBH total was not inflated by luck, then his BABIP was barely inflated either. (I believe I earlier reached a similar conclusion by looking at the breakdown of each type of ball in play and comparing them to league average.)

But wait. The thing that's actually freaky about Saltalamacchia's 2013 was the crazy low percentage of XBH that went for HR: .259, which is to say, 40 2B and 14 HR. Salty had the 18th best XBC in these 1241 seasons, but the second lowest HRX among the top 40 seasons.

Salty's predicted and actual HRX the last three years:

Year    Predicted   Actual
2011    .517        .381
2012    .502        .581
2013    .512        .259

(The actual HRX averages .482 over 2011-2012.)

Let's substitute a .482 for his actual HRX, which means turning 12 doubles into homers. Now his expected 1B% drops to .245, which means he had 5 lucky hits. Instead of hitting .273 / .338 / .466, he hits .261 / .328 / .511, which is a better season (.281 estimated EqA versus .276).

So the question comes down to: was his .188 XBC inflated by luck? Of those 40 doubles, how many were lucky and cheap, and how many more was that than the average player would have had? I actually remember looking at every one of his doubles and posting the list here somewhere, and all but a few were either deep fly balls or line drives. So let's say he had two lucky doubles.
That drops his XBC to .181, and since he had a .175 two years ago, when his K rate was an adjusted .322 (versus .298 this year) and his UBB rate was .059 (versus .086 this year), I find that credible. Now we end up losing seven hits according to the projection, and he has a .256 / .323 / .496 line, which is precisely the same .276 EqA as his actual season.
Conclusion: he probably had some good luck on balls in play, which was offset by a lot of bad luck in getting balls to leave the yard instead of hitting near the tops of walls. Salty has improved both his K rate and UBB rate each of the last three seasons, and he's consistently been in the top 10% for hardness of contact in all of MLB. He can be expected to regress some next year, because guys having their best season yet usually do, but if so, virtually none of that regression will be because he was lucky. (A key to understanding his career is that his 1B% the two previous seasons was far lower than expected: .212 and .207 versus an expected .257 and .235.) On the other hand, given that he's only 29 and that catchers often mature late as hitters, and given his strike zone command improvement, a season as good or better would not be at all surprising.

The argument for this new set of metrics in a nutshell: they correlate with one another less than the more traditional HR/Contact, BABIP, and XBH/Hits in Play, which is to say they do a better job of measuring three different aspects of hitting. And yet their collective year-to-year variation (from 1980 to 2013) in overall MLB rates is a lot less than that of the traditional set.

And if you look at batting lines by base-out state, this set of metrics all correlate with one another across the 24 states and all correlate inversely with both K rate and UBB rate, which is not true for the conventional set (or any other alternative). So when pitchers try to pitch around hitters, they actually succeed in reducing all three of these metrics, consistent with the idea that they are three real aspects of hitting. If you look at the conventional set, you see XBH/Hits in Play tending to increase as pitchers try to pitch around hitters, which suggests that it doesn't represent anything real; it's just what's left over after we start with BABIP and then do the next logical thing by looking at HR/Contact.
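For anyone who wants to check the arithmetic in the post above, here is a minimal Python sketch. The helper name `contact_rates` and its arguments are hypothetical, and treating contact as singles + XBH + outs in play is an assumption inferred from the 1B% definition (the post does not say how sacrifice flies or errors are handled). The Saltalamacchia inputs are only the figures quoted in the post: 40 doubles, 14 homers, a .188 XBC, and the substituted .482 HRX; zero triples is implied by .259 = 14/54.

```python
def contact_rates(singles, doubles, triples, homers, outs_in_play):
    """Return XBC, HRX and 1B% from counting stats (hypothetical helper)."""
    xbh = doubles + triples + homers
    # Assumption: contact = every ball put in play plus home runs.
    contact = singles + xbh + outs_in_play
    return {
        "XBC": xbh / contact,                       # extra-base hits per contact
        "HRX": homers / xbh,                        # share of XBH that are homers
        "1B%": singles / (singles + outs_in_play),  # singles per non-XBH contact
    }


# Saltalamacchia 2013, using only numbers quoted in the post.
doubles, homers = 40, 14
xbh = doubles + homers

hrx_actual = homers / xbh                    # actual HR/XBH
homers_at_482 = round(xbh * 0.482)           # homers if HRX had been .482
doubles_converted = homers_at_482 - homers   # doubles turned into homers

implied_contact = xbh / 0.188                # contact implied by the .188 XBC
xbc_less_two = (xbh - 2) / implied_contact   # XBC after removing two lucky doubles

print(round(hrx_actual, 3), doubles_converted, round(xbc_less_two, 3))
```

The three printed values line up with the .259 HRX, the 12 converted doubles, and the .181 adjusted XBC quoted in the post.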
Post by jmei on Nov 15, 2013 8:30:52 GMT -5
ericmvan said: "The analysis is based on the idea (backed up by a wealth of evidence, which I'll summarize in a bit) that the fundamental hitting stat, after K% and UBB%, is extra base hits per contact. XBH are seldom lucky."

Do you mind posting the data for this? What is the year-to-year correlation of XBC and HRX?
Post by ericmvan on Nov 15, 2013 9:28:18 GMT -5
jmei said: "Do you mind posting the data for this? What is the year-to-year correlation of XBC and HRX?"

Haven't done the individual player YTY correlations. That's the next step. However, that's just one of the things you're looking at.

Standard deviation / mean of annual MLB rates, 1980-2013:

Metric               SD / mean
1B%                  .016
BABIP                .025
XBH/HIP (aka XHP)    .065
HRX                  .069
XBC                  .108
HRC                  .166

There has been a lot more YTY variation in overall MLB HRC rates than in XBC rates. So when we start with HRC, we are overlooking the evidence that XBC is more fundamental, and that HRC is basically double-counting hardness of contact (HOC), because as XBC rises, so does the proportion of XBH that leave the park.

Inter-correlations (across the 34 years):

HRC and BABIP    .873
HRC and XHP      .913
XHP and BABIP    .861

For some purposes, this is good. The three metrics really tend to move together from year to year. But what I believe you're doing here is taking generic HOC and parceling it out into three metrics, each of which also contains some specific skills.

HRX and 1B%      .533
XBC and 1B%      .631
XBC and HRX      .866

This looks like it does a much better job of isolating three different skills, and of course logically it does, too. Jose Iglesias really didn't have a fluke BABIP year; he had a fluke 1B% year, and future efforts to measure his actual talent at infield hits will work better looking at 1B% than BABIP.

What is really interesting is that in predicting the annual MLB K rate, the two sets yield very different regression formulas, although both do an excellent job. The prediction using HRC and BABIP has no terms for annual BB rate; its influence on K rate is no longer discernible when you're using three different measures of general HOC.
The regression with XBC and 1B% does include BB rate, and uses fewer HOC variables (including interactions). But what really grabs me is this: in neither regression are terms for "steroid era" significant. Yet the regression using HRC and BABIP is generally more accurate, while the regression using XBC and 1B% kicks its butt in two places: the steroid era and the run-up to the 1987 HR peak. I'm looking into that further today, and then I hope to write the K rate prediction stuff up for publication.
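The two league-level summaries in the post above (standard deviation divided by mean, and the correlation between two metrics across seasons) can be sketched as follows. This is only an illustration: the function names are hypothetical, the use of the population standard deviation is an assumption (the post does not say which form was used), and the yearly rates are made-up numbers, not MLB data.

```python
from math import sqrt
from statistics import mean, pstdev


def sd_over_mean(annual_rates):
    """Standard deviation divided by mean for one metric's yearly rates."""
    # Assumption: population SD; a sample SD would give slightly larger values.
    return pstdev(annual_rates) / mean(annual_rates)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up yearly league rates for two metrics, for illustration only.
xbc_by_year = [0.080, 0.084, 0.090, 0.088]
hrx_by_year = [0.250, 0.262, 0.285, 0.270]

print(sd_over_mean(xbc_by_year), pearson(xbc_by_year, hrx_by_year))
```

Run once per metric (or per pair of metrics) over the 34 seasons, this would produce the kind of figures listed in the post.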
Post by jmei on Nov 15, 2013 9:40:03 GMT -5
I asked because I don't think non-HR extra-base hit rates are a great proxy for hardness of contact. There are a lot of extra-base hits that do strike me as somewhat "lucky" (or at least as non-probative of hard contact): ground balls down the line and line drives into the gap, for instance. Moreover, 2B/3B rates are also heavily influenced by ballpark and speed, two exogenous factors that muddy your hypothesis.
I do agree with your broader point re: Saltalamacchia (that he's going to hit more home runs next year, which will offset at least some of the BABIP regression), but you haven't convinced me that XBC or HRX are stable enough year-to-year to be used in projecting a player the way you have here.
EDIT: Obviously, YTY regressions would be a huge undertaking, and I don't expect you to do them. However, expect me to remain skeptical of this type of analysis going forward, though I won't push back too much in any individual thread.
Post by ericmvan on Nov 16, 2013 3:55:14 GMT -5
jmei said: "I asked because I don't think non-HR extra-base hit rates are a great proxy for hardness of contact. There are a lot of extra-base hits that do strike me as somewhat 'lucky' (or at least as non-probative of hard contact): ground balls down the line and line drives into the gap, for instance. Moreover, 2B/3B rates are also heavily influenced by ballpark and speed, two exogenous factors that muddy your hypothesis.

"I do agree with your broader point re: Saltalamacchia (that he's going to hit more home runs next year, which will offset at least some of the BABIP regression), but you haven't convinced me that XBC or HRX are stable enough year-to-year to be used in projecting a player the way you have here.

"EDIT: Obviously, YTY regressions would be a huge undertaking, and I don't expect you to do them. However, expect me to remain skeptical of this type of analysis going forward, though I won't push back too much in any individual thread."

First, a point: HR are even more influenced by park than 2B and 3B are. I think the number of balls that are ambiguously singles or doubles, depending on the player's speed, is actually fairly small; certainly the influence of speed on doubles is less than the influence of swing trajectory (FB/GB rates) on HR, given equal hardness of contact. So the exogenous factors on HR are as large or larger.

Now, I'm really good with Excel ... it took me 23 minutes to get my first set of Y2Y correlations and 2.5 hours to do about as thorough a job as I can imagine.

Here are the correlations for every possible contact metric, 1980-2013, minimum 400 PA (if you'd like to know some other time frame or minimum number of PA, that would be trivial to do). I've bolded the conventional metrics and put my proposed alternatives in red.
Metric                  Y2Y r    Set
K%                      .870
HRC                     .804     conventional
HR/Hit                  .803
BB%                     .777
HR/XBH                  .759     proposed
XBH/C (XBC)             .739     proposed
XBH/H                   .721
CBA                     .525
1B/C                    .484
XBH/BIP                 .451
BABIP                   .442     conventional
XBH/HIP                 .426     conventional
1B/(1B+OIP) aka 1B%     .420     proposed
23/C                    .419

HRC does have the best YTY correlation, but it's the product of two things that I'm proposing should be separated, both of which have correlations 92% - 94% as strong. (The product of two things which correlate with each other, each of which has a high correlation across time, I think, always yields an even higher correlation.) If you take HRC, you are left with one of two low-correlation pairs (BABIP and XBH/HIP, or XBH/BIP and 1B%). My proposal is the only way to get two high-correlation metrics.

OK, let's repeat this for 250 PA minimum. And I'll just report the percentage of the correlation strength retained, which is to say, this is the rank by sensitivity to sample size (least sensitive to most), confounded by whatever extra variance you get by adding less talented, low-PA players to the sample:

Metric          Retained   Set
HRC             .988       conventional
K%              .983
HR/Hit          .983
XBH/C (XBC)     .981       proposed
XBH/H           .970
BB%             .967
HR/XBH          .960       proposed
CBA             .944
XBH/BIP         .943
23/C            .926
XBH/HIP         .934       conventional
BABIP           .899       conventional
1B/C            .893
1B%             .888       proposed
I've always felt that 1B% was better than BABIP at isolating luck, and this backs that up. However, overall the new breakdown stands up to small sample sizes better, because HR/XBH doesn't deteriorate quite as much as XBH/HIP. And XBC (which stands up 99.3% as well as HRC does) stands up better than BB%, which is impressive.

And now I'm going to do just the six metrics we're focusing on (plus K% and BB%), minimum 250 PA to get a larger sample size, for five different eras:

1980 - 1987, increasing K and hardness of contact
1988 - 1992, retrenchment*
1993 - 1996, the rise resumes
1997 - 2005, steroid era; relationship of MLB K rate to HOC much less predictable
2008 - 2013, rising K rate (PITCHf/x and increased velocity)

*I'm beginning to think that the famous 1987 spike in homers was largely the culmination of a trend of increasing strength from weight training, plus a bit of warm weather, and that MLB deadened the ball afterwards. In this era, K rates (a function of hardness of swing) were higher than predicted from HOC measures and walk rates, indicating that the reduced HOC was being generated by harder swings than the model thinks it was.

Here's a table with the results (BBP = BABIP):

Era          N      K%     BB%    HRC    BBP    XHP    XBC    HRX    1B%
1980-1987    1130   .836   .760   .787   .398   .341   .684   .733   .394
1988-1992    643    .860   .739   .806   .381   .324   .693   .758   .348
1993-1996    446    .825   .754   .799   .320   .349   .696   .739   .322
1997-2005    1254   .833   .768   .806   .327   .380   .733   .719   .332
2008-2013    808    .851   .694   .731   .328   .373   .658   .657   .360

So what here might not be random?

-- A steady decline in the Y2Y correlation of 1B% from 1980 to the mid-90's. Which is to say a reduction in the skill component (bunts, fast players who get a lot of infield hits). It may have risen again post-steroids.

-- A spike in the Y2Y consistency of XBC (or XBH / Hits in play) in the steroid era, maybe. That would happen if you had a greater stratification of talent, i.e., guys using and guys not using. There was no change in the standard deviation, though.
So did PED use somehow reduce Y2Y variation for users?

-- A collapse in the Y2Y consistency of walk rates and hardness of contact (XBC and HRX, or their product HRC) in the exploding-K rate era. I'm not sure anyone has noticed that, and it's very interesting.

I think the argument here is extremely solid. Both XBH/Contact and HR/XBH correlate really well from year to year. They correlate with one another, as well (.678). The extra strength of the Y2Y correlation of HRC doesn't mean it's a more fundamental stat than XBC; it happens because you're adding some flyball / groundball tendencies to XBC. If you put all your eggs into the HRC basket, you're left with two noisy stats. One of those, BABIP, is clearly not as good for isolating the luckiest outcomes as 1B% is, conceptually or empirically.

In short, what hitters try to do is hit the ball hard. Yes, there are lucky doubles and triples, but most are hit significantly harder than singles. If you excel at hitting the ball hard, you're going to have more XBH per contact, and a higher percentage of those XBH are going to be home runs. It's answering two questions: what percentage of the balls you hit were hit hard? What percentage of the hard hit balls were hit really hard (and up in the air)?

That gives you a double scale. You can differentiate guys who consistently hit the ball hard, but not super-hard (e.g., Pedroia) from guys who crush the ball. (Adding FB/GB data would do an even better job, but for the time being I'm trying to limit myself to data that will be available across all historical eras, and has no subjective element*.) Multiplying the two together to get HRC just combines your two more nuanced measures, so of course it correlates even better, but the price that you pay is that you don't have a second non-noisy stat.

*Hard hit balls that first land on the mid to outer dirt part of the infield are often scored as line drives by BIS and ground balls by MLB, or vice versa.
Aha, here, I think, is the smoking gun that doubles and triples can and should be first lumped together with home runs, not with singles.

The within-year correlation of HR/Contact and 1B/Contact (all 6596 players with 400+ PA) is -.440. Guys who hit more of one tend to hit fewer of the other. So, no, as we all agree, you want to separate those.

The correlation of (2B+3B)/Contact to HR/Contact is .260.
The correlation of (2B+3B)/Contact to 1B/Contact is -.184.

Clearly, they're a form of power hitting, not a form of non-home-run hitting.
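The year-to-year correlations reported in this post were done in Excel; the sketch below shows one way the same pairing could be set up in Python. The record layout (dicts with `player`, `year`, `team`, `pa` keys) and the function names are hypothetical. Pairing consecutive seasons only, and requiring the same team in both, are assumptions; the same-team restriction follows a clarification made later in the thread.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def year_to_year_pairs(seasons, metric, min_pa):
    """Pair each player season with the same player's next season."""
    by_key = {(s["player"], s["year"]): s for s in seasons}
    pairs = []
    for (player, year), first in by_key.items():
        second = by_key.get((player, year + 1))
        if second is None:
            continue                      # no following season on record
        if first["team"] != second["team"]:
            continue                      # assumption: same team both years
        if min(first["pa"], second["pa"]) < min_pa:
            continue                      # both seasons must clear the PA cutoff
        pairs.append((first[metric], second[metric]))
    return pairs


def year_to_year_r(seasons, metric, min_pa):
    """Y2Y correlation of one metric across all qualifying season pairs."""
    xs, ys = zip(*year_to_year_pairs(seasons, metric, min_pa))
    return pearson(xs, ys)


# Relative strength of the XBC and HR/XBH correlations versus HRC,
# using the 400 PA figures quoted in the post.
print(round(0.739 / 0.804, 3), round(0.759 / 0.804, 3))
```

The "percentage of correlation strength retained" at 250 PA would then be `year_to_year_r(seasons, metric, 250) / year_to_year_r(seasons, metric, 400)` for each metric.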
Post by jmei on Nov 16, 2013 15:06:14 GMT -5
ericmvan said: ".870 K%, .804 HRC, .803 HR/Hit, .777 BB%, .759 HR/XBH, .739 XBH/C (XBC), .721 XBH/H, .525 CBA, .484 1B/C, .451 XBH/BIP, .442 BABIP, .426 XBH/HIP, .420 1B/(1B+OIP) aka 1B%, .419 23/C

"HRC does have the best YTY correlation, but it's the product of two things that I'm proposing should be separated, both of which have correlations 92% - 94% as strong. (The product of two things which correlate with each other, each of which has a high correlation across time, I think, always yields an even higher correlation.) If you take HRC, you are left with one of two low-correlation pairs (the other being XBH/BIP and 1B%). My proposal is the only way to get two high-correlation metrics."

If I'm reading this correctly, your argument that HRC is just the product of XBC and HR/XBH is not exactly accurate, no? XBC is instead combining 23/C and HRC together, right? That involves combining one stat with a super high YTY correlation (HRC) with the one with the lowest YTY correlation amongst hitting stats (23/C) (23/C rates are even more variable than 1B%!). Maybe I'm missing something, but why would we want to do that? Is it because of the deterioration analysis? Because I think that's mostly hand-waving, predominantly driven by low-PA players who are injured or who do not have MLB-level skills, who are by far the most likely candidates to have small PA samples in any given season.

It still seems to me that HRC is the more fundamental stat that you're diluting by including 2B/3B in the equation. Yes, that includes some GB/FB noise, but I'm not sure why you think that's a bad thing. I'm not trying to get the cleanest theoretical explanation for player performance here. I'm trying to find the quickest and easiest way to project player performance, even if I have to resort to "messy" methods to do so. Indeed, messiness seems like an advantage, not a flaw.
If I can use one stat to stand in for multiple skills, all the better: it saves me a load of time and effort, and I haven't seen any evidence that your cleaner stats are more predictive.

Along the same lines... ...I'm not getting why this is so meaningful. We're concerned about player projections here, and for those purposes, I don't see why these correlations are relevant. My argument is not that 2B/3B should be lumped with singles instead of home runs. My argument is that 2B/3B should be in its own bucket, one that is far closer in YTY variability to 1B than HR. So the fact that a hitter hit a bunch of doubles in one year may mean he "hit the ball hard" that year, but it also means that there's a very good chance he won't hit that many doubles the next year.
Post by ray88h66 on Nov 16, 2013 20:44:08 GMT -5
No intention of joining this debate, but I find it interesting. Some of it is beyond me, but it's thought-provoking. Two smart posters. Carry on and thank you.
Post by ericmvan on Nov 16, 2013 21:01:28 GMT -5
jmei said: "If I'm reading this correctly, your argument that HRC is just the product of XBC and HR/XBH is not exactly accurate, no? XBC is instead combining 23/C and HRC together, right? That involves combining one stat with a super high YTY correlation (HRC) with the one with the lowest YTY correlation amongst hitting stats (23/C) (23/C rates are even more variable than 1B%!). Maybe I'm missing something, but why would we want to do that? Is it because of the deterioration analysis? Because I think that's mostly hand-waving, predominantly driven by low-PA players who are injured or who do not have MLB-level skills, who are by far the most likely candidates to have small PA samples in any given season.

"It still seems to me that HRC is the more fundamental stat that you're diluting by including 2B/3B in the equation. Yes, that includes some GB/FB noise, but I'm not sure why you think that's a bad thing. I'm not trying to get the cleanest theoretical explanation for player performance here. I'm trying to find the quickest and easiest way to project player performance, even if I have to resort to 'messy' methods to do so. Indeed, messiness seems like an advantage, not a flaw. If I can use one stat to stand in for multiple skills, all the better: it saves me a load of time and effort, and I haven't seen any evidence that your cleaner stats are more predictive.

"Along the same lines... ...I'm not getting why this is so meaningful. We're concerned about player projections here, and for those purposes, I don't see why these correlations are relevant. My argument is not that 2B/3B should be lumped with singles instead of home runs. My argument is that 2B/3B should be in its own bucket, one that is far closer in YTY variability to 1B than HR. So the fact that a hitter hit a bunch of doubles in one year may mean he 'hit the ball hard' that year, but it also means that there's a very good chance he won't hit that many doubles the next year."

Edit: Oh, BTW, I did mean to say that the data only includes players who played for the same team in each year.

That XBC adds to HRC a noisier group of data (23/C) and hence dilutes it is absolutely true, and a good argument for why we should still look at HRC. However, and this goes back to Salty, what XBC does capture is the real phenomenon of power manifesting itself sometimes as HR and sometimes as 2B/3B. Mike Napoli also had a freakishly low HRX compared to his career last year, so I think that the Sox may well have run into some factors, both home and road (probably weather), that kept a lot of hard-hit balls in the park. IOW, if a player's HRC drops but his XBC remains the same, that tells you something important.

The way to look at this statistically: if there were not a real trade-off between HR and 2B/3B, then the difference in Y2Y correlation between HRC and XBC would be much larger than it is. That it's quite close tells you that XBC is also capturing good and bad luck on balls that were hit hard enough to be homers, and/or stylistic alterations or changes in level of power that turned 2B/3B into HR or vice versa.
I think a proper system for normalizing data for luck (like the Sox weather this year) would look at a bunch of estimates predicting the various rates from other rates. Whether that could be codified into a system that identifies what's really going on is unclear. For the time being, for me, it's going to be an art form. You look at the various ways the metrics have moved in a given year, knowing how they ordinarily relate to one another, and try to identify what luck factors were at work.

However, having said that, I am now wondering whether there might be a statistical way of looking at a career set of HRC and XBC numbers and figuring out the true XBC rate in a given year. The reason why 23/C correlates so piss-poorly is that it is a mixture of three things:

-- hard hit balls that might have gone for homers in some other year but didn't, hence noisy
-- hard hit balls that are legit 2B and should arguably correlate strongly from year to year; true "doubles power"
-- lucky doubles that might have been singles or even outs

What we want is a way of dividing up the 2B/3B into the three groups. Hmm, in years in which XBC was inflated or deflated by luck, would we see the opposite effect on 1B%?

So far, all I've done is regression analyses, without a lot of deep thought as to what the resulting formulas might mean. That's the next step. Thanks for this discussion, because it has indeed made me think about what each metric means and what it might be measuring.
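The relationships between the metrics argued over in the last few posts (XBC = HRC + 23/C, and HRC = XBC x HRX) can be made concrete with a short Python sketch. The helper name `power_split` is hypothetical, and the contact total of 287 is an assumption: it is simply the 54 XBH divided by the .188 XBC from the first post, rounded. The second line of inputs is the "12 doubles turned into homers" scenario from that post.

```python
def power_split(doubles, triples, homers, contact):
    """Break extra-base power into the rates discussed in the thread."""
    hrc = homers / contact                    # HR per contact
    c23 = (doubles + triples) / contact       # 2B+3B per contact (23/C)
    xbc = hrc + c23                           # XBH per contact
    hrx = homers / (doubles + triples + homers)
    return {"HRC": hrc, "23/C": c23, "XBC": xbc, "HRX": hrx}


CONTACT = 287  # assumption: 54 XBH / .188 XBC, rounded

actual = power_split(doubles=40, triples=0, homers=14, contact=CONTACT)
shifted = power_split(doubles=28, triples=0, homers=26, contact=CONTACT)

for label, rates in (("actual", actual), ("shifted", shifted)):
    print(label, {name: round(value, 3) for name, value in rates.items()})
```

Both lines show the same XBC with a very different HRC, which is the "HRC drops but XBC remains the same" pattern described above; the gap shows up entirely in 23/C and HRX.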
Post by jmei on Nov 16, 2013 22:50:53 GMT -5
ericmvan said: "That XBC adds to HRC a noisier group of data (23/C) and hence dilutes it, is absolutely true, and a good argument for why we should still look at HRC. However, and this goes back to Salty, what XBC does capture is the real phenomenon of power manifesting itself sometimes as HR and sometimes as 2B/3B. Mike Napoli also had a freakishly low HRX compared to his career last year, so I think that the Sox may well have run into some factors, both home and road (probably weather) that kept a lot of hard-hit balls in the park. IOW, if a player's HRC drops but his XBC remains the same, that tells you something important."

Fair enough, but I still think the majority of that importance is captured by regressing HRC alone to career norms. The fact that a player's XBC stayed the same as well doesn't tell you that much more, I think, at least in the age range that guys like Saltalamacchia and Napoli are in (i.e. age-related decline shouldn't be catastrophic yet).

ericmvan said: "However, having said that, I am now wondering whether there might be a statistical way of looking at a career set of HRC and XBC numbers and figuring out the true XBC rate in a given year. The reason why 23/C correlates so piss-poorly is that it is a mixture of (Hard hit balls that might have gone for homers in some other year but didn't, hence noisy) plus (hard hit balls that are legit 2B and should arguably correlate strongly from year to year; true 'doubles power') plus (lucky doubles that might have been singles or even outs). What we want is way of dividing up the 2B/3B into the three groups. Hmm, in years in which XBC was inflated or deflated by luck, would we see the opposite effect on 1B%?"

I think this is a spot-on distinction and gets to my skepticism of 23/C. I agree that there is some element of 23/C that gets you to the "hits the ball hard" stat that you want to find, but there's enough noise that I'd rather just rely on some variation on home run rates.
This is where even limited HITf/x data would be supremely useful.
Post by ericmvan on Nov 18, 2013 2:46:58 GMT -5
It may be unclear that I haven't actually decided anything about any of this. Arguing strongly for one idea over another, for me, is just part of the process of exploration. It helps raise additional questions.
I just, in fact, outlined eleven further steps in this line of research. And here are those notes to myself:
1. Include XBH/BIP (XBP), because I really should be considering HRC / XBP / 1B% as well as the other two ways of breaking down contact.
2. Get my list of new ballparks and recode the teams appropriately (Yankees 2 for the new Yankee Stadium, etc.). Classify each pair of player seasons as same franchise and park, same franchise but different park (Montreal to Washington is a special case, ditto Washington to Texas, etc., if we go back that far), or different franchise. Now I can see how each metric correlates for those three types (although the middle group will be noisy, we can at least see which of the other two it more strongly resembles).
3. To control for low-PA guys being of fundamentally different quality than high-PA guys, settle on some threshold (400 PA or equivalent) and run three sets of Y2Y correlations: both seasons 400+, one season 400+, neither. I'm not sure if that will answer the question, so keep messing around until I'm confident that I have a way of preventing the data from being confounded unnecessarily by such guys.
4. Using 400 or 250 PA as a cutoff is correct for K% and BB% and hence in general, but when investigating sensitivity of the contact metrics to sample size, we should use Contacts; Rob Deer could get 700 PA and his contact metrics would still be SSS.
5. Code each pair of same-player seasons by minimum (PA and) Contacts between them, allowing a permanent sort by those, and hence a graph of Y2Y correlation strength of each metric as function of Contacts. And hence determine empirically where the correlations start to weaken.
(Note that the results of the above may depend on the order that I do the steps! Have to think about that. I may want to do step 5, then 2 through 4, then 5 again.)
6. Now decide whether I need to re-run my within-year XBC regression with a tweaked data set (excluding players who changed parks or including players who changed teams, and/or a higher minimum for PA).
7. Next, derive the within-year predictors for 2008-2013 for the other two models. Compare regression strengths, and maybe how each model predicts overall EqA.
8. Again with the data set properly defined, do a set of Y2Y correlations regressing each of 1BC, 23C, and HRC against their predecessors and the preceding K and BB. Also do K and BB for comparison with the next step. That should be really informative!
9. Do the same thing with each metric set predicting K, BB, and its own set of metrics from itself. That’s fifteen more lengthy regressions …
10. And maybe do a set of regressions predicting 1BC, 23C, and HRC from the lumped sets, which is to say, fifteen more regressions. (Is there any point in seeing how the unlumped metrics predict the lumped ones? I don't think so.)
11. Once you have derived regression models for each metric set, you don’t simply pick one of the metric sets as the best predictor; you collect the actual seasons where one model was much more accurate than another, and see if you can find any pattern there.
Each of the regressions, BTW, is actually a process of selecting 16 variables from a potential twenty, and then doing about 10 successive regressions, each time removing the least-significant variable, until you're left with a set of highly significant ones. And even then, there's no guarantee that you've found the best possible regression. I've just thought of a methodology to do a bit of a double check on that, though.
How much work is all this? Less than you might think, given enough facility with Excel. I'm going to try to do an hour or two each day.
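Steps 3 and 5 of the list above could be prototyped outside Excel along these lines. This is a sketch under stated assumptions, not the method actually used: the function names, the tuple layout `(contacts_year1, contacts_year2, metric_year1, metric_year2)`, the bin edges, and the three-pair minimum per bin are all hypothetical choices.

```python
from bisect import bisect_right
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def pa_class(pa_first, pa_second, threshold=400):
    """Step 3: label a season pair by how many seasons reach the threshold."""
    reached = (pa_first >= threshold) + (pa_second >= threshold)
    return {2: "both", 1: "one", 0: "neither"}[reached]


def correlation_by_min_contacts(pairs, edges):
    """Step 5: Y2Y correlation within bins of min(contacts) per pair.

    `edges` are sorted lower bounds; pairs below the first edge are dropped.
    Returns {edge: (number_of_pairs, correlation_or_None)}.
    """
    bins = {edge: ([], []) for edge in edges}
    for c1, c2, m1, m2 in pairs:
        index = bisect_right(edges, min(c1, c2)) - 1
        if index < 0:
            continue
        xs, ys = bins[edges[index]]
        xs.append(m1)
        ys.append(m2)
    return {
        edge: (len(xs), pearson(xs, ys) if len(xs) >= 3 else None)
        for edge, (xs, ys) in bins.items()
    }
```

Plotting the per-bin correlation against the bin edge would give the graph described in step 5, and show empirically where each metric's correlation starts to weaken.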