I've worked with finance professors on consulting projects, and cherry-picking recent data and calling something 'out of sample' when the test is applied iteratively is quite common. An important postulate to remember is that there are no true out-of-sample backtests, only tests of subsample stability. Researchers invariably know the entire dataset in question, so 'out-of-sample' results are really models that, when fit on a subsample and applied to its complement, generate the best fit. That is, quants try models sequentially until they find one that works well 'out of sample,' which means the data is not really out of sample.
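To see why iterating over a holdout destroys its meaning, here is a minimal simulation (my own illustration, not from any paper): fifty candidate signals with zero true predictive power are each fit on the first half of the sample and scored on the second half, and the best-scoring one is kept. The winner's 'out-of-sample' R2 is inflated purely by the selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_candidates = 480, 50          # ~40 years of monthly data, 50 candidate signals
y = rng.standard_normal(n)         # returns with no true predictability
X = rng.standard_normal((n, n_candidates))

train, test = slice(0, n // 2), slice(n // 2, n)

def r2_oos(x):
    """Fit y = a + b*x on the first half, report R2 on the second half."""
    b, a = np.polyfit(x[train], y[train], 1)
    resid = y[test] - (a + b * x[test])
    return 1 - resid.var() / y[test].var()

scores = [r2_oos(X[:, j]) for j in range(n_candidates)]
print(f"best 'out-of-sample' R2 across candidates: {max(scores):.3f}")
print(f"median candidate R2 (what an honest holdout looks like): {np.median(scores):.3f}")
```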
That's not to say out-of-sample tests are meaningless, just that they take a lot of self-discipline, because much of this work is done out of view, and the easiest person to fool is often oneself; it's very tempting to believe things that imply many self-serving benefits. This is why integrity is a virtue: it's hard, uncommon, and helpful. It's tempting to over-promote your own pet idea, as tendentious advocacy can seem necessary in the real world where 'everybody does it.' But the biggest mistake knowledge-workers make is not a logical error or failing to solve a complicated problem; it is working on something that is a dead end, because that implies you've wasted a large part of your career: an expert on input-output models, Keynesian macro models, or dynamic programming isn't valuable for making decisions. Fooling yourself into believing a false model simply wastes your time.
Consider John Cochrane and Monika Piazzesi's 'Bond Risk Premia' paper, which presents a model that forecasts one-year bond excess returns with a 44% R2. Both are competent academics whom I generally respect, as I think they are smart, careful, and do research in good faith. Their model says that if you look at the current forward rates from the US Treasury yield curve, the first five forwards predict year-ahead bond returns very well (to be precise, these are 'excess' returns, so they subtract the 1-year bond yield).
What is this model? Basically, you run an ordinary least squares regression of the 10-year bond's return over the next year (minus the 1-year yield) on the forwards. They looked at the 1964-2004 period, which has 467 monthly datapoints, but because these are year-ahead returns, we really only have 39 fully independent datapoints, which is not a very large sample (most year-ahead returns are highly correlated because they share much of the same data). So the basic pattern they found was
year-ahead 10yr bond return - 1yr yield = a + b1*f1 + b2*f2 + b3*f3 + b4*f4 + b5*f5

where f1 through f5 are the 1- through 5-year forward rates.
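As a concrete sketch, here is how one might estimate that regression, assuming a DataFrame of monthly observations with columns 'xr' (the year-ahead 10-year excess return) and 'f1' through 'f5' for the forwards; the column names and layout are my placeholders, not Cochrane and Piazzesi's code or files.

```python
import numpy as np
import pandas as pd

def cp_regression(df: pd.DataFrame):
    """OLS of year-ahead excess returns ('xr') on a constant and forwards 'f1'..'f5'."""
    X = np.column_stack([np.ones(len(df))] +
                        [df[f"f{k}"].to_numpy() for k in range(1, 6)])
    y = df["xr"].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return beta, r2   # beta = [a, b1, b2, b3, b4, b5]
```

The point estimates are fine from plain OLS, but with overlapping year-ahead returns ordinary standard errors understate the uncertainty, which is why regressions like this are typically reported with Hansen-Hodrick or Newey-West corrections.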
You can download his data here. Now, the first problem I found is that his bond data is a bit fishy. He used bond data from CRSP, and 2- and 4-year US Treasury datapoints are pretty uncommon. His 4- and 5-year forward yield changes have a suspiciously low correlation. In any case, I took the H15 data, using their 1-, 3-, and 5-year monthly bond yields, and generated pretty much the same result: a tent-shaped set of coefficients on the forwards (approximately equal and negative for the 1- and 5-year forwards, positive and larger for the 3-year forward), and my R2 for the same 1964-2003 sample period Cochrane and Piazzesi use was a large 31%.
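For what it's worth, here is one plausible way to back forwards out of the H15 1-, 3-, and 5-year constant-maturity yields; the exact construction isn't spelled out above, and treating H15 yields as zero-coupon rates is itself an approximation.

```python
def forwards_from_yields(y1, y3, y5):
    """Average forward rates implied by spot yields: f(m, n) = (n*yn - m*ym) / (n - m)."""
    f_1_3 = (3 * y3 - 1 * y1) / 2   # forward rate covering years 1 to 3
    f_3_5 = (5 * y5 - 3 * y3) / 2   # forward rate covering years 3 to 5
    return y1, f_1_3, f_3_5         # the 1yr spot plus the two forwards

# e.g. forwards_from_yields(0.04, 0.045, 0.05) -> (0.04, 0.0475, 0.0575)
```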
The coefficients suggest higher returns the more concave the forward curve is, and lower returns the more convex. This doesn't really make any sense: there's no intuition as to why this 'tent' of coefficients should be related to risk or utility; it just comes out of a best fit of the data.
If we look at the subsequent 7 years, that same set of coefficients that worked so well in-sample for 1964-2003 doesn't work at all for 2004-2010 (the last datapoint is the return from 9/2010 through 9/2011).
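A minimal sketch of that check, reusing the cp_regression helper from the earlier sketch (again with my assumed column names, not the actual code behind the post): fit the coefficients on the 1964-2003 window, then score them unchanged on 2004-2010.

```python
import numpy as np

def oos_r2(df_in, df_out):
    """Fit coefficients in-sample, then score them unchanged out-of-sample."""
    beta, _ = cp_regression(df_in)                 # in-sample window coefficients
    X_out = np.column_stack([np.ones(len(df_out))] +
                            [df_out[f"f{k}"].to_numpy() for k in range(1, 6)])
    y_out = df_out["xr"].to_numpy()
    resid = y_out - X_out @ beta
    # R2 relative to the unconditional mean of the evaluation window
    return 1 - np.sum(resid ** 2) / np.sum((y_out - y_out.mean()) ** 2)

# usage: oos_r2(df[df.index < "2004-01-01"], df[df.index >= "2004-01-01"])
```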
So it seems a classic overfitting of the data. Sure, the pattern could have simply stopped, but given the model had no intuition, no causal mechanism, just an unlikely set of coefficients, it was almost surely an overfit. Such results were considered rubbish for a while prior to Freakonomics and behavioral finance, as the development of CRSP into a data source led to a lot of stupid correlation papers in the 1970s and 1980s, but the success of other atheoretical findings (momentum) has unleashed non-intuitive correlations into top-tier journals.
As academics, this will always be a plus on Cochrane and Piazzesi's vitae because it made a top-tier journal (AER 2005), but as practitioners this would have gotten them fired. Thus for academics, overfit theories that generate publications have little downside compared to what they would cost a practitioner.
1 comment:
Intuitively, it should be a basic property of economic history that excess-return predictability disappears once arbitrageurs are aware of it. Therefore, treating a 1964-2003 versus 2004-2010 split as proof of over-fitting is questionable (the other way around, 1971-2010 and 1964-1970, would be more like it). Possibly a model with a break-point can be fit and the historical point where the effect was discovered can be pinpointed.
If the effect-plus-arbitrageur story fits, it would further be interesting to explain the effect itself.
In any case, I agree about the care needed with integrity, biases, etc., and that for an active manager the final test of a theory may be in the eating of the pudding.