To estimate the empirical survival curves, we rely on a large and geographically diverse data set from a major financial services firm. Data includes credit ratings for the borrower at the time of loan origination. Inclusion of this important variable helps ensure unbiased estimation of the coefficients of other risk factors, such as current loan-to-value ratio and changes in local unemployment rates. We should also acknowledge data limitations: it only includes loans originated during 1993-1997 time period when house prices in most (though not all) markets were stable or increasing.
Sounds great, as if they are just modeling cross-sectional risk, dipping a toe in the empirical pool. After all, a sample of roughly four years with no cyclical volatility would be irrelevant for modeling a worst-case scenario, and they seem to realize this.
But then after torturing the data for 30 pages, the authors conclude with:
We find that the current regulatory standards for capital are too high in most cases.
No mention in the conclusion of the lack of a real cycle in their sample data! They knew the data's limitations, but by the end they ignored them. In other words, forget about the business cycle: we have a large number of observations! I see this a lot in default modeling, where someone will look at a bunch of daily bond data and say they have 400,000 observations in the default model, ignoring the fact that ten years of daily IBM data is not 2,520 observations but more like 3.
It's a common problem: mistaking the number of observations for degrees of freedom, when the correlation structure underlying the data may drastically reduce the effective degrees of freedom. Making sure your sample is appropriate is a big issue in all social science: often someone will observe how college kids respond to stimuli to predict how people in general respond, or assume men are the same as women. There is no simple cure other than to be thoughtful about the specific application.
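To put a rough number on how correlation eats degrees of freedom, here is a minimal sketch using the textbook approximation for the mean of an AR(1) series, n_eff = n(1 - rho)/(1 + rho). The function name and the rho values are my own illustrative assumptions, not estimates from any bond or mortgage data, and real credit data has cross-sectional and cyclical dependence that this simple formula does not capture.

```python
def effective_sample_size(n, rho):
    """Approximate count of independent observations among n draws
    from an AR(1) process with lag-1 autocorrelation rho.

    Standard approximation for estimating a mean; other statistics
    and other dependence structures would need a different formula.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


# Ten years of daily data at about 252 trading days per year.
n_daily = 252 * 10

# Hypothetical autocorrelation levels, chosen only for illustration.
for rho in (0.0, 0.5, 0.9, 0.99):
    n_eff = effective_sample_size(n_daily, rho)
    print(f"rho={rho:<4}  nominal n={n_daily}  effective n={n_eff:.1f}")
```

The point is the direction and the size of the effect: as persistence rises, thousands of nominal observations shrink toward a handful, which is the same intuition as counting cycles rather than days.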