Monday, August 10, 2009

Reminder that 'Data Mining' is a Pejorative

Funny post about data mining by Jason Zweig:
The stock market generates such vast quantities of information that, if you plow through enough of it for long enough, you can always find some relationship that appears to generate spectacular returns -- by coincidence alone. This sham is known as "data mining."

...

Mr. Leinweber got so frustrated by "irresponsible" data mining that he decided to satirize it. After casting about to find a statistic so absurd that no sensible person could possibly believe it could forecast U.S. stock prices, Mr. Leinweber settled on annual butter production in Bangladesh. Over a 13-year period, he found, this statistic "explained" 75% of the variation in the annual returns of the Standard & Poor's 500-stock index.

By tossing in U.S. cheese production and the total population of sheep in both Bangladesh and the U.S., Mr. Leinweber was able to "predict" past U.S. stock returns with 99% accuracy.
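The effect described in the quoted passage is easy to reproduce with pure noise. The sketch below is an illustration under stated assumptions, not Leinweber's data or method: the "returns" and every candidate "predictor" are independent Gaussian random numbers, and the names r_squared and mine_best_predictor are made up for this example. It searches many candidates for the one with the best in-sample fit over 13 "years", then checks the same series on 13 further years it was not selected on.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-variable linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def mine_best_predictor(n_candidates=1000, n_in=13, n_out=13, seed=0):
    """Pick the noise series that best 'explains' noise returns in-sample.

    Returns (in_sample_r2, out_of_sample_r2) for the winning candidate.
    Everything here is simulated noise (an assumption of this sketch),
    so any fit that is found is a coincidence by construction.
    """
    rng = random.Random(seed)
    total = n_in + n_out
    returns = [rng.gauss(0, 1) for _ in range(total)]

    best = None
    for _ in range(n_candidates):
        series = [rng.gauss(0, 1) for _ in range(total)]
        r2_in = r_squared(series[:n_in], returns[:n_in])
        if best is None or r2_in > best[0]:
            # Score the same series on the years it was NOT selected on.
            r2_out = r_squared(series[n_in:], returns[n_in:])
            best = (r2_in, r2_out)
    return best


if __name__ == "__main__":
    in_sample, out_of_sample = mine_best_predictor()
    print(f"best in-sample R^2:     {in_sample:.2f}")
    print(f"same series, out of sample: {out_of_sample:.2f}")
```

With enough candidates, the winning in-sample R^2 tends to be large even though no series has any real relationship to the returns, while the out-of-sample figure for that same series has no reason to be. Raising n_candidates makes the in-sample number look better, which is exactly the problem.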

6 comments:

Anonymous said...

It's not a statistical issue but a linguistic one.

For instance, people now uncover spurious relationships even when circumventing the known textbook issues
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1045281

The word prediction should be reserved for out-of-sample performance (as in the statistical learning community).

Anonymous said...

'Data mining,' although debased, probably isn't consistently pejorative. 'Data dredging' is consistently pejorative, as is 'fishing expedition,' either of which should be preferred when clarity is desired.

Anonymous said...

"The word prediction should be reserved for out-of-sample performance"

Agree 100%.

Anonymous said...

From a Bayesian perspective though, it's hard not to follow a mined strategy when you find one...

As commenter #2 alludes to, most people interpret "data mining" as some sort of statistical wizardry not as a pejorative.

JG said...

"As commenter #2 alludes to, most people interpret "data mining" as some sort of statistical wizardry not as a pejorative."

"Most people" is an overstatement.

Data mining has a very long and honored history as a pejorative. In recent years some have begun using the term to describe pulling useful information out of a mass of data. That's still very much a minority usage in my experience.

I recently saw a rather heated argument over this on another site. There were a few people there arguing that "data mining" can only be used in a positive, complimentary sense. They were like kids who say "If I didn't see him on ESPN, he couldn't have been that good a player".

Word usages change. This one seems to be changing -- but it hasn't changed that much that fast.

Or maybe I just hang out with old people.

Anonymous said...

So what does the butter-cheese-sheep model say for 2010? And where can I buy an ETF that uses the model?