[Link] Big Data Hype (and Reality)

Link: http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html

Most applications of data mining and analysis have been, at their hearts, attempts to get better at prediction. Decision-makers want to understand the patterns in the past and the present in order to anticipate what is most likely to happen in the future. As big data offers unprecedented awareness of phenomena — particularly of consumers’ actions and attitudes — will we see much improvement on the predictions of previous-generation methods? Let’s look at the evidence so far, in three areas where better prediction of consumer behavior would clearly be valuable.


It took about three years before the BellKor’s Pragmatic Chaosteam managed to win the prizewith a score of 0.8567 RMSE. The winning algorithm was a very complex ensemble of many different approaches — so complex that it was never implemented by Netflix. With three years of effort by some of the world’s best data mining scientists, the average prediction of how a viewer would rate a film improved by less than 0.1 star.


With the benefit of big data, will marketers get much better prediction accuracy?

A study [pdf] that Brij Masand and I conducted would suggest the answer is no. We looked at some 30 different churn-modeling efforts in banking and telecom, and surprisingly, although the efforts used different data and different modeling algorithms, they had very similar lift curves. The lists of top 1% likely defectors had a typical lift of around 9-11. Lists of top 10% defectors all had a lift of about 3-4. Very similar lift curves have been reported in other work. (See here and here.) All this suggests a limiting factor to prediction accuracy for consumer behavior such as churn.

[…] Finally, let’s turn to the challenge of predicting the click-thru rate (CTR%) of an online ad — clearly a valuable thing to get right, given the sums changing hands in that business. […]

The average CTR% for display ads has been reported as low as 0.1-0.2%. Behavioral and targeted advertising have been able to improve on that significantly, with researchers reporting up to seven-fold improvements. But note that a seven-fold improvement from 0.2% amounts to 1.4% — meaning that today’s best targeted advertising is ignored 98.6% of the time.

What are we to conclude from these three areas — all of them problems with fine, highly motivated minds focused on them? To me, they suggest that the randomness inherent in human behavior is the limiting factor to consumer modeling success. Marginal gains can perhaps be made thanks to big data, but breakthroughs will be elusive as long as human behavior remains inconsistent, impulsive, dynamic, and subtle.

Tags: ,

  1. jsalvatier’s avatar

    That last one seems like a counter evidence to me. The problem seems really hard to me, so I’m not surprised CTRs are low. A 7 fold improvement is gigantic and the kind of improvement that suggests further gains are possible.

Comments are now closed.