Gary Marcus and Ernest Davis on the limitations of big data, in the New York Times:
Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately as, and more quickly than, the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.
See also the Language Log commentary. This quote stood out to me:
Posts here on Language Log (especially those by Mark Liberman) have shown that over and over again, as any regular reader will know. 21st-century linguists would be deeply foolish to stick to typical 20th-century methodology: largely ignoring what occurs, and basing everything on personal intuitions of what sounds acceptable.