Correlation, Causation, and Big Data

    


What do you do with so much information that you can't analyze it manually? This is the central question surrounding big data.

Currently, most of the algorithms we have work like search functions: they count how often one thing occurs and compare that with how often other things occur. This gives us correlations and, as we've all heard so often that it has become something of a golden rule, correlation does not imply causation.
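To make the idea of comparing frequencies concrete, here is a minimal sketch of one common way to measure how closely two series of counts move together, the Pearson correlation coefficient. The function name, the series names, and the numbers are all invented for illustration; nothing here comes from a real dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly counts, invented for illustration only.
fever_queries = [120, 135, 180, 260, 310, 290]
clinic_visits = [30, 33, 41, 60, 72, 69]

r = pearson(fever_queries, clinic_visits)
print(f"correlation: {r:.3f}")
```

A value near 1 or -1 says the two counts rise and fall together (or in opposition); it says nothing about which one, if either, drives the other.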

Consider for a moment, however, how correlation itself relates to causation. If one thing causes another, the two will generally be correlated. Two things that are correlated, on the other hand, aren't necessarily related by cause. Because causal relationships are a subset of correlated ones, we would be wrong to disregard correlations when we find them. We can't draw a causal conclusion from a correlation alone, but we have still found useful information.

Some correlations are so deeply ingrained in us that we fail to notice them. If a friend calls me and asks whether a fever is a symptom of the flu, I'm going to assume my friend is running an elevated temperature. When a correlation is that dependable, we only need to look at the frequency of one variable to draw a conclusion about the other. This is, in fact, exactly what Google did to detect the spread of the flu in a paper published in Nature. They simply looked at the frequency of search queries related to flu symptoms. After all, we assume a query about flu symptoms will generally be spurred by the onset of early symptoms. Of course, a single person could be wrong: maybe it's not the flu but food poisoning, or maybe it's hypochondria. But when search queries start clustering within limited regions, you can detect an outbreak early, which means you can also get a head start on intervention.
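The regional reasoning above can be sketched in a few lines. This is not the method from the Google paper, only a simplified illustration of the idea: compare each region's latest query count with its own recent average and flag the regions that jump. The function name, the threshold, and the counts are all hypothetical.

```python
from statistics import mean

def flag_regions(weekly_counts, ratio=2.0):
    """Return regions whose latest weekly count is at least `ratio` times
    the average of their earlier weeks."""
    flagged = []
    for region, counts in weekly_counts.items():
        if len(counts) < 2:
            continue  # no history to compare against
        baseline = mean(counts[:-1])
        if baseline > 0 and counts[-1] >= ratio * baseline:
            flagged.append(region)
    return sorted(flagged)

# Hypothetical flu-symptom query counts per region, oldest week first.
counts = {
    "north": [40, 42, 38, 95],
    "south": [55, 60, 58, 61],
    "east": [20, 22, 19, 21],
}

flagged = flag_regions(counts)
print(flagged)
```

Any one searcher in a flagged region may still be mistaken about having the flu; the signal comes from many queries rising together in one place.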

No one is discussing the cause of the flu. The paper published by Google had no need to examine what, exactly, flu symptoms are, what causes the flu, or what people should do when they're sick. It simply looked at a trend, and that trend is useful. Correlation might not always show causation, but correlations and trends are extremely useful to anyone trying to predict future circumstances based on the past. This method of reasoning has been around for many centuries (predicting the movement of the stars based on what they've done before, predicting seasons, etc.), but it was defined as an evidence-based form of reasoning by Ray Solomonoff when he published a paper on inductive reasoning in 1960.

Google's finding about the flu got many people thinking about what else could be assessed using big data and search queries. Marketing tactics often rely heavily on search queries and other correlative factors, whether or not we fully understand them. For example, Google Trends shows an annual cycle of car sales. We can conjecture all we want about what's causing this trend, but in the end we must admit we don't know. However, that doesn't negate the fact that the trend is reliable and the information is useful for someone wondering what time of year is best for a car sale blowout. (Apparently more people buy cars in August than in June, and that's been true every year for the past decade.)
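A seasonal pattern like the one described above can be summarized without knowing its cause. The sketch below averages a value by calendar month across several years and picks the month with the highest average. The function name and the index values are invented for illustration and are not taken from Google Trends.

```python
from collections import defaultdict
from statistics import mean

def monthly_profile(observations):
    """Average the values observed for each calendar month across years.

    `observations` is an iterable of (year, month, value) tuples.
    """
    by_month = defaultdict(list)
    for _year, month, value in observations:
        by_month[month].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}

# Hypothetical interest index values, invented for illustration only.
observations = [
    (2011, 6, 70), (2011, 8, 88),
    (2012, 6, 72), (2012, 8, 91),
    (2013, 6, 69), (2013, 8, 90),
]

profile = monthly_profile(observations)
peak_month = max(profile, key=profile.get)
print(f"peak month: {peak_month}")
```

If the same month comes out on top year after year, the profile is a usable planning input even though it explains nothing about why the cycle exists.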

Of course, the problem with correlations is that those related by cause and effect are only a subset. Many correlations are spurious, and in many others both variables are driven by the same unknown factor. Big data needs to be treated carefully, and that starts with asking the right questions: changing the nature of the question we use to query the data can change the results drastically.
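The shared-factor case is easy to reproduce in a small simulation. In the sketch below, two series never influence each other; both are built from the same hidden factor plus independent noise. All names and parameters are assumptions chosen for illustration. Because the simulation knows exactly how the hidden factor enters each series, it can subtract it out, which is rarely possible with real data.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# The hidden common factor, normally unobserved.
hidden_factor = [rng.gauss(0, 1) for _ in range(n)]
# Each series depends on the hidden factor, never on the other series.
series_a = [h + rng.gauss(0, 0.5) for h in hidden_factor]
series_b = [h + rng.gauss(0, 0.5) for h in hidden_factor]

raw_r = pearson(series_a, series_b)

# Subtract the hidden factor's contribution (known here only because we
# generated the data ourselves), leaving just the independent noise.
resid_a = [a - h for a, h in zip(series_a, hidden_factor)]
resid_b = [b - h for b, h in zip(series_b, hidden_factor)]
adjusted_r = pearson(resid_a, resid_b)

print(f"raw: {raw_r:.2f}, after removing the hidden factor: {adjusted_r:.2f}")
```

The raw correlation is strong, yet it all but disappears once the shared factor is accounted for, which is why the question asked of the data matters as much as the data itself.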

With big data comes great responsibility. Until more advanced algorithms are developed to help us consistently frame the questions we use to analyze vast amounts of data, we must continue to rely on the expertise of data scientists to ask the right questions and draw the correct conclusions.
