“…correlation is charlatanism”

*Photo: AP photo/Richard Drew*

“Anything that relies on correlation is charlatanism” is a great article. But is correlation charlatanism? Yes, it is, but not for the reasons explained in the article. Here is why.

Correlations are present everywhere. The concept of correlation is one of the key constructs of statistics, modelling, simulation. It is used to design portfolios, to estimate risks, in engineering design, in decision-making in biomedical research, in Big Data analysis, VaR, etc., etc. Basically everywhere.

In various industries, Monte Carlo simulations are often used to ‘predict’ the costs and risks of a given project (construction of a plant, drilling in search of natural resources, construction of a motorway, etc.). Simulation models include input variables and, clearly, outputs, such as cost, risk, duration, etc. Input variables may be correlated or not. In many cases, ignoring correlation between inputs leads to serious mis-assessment of risk. The effect of excluding correlations is often more profound than the effect of the choice of distributions.
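To make the point concrete, here is a minimal sketch of such a simulation. It is not taken from any real project: the two cost items, their means and spreads, and the function name `simulate_total_cost` are all made up for illustration, and the inputs are assumed to be normally distributed. The only thing that changes between the two runs is the correlation between the two inputs.

```python
import math
import random
import statistics

def simulate_total_cost(rho, n_trials=20000, seed=1):
    """Monte Carlo total of two hypothetical cost items with input correlation rho."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        z1 = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        # z2 is standard normal and has correlation rho with z1
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * e
        cost_a = 100.0 + 20.0 * z1   # assumed: mean 100, std 20
        cost_b = 150.0 + 30.0 * z2   # assumed: mean 150, std 30
        totals.append(cost_a + cost_b)
    totals.sort()
    p95 = totals[int(0.95 * n_trials)]  # empirical 95th percentile
    return statistics.mean(totals), statistics.stdev(totals), p95

independent = simulate_total_cost(rho=0.0)
correlated = simulate_total_cost(rho=0.8)
print("rho=0.0 (mean, std, p95):", independent)
print("rho=0.8 (mean, std, p95):", correlated)
```

The mean total cost is about the same in both runs, but the spread is not. In theory the standard deviation of the total is √1300 ≈ 36.1 with independent inputs and √2260 ≈ 47.5 with a correlation of 0.8. A model that ignores the correlation therefore reports a narrower cost range and a lower 95th percentile than it should.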

In very large projects, involving thousands of variables, the number of input-input correlations may be huge. Neglecting them can have devastating effects. But that is just one problem.

The other key issue is how correlations are computed. Whether they are input-input, input-output or output-output correlations, they must be computed correctly. A correlation expresses how strongly two variables are interdependent. Mind you, we’re not concerned with the issue of causality here. That is a totally different matter. The point here is to measure the ‘intensity of the interdependency between two variables’.

In our OntoNet QCM engine, instead of computing correlations based on variance, we use an *entropy-based* measure of correlation called *generalized correlation*. Unlike variance, which measures concentration only around the mean, the entropy-based generalized correlation takes into account the actual *distribution of data*. The example below illustrates what we mean: innocent-looking data with a linear look and feel is analyzed. The linear correlation coefficient is 0.92 and the generalized one is 0.76, a full 16 percentage points lower!

So, in the case shown above, the linear correlation coefficient is simply wrong. It induces false optimism, especially when data looks and feels linear-ish. In such cases linear correlation is excessively ‘democratic’. It assigns equal importance to a few scattered and distant points, neglecting the fact that most of the data is clustered somewhere else. Examine the two simple cases shown below.

In both cases the linear correlation coefficient is 100%. In both cases the regression model (provided you need one) is the same, a straight line. And yet the physics behind the two cases is different. In the case on the right there are two clusters, which point to a bifurcation, which in turn indicates discontinuities and non-linearities. If you don’t actually see the data you may be fooled into thinking that both situations are identical. Well, they are not. They represent two totally different systems. But then, who cares about physics.
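The two situations are easy to reproduce with synthetic data. The sketch below uses made-up points (not the data behind the figures): both sets lie on the same hypothetical line `y = 2x + 1`, but one covers the range evenly while the other is split into two distant clusters. The helper names `pearson` and `largest_gap` are just for illustration.

```python
import math

def pearson(xs, ys):
    """Conventional (linear) correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def largest_gap(xs):
    """Largest distance between neighbouring x values: a crude cluster detector."""
    s = sorted(xs)
    return max(b - a for a, b in zip(s, s[1:]))

def line(x):
    return 2.0 * x + 1.0  # the same straight line for both data sets

# Case 1: points spread evenly over [0, 10]
x_uniform = [i / 10 for i in range(101)]
# Case 2: two tight clusters near 0 and near 10, nothing in between
x_clustered = [i / 100 for i in range(50)] + [9.5 + i / 100 for i in range(51)]

r_uniform = pearson(x_uniform, [line(x) for x in x_uniform])
r_clustered = pearson(x_clustered, [line(x) for x in x_clustered])

print("r, evenly spread:", r_uniform, " largest gap:", largest_gap(x_uniform))
print("r, two clusters: ", r_clustered, " largest gap:", largest_gap(x_clustered))
```

Both coefficients come out equal to 1 up to rounding, so the correlation coefficient alone cannot tell the two cases apart. The gap in the second data set, roughly 9 units wide against a spacing of 0.1 in the first, is only visible if you look at how the points are actually distributed.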

The lesson here is very simple: linear correlation may be used *only* when it may be used. If you don’t actually look at your data, if you don’t analyze it ‘visually’ (the process is sometimes called ‘chi-by-eye’) you could be in serious trouble.

In the case shown below, which is equally dramatic, the difference between a conventional linear correlation and a generalized correlation is well over 90%! In actual fact, conventional correlation says that the two variables in question are independent, while in reality they are very strongly interdependent.
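A situation like this can be imitated with points on a circle. The sketch below is *not* the generalized correlation used in OntoNet, whose details are not given here. As a stand-in it uses a plain histogram estimate of mutual information, which is one common entropy-based dependence measure, and rescales it to the 0 to 1 range with Linfoot's informational coefficient of correlation, √(1 − e^(−2I)). The bin count, sample size and function names are arbitrary choices for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Conventional (linear) correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mutual_information(xs, ys, bins=10):
    """Histogram estimate of mutual information in nats (assumed stand-in measure)."""
    n = len(xs)
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def bin_index(v, lo, hi):
        # equal-width bins; the maximum value falls into the last bin
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    joint = {}
    px = [0] * bins
    py = [0] * bins
    for x, y in zip(xs, ys):
        i = bin_index(x, x_lo, x_hi)
        j = bin_index(y, y_lo, y_hi)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] += 1
        py[j] += 1
    # sum over occupied cells of p_ij * log(p_ij / (p_i * p_j))
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

def informational_correlation(mi):
    """Linfoot's rescaling of mutual information to the range [0, 1)."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

n = 2000
circle_x = [math.cos(2.0 * math.pi * i / n) for i in range(n)]
circle_y = [math.sin(2.0 * math.pi * i / n) for i in range(n)]

rng = random.Random(0)
indep_x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
indep_y = [rng.uniform(-1.0, 1.0) for _ in range(n)]

r_circle = pearson(circle_x, circle_y)
mi_circle = mutual_information(circle_x, circle_y)
mi_indep = mutual_information(indep_x, indep_y)

print("circle: linear r =", r_circle)
print("circle: entropy-based =", informational_correlation(mi_circle))
print("independent: entropy-based =", informational_correlation(mi_indep))
```

On the circle the linear coefficient is zero up to rounding, while the entropy-based figure is close to 1. For the genuinely independent sample the estimated mutual information is close to zero. The histogram estimate is slightly biased upwards and depends on the number of bins, so treat the numbers as indicative only.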

So far, we’ve been considering innocent-looking data: lines and circles. Now take a look at this (this is *real* data from stock markets, showing the price of one stock versus another):

Think how much damage linear correlation can do in such cases.

Consider now that in your simulation model, or a portfolio design system, there are tens of thousands of correlations. Suppose that they are all off by 5% or 10% or even 20%. Think of the consequences. In general, overly optimistic values of correlation will yield less risk, less spread in terms of (project) costs, less uncertainty. You will think that the cost of your project is under control, that you can deliver on time and that the risk of things going wrong is acceptable. You’re walking on ice which is thinner than you think.
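A back-of-the-envelope calculation shows how quickly such errors add up. It rests on a deliberately simplified assumption that no real model satisfies: `n_items` inputs with the same standard deviation `sigma` and the same pairwise correlation `rho`, so that the variance of their sum is n·σ²·(1 + (n − 1)·ρ). The numbers (100 items, correlations of 0.30 and 0.20) are invented for illustration.

```python
import math

def total_std(n_items, sigma, rho):
    """Std dev of a sum of n_items inputs with equal std sigma and common correlation rho."""
    variance = n_items * sigma ** 2 * (1.0 + (n_items - 1) * rho)
    return math.sqrt(variance)

# Hypothetical case: the true correlation is 0.30 but the model assumes 0.20
true_spread = total_std(100, 1.0, 0.30)
assumed_spread = total_std(100, 1.0, 0.20)
shortfall = 1.0 - assumed_spread / true_spread

print("true spread:   ", true_spread)
print("assumed spread:", assumed_spread)
print("understatement:", shortfall)
```

With these made-up numbers the model understates the spread of the total by a little under 18%, even though each individual correlation is off by only 0.10. With no correlation at all the spread would be 10, which is less than a fifth of the true value.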

The point here is not just to compute risk more accurately. Risk is not a well-defined quantity, so how can it be computed accurately in the first place? The point is to avoid *gross* mis-calculations. Why are (almost) all construction projects late, with huge cost overruns? The Sydney Opera House was completed ten years late and more than fourteen times over budget. Other examples include the F-35 fighter, Berlin Brandenburg Airport and the UK’s NHS Connecting for Health project (“This is the biggest IT project in the world and it is turning into the biggest disaster.”). Huge project complexity is certainly one cause. Optimistic values of correlation in cost and risk estimation models are another. More soon.
