The Intuition Behind Correlation

What does it really mean for two variables to be correlated? We’ll answer that question in this section. We’ll also develop an intuitive feel for the equation for Pearson’s correlation coefficient.

When you dive into the sea of knowledge that is data science, one of the first fish you spot is correlation and its cousin, auto-correlation. Unless you take some time out to get to know them, it is impossible to get much done in data science. So let’s get to know them.

In the most general sense, a correlation between two variables can be thought of as some kind of relationship between them: when one variable’s value changes, the other one’s value changes in a predictable manner, most of the time.

In practice, the word correlation is usually used to describe linear relationships (and sometimes, nonlinear relationships) between variables.

We’ll come to the linearity aspect in a minute.

Meanwhile, here is an example of two possibly correlated variables. We say ‘possibly’ because it is a hypothesis that must be tested and proven.

Let’s draft a few informal definitions.

Linear relationships

Linear Correlation: If the values of two correlated variables change at a constant rate with respect to each other, they are said to have a linear correlation with each other.

If the correlation in this case is linear, a Linear Regression Model (i.e. a straight line), upon being fitted to the data, ought to be able to adequately explain the linear signal in this data set. Here is what the fitted model (black line) looks like for this data set:

In the above example you can now use the fitted model to predict Highway MPG values corresponding to City MPG values that the model has not seen but which are within the range of the training data set.

Here is the plot of the predictions of the fitted Linear Model on a hold-out data set which contains 20% of the original data that the model did not see during the fitting process.

For the programmatically inclined, the following Python code produced these results:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

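# df is assumed to be a pandas DataFrame already loaded from the data set,
# with 'City MPG' as its first column and 'Highway MPG' as its second

# Plot the original data set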
df.plot.scatter(x='City MPG', y='Highway MPG')
plt.show()

# Create the Train and Test datasets for the Linear Regression Model
X = df.iloc[:, 0:1].values
y = df.iloc[:, 1:2].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Use all the default params while creating the linear regressor
lin_reg = LinearRegression()
#Train the regressor on the training data set
lin_reg.fit(X_train, y_train)

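# Print the coefficient of determination (R-squared) for the training data set.
# Note: score() returns R-squared, not r. For a one-variable linear fit, R-squared equals r squared.
print('R-squared='+str(lin_reg.score(X_train, y_train)))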

# Plot the regression line superimposed on the training dataset
plt.xlabel('City MPG')
plt.ylabel('Highway MPG')
plt.scatter(X_train, y_train, color = 'blue')
plt.plot(X_train, lin_reg.predict(X_train), color = 'black')
plt.show()

# Plot the predicted and actual values for the holdout dataset
plt.xlabel('City MPG')
plt.ylabel('Highway MPG')
actuals = plt.scatter(X_test, y_test, marker='o', color = 'lightblue', label='Actual values')
predicted = plt.scatter(X_test, lin_reg.predict(X_test), marker='+', color = 'black', label='Predicted values')
plt.legend(handles=[predicted, actuals])
plt.show()
```

You can get the data used in the example from here. If you use this data in your work be sure to do a shout-out to the folks at the UC Irvine ML repository.

Nonlinear relationships

Now let’s look at nonlinear relationships.

Nonlinear correlation: If the values of correlated variables do not change at a constant rate with respect to each other, they are said to have a nonlinear relationship or a nonlinear correlation with each other.

Here is an example of what looks like a case for nonlinear correlation.

Unless one transforms the dependent variable (Highway MPG in our example) so as to make the relationship linear, a Linear Regression Model will not be able to adequately ‘explain’ the information contained in such nonlinear relationships.
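To make the effect of such a transformation concrete, here is a minimal sketch on synthetic data (an exponential curve invented for illustration, not the MPG data). The helper name r_squared_of_line_fit is hypothetical. It fits a straight line to the raw values and then to the log-transformed dependent variable, and compares how much of the variation each fit explains.

```python
import numpy as np

def r_squared_of_line_fit(x, y):
    """Fit y = slope * x + intercept by least squares and return R-squared."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)            # variation the line fails to explain
    ss_tot = np.sum((y - y.mean()) ** 2)       # total variation in y
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption for illustration): y grows exponentially with x
x = np.linspace(1.0, 10.0, 50)
y = np.exp(0.5 * x)

r2_raw = r_squared_of_line_fit(x, y)           # straight line forced through a curve
r2_log = r_squared_of_line_fit(x, np.log(y))   # log(y) = 0.5 * x is exactly linear

print(f"R-squared with raw y:  {r2_raw:.3f}")
print(f"R-squared with log(y): {r2_log:.3f}")
```

In this contrived case the log transform makes the relationship perfectly linear, so the second fit explains essentially all of the variation. Real data will rarely be this clean, and the right transformation depends on the shape of the curve.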

Positive and negative correlation

Positive Correlation: Two correlated variables are said to be positively correlated if, most of the time, an increase (or decrease) in one variable’s value is accompanied by an increase (or decrease) in the other variable’s value.

Here is an example that suggests a positive correlation between the two variables:

Negative Correlation: Two correlated variables are said to be negatively correlated if, most of the time, an increase (or decrease) in one variable’s value is accompanied by a decrease (or increase) in the other variable’s value.

Here is an example that suggests a negative correlation:

How to measure correlation

Let’s look at the following two scatter plots.

Both plots seem to suggest a positive correlation between the respective variables. But the correlation is stronger in the first plot as the points are more tightly packed together along an imaginary straight line slicing through the points.

The coefficient of correlation between two variables quantifies how tightly the movements of the two variables are coupled with each other.

The formula for the Pearson’s coefficient

The formula for the coefficient of correlation between two variables X and Y that have a linear relationship is:

r = Cov(X, Y) / (σ_X · σ_Y)

The two sigmas in the denominator are the standard deviations of the respective variables. We’ll dissect Covariance in a bit.

Meanwhile, note that when calculated using the above formula, the coefficient of correlation is called Pearson’s coefficient of correlation. It is represented by the symbol ‘r’ when computed for a sample and by the symbol rho (ρ) when computed for the entire population of values.

If you want to use the ‘population version’ of this formula be sure to use the ‘population formulae’ for the covariance and standard deviation.
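As a rough illustration of why the sample and population formulae should not be mixed, the sketch below computes r both ways with NumPy on a handful of made-up values (the numbers are assumptions for illustration, not taken from the MPG data). The ddof argument selects the divisor: ddof=1 divides by n-1 (sample), ddof=0 divides by n (population).

```python
import numpy as np

# Made-up paired values, for illustration only
x = np.array([21.0, 24.0, 19.0, 30.0, 26.0])
y = np.array([27.0, 30.0, 26.0, 37.0, 31.0])

# Sample version: covariance and standard deviations all divide by n - 1
cov_sample = np.cov(x, y, ddof=1)[0, 1]
r_sample = cov_sample / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Population version: covariance and standard deviations all divide by n
cov_pop = np.cov(x, y, ddof=0)[0, 1]
r_pop = cov_pop / (np.std(x, ddof=0) * np.std(y, ddof=0))

# Mixing the two (sample covariance, population standard deviations)
# inflates the result by a factor of n / (n - 1)
r_mixed = cov_sample / (np.std(x, ddof=0) * np.std(y, ddof=0))

print("r (sample formulae):    ", r_sample)
print("r (population formulae):", r_pop)
print("r (mixed, incorrect):   ", r_mixed)
print("np.corrcoef:            ", np.corrcoef(x, y)[0, 1])
```

Used consistently, the two versions give the same value because the divisors cancel. Mixed, the result is off by n / (n - 1) and, for strongly correlated data like this, can even land outside the valid range of r.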

Interpreting the value of r

- The value of r (or rho) ranges continuously over the interval [-1, +1].
- When the variables are negatively correlated, r lies in [-1, 0).
- r = -1 implies a perfect negative correlation.
- When they are positively correlated, r lies in (0, +1].
- r = +1 implies a perfect positive correlation.
- When r = 0, the variables are not linearly correlated.
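These cases can be checked on a few toy series. The sketch below uses NumPy’s corrcoef on invented data: two perfectly linear relationships and one parabola. The parabola is there to show that r = 0 only rules out a linear relationship, not a relationship altogether.

```python
import numpy as np

# Toy data, invented for illustration: x is symmetric around zero
x = np.arange(-5, 6, dtype=float)            # -5, -4, ..., 4, 5

r_rising = np.corrcoef(x, 2 * x + 1)[0, 1]    # perfectly linear, rising
r_falling = np.corrcoef(x, -3 * x + 4)[0, 1]  # perfectly linear, falling
r_parabola = np.corrcoef(x, x ** 2)[0, 1]     # y is fully determined by x, but not linearly

print("rising line: ", r_rising)
print("falling line:", r_falling)
print("parabola:    ", r_parabola)
```

The two straight lines give r at the two ends of the range, regardless of their slopes. The parabola gives r of zero even though y is completely determined by x, because the rising and falling halves cancel each other out.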

Now let’s get back to understanding the covariance term in the numerator.

Intuition for the Pearson’s coefficient formula

To really understand what’s going on inside the Pearson’s formula one must first understand covariance. Just like correlation, the covariance between two variables measures how tightly the values of the two variables are coupled.

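When used for measuring the tightness of a linear relationship, covariance is calculated using the following formulae:

Sample covariance: Cov(X, Y) = Σ (x_i - x̄)(y_i - ȳ) / (n - 1)
Population covariance: Cov(X, Y) = Σ (x_i - μ_X)(y_i - μ_Y) / n

In both formulae the sum runs over all n pairs of values (x_i, y_i).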

Let’s break down these formulae term by term:

As mentioned before, covariance measures how synchronously the values of variables change w.r.t. each other. Since we want to measure the change in value, the change must be anchored with respect to a fixed value. That fixed value is the mean of that variable’s data series. For the sample covariance, we use the sample mean, and for the population covariance, we use the population mean. Using the mean as the goal post also centers each value around its mean. This explains the subtraction of the respective means from X and Y in the numerator.

The multiplication of the centered values in the numerator ensures that the product is positive when X and Y both rise or both fall with respect to their respective means. If X rises above its mean but Y falls below its mean (or vice versa), the product is negative.

The summation in the numerator ensures that if the positive valued products more or less balance off the negative valued products, the net sum is going to be a tiny number, implying that there is no dominant positive or negative pattern in the way the two variables are moving w.r.t. each other. In this case the covariance value will be small. On the other hand, if the positive products dominate over the negative products, the sum will be a large positive number, and if the negative products dominate, it will be a large negative number, signifying a net positive or a net negative pattern of movement between the two variables.

Finally, the n or the (n-1) in the denominator averages things out over the available degrees of freedom. In the sample, one degree is used up by the sample mean so we divide by (n-1).
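The term-by-term breakdown above can be followed on a tiny example. The sketch below uses four made-up pairs of values (an assumption for illustration) and builds the covariance step by step: center, multiply, sum, divide.

```python
import numpy as np

# Four made-up pairs of values, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0])    # mean is 5
y = np.array([1.0, 4.0, 2.0, 5.0])    # mean is 3

# Step 1: center each series around its own mean
dx = x - x.mean()                      # [-3, -1,  1,  3]
dy = y - y.mean()                      # [-2,  1, -1,  2]

# Step 2: multiply the centered values pair by pair.
# Positive when both are on the same side of their means, negative otherwise.
products = dx * dy                     # [ 6, -1, -1,  6]

# Step 3: sum the products; here the positive ones dominate
total = products.sum()                 # 10

# Step 4: average over the degrees of freedom
n = len(x)
sample_cov = total / (n - 1)           # 10 / 3
population_cov = total / n             # 10 / 4

print("products:             ", products)
print("sample covariance:    ", sample_cov)
print("population covariance:", population_cov)
```

The result of the manual calculation matches what np.cov(x, y) returns, since NumPy divides by n - 1 by default.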

Covariance is wonderful, but…

Covariance is a wonderful way to quantify the movement of variables with respect to each other but it has some problems.

Differing units: Covariance is difficult to interpret when the units of the two variables are different. For instance if X is in dollars and Y is in pounds sterling, the unit of covariance between X and Y becomes dollar times pound sterling. How can one possibly interpret that? Even when both X and Y have the same unit, say dollar, the unit of covariance becomes…dollar times dollar! Still not easy to understand. Bummer!

Differing scales: There is also the problem of range. When X and Y vary over a small interval, say [0,1] you will get a deceptively tiny covariance value even if X and Y move together very tightly.

Difficulty with comparison: Because X and Y can have different units and a different range, it is often impossible to objectively compare the covariance between one pair of variables with that of another pair of variables. Say I want to compare how much stronger or weaker the linear relation between a vehicle’s fuel economy and its length is, as compared to the relation between its fuel economy and curb weight. Using covariance to do this comparison will require me to compare two values in two different units and two different ranges. Problematic, to say the least.
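The scale problem is easy to reproduce. The sketch below generates synthetic values in [0, 1] that move together tightly (the data and the noise level are assumptions for illustration), then expresses the same data in units 1000 times smaller. It prints the covariance and Pearson’s r for both versions.

```python
import numpy as np

# Synthetic data, for illustration only: y tracks x very closely
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = x + rng.normal(0.0, 0.05, size=200)

# The same data expressed in units that are 1000 times smaller
x_rescaled = 1000 * x
y_rescaled = 1000 * y

cov_original = np.cov(x, y)[0, 1]
cov_rescaled = np.cov(x_rescaled, y_rescaled)[0, 1]

r_original = np.corrcoef(x, y)[0, 1]
r_rescaled = np.corrcoef(x_rescaled, y_rescaled)[0, 1]

print(f"covariance, original units: {cov_original:.4f}")
print(f"covariance, rescaled units: {cov_rescaled:.1f}")
print(f"r, original units:          {r_original:.4f}")
print(f"r, rescaled units:          {r_rescaled:.4f}")
```

The covariance changes by a factor of a million purely because of the change in units, while r stays the same. The relationship between the two variables is identical in both cases.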

If only we could re-scale the covariance so that the range is standardized and also solve its ‘units’ problem. Enter ‘standard deviation’. In simple terms, standard deviation measures the average departure of the data from its mean. Standard deviation also has the nice property that it has the same unit as the original variable. So let’s divide the covariance by the standard deviations of the two variables. Doing so will re-scale the covariance so that it is now expressed in multiples of standard deviation, and it will also cancel out the units of measurement from the numerator. All troubles with covariance solved in two simple divisions! Here is the resulting formula:

Cov(X, Y) / (σ_X · σ_Y)

Now where have we seen this formula before? It is of course the Pearson’s correlation coefficient!

Auto-correlation

Auto-correlation, or self-correlation, is the correlation of a variable with the value that the variable took on k units (of time) in the past. For example, the air temperature of a place might be auto-correlated with the air temperature of the same place 12 months ago. Auto-correlation has meaning for variables which are indexed to a scale that can be ordered, i.e. an ordinal scale. The time scale is an example of an ordinal scale.

Just like correlation, auto-correlation can be linear or nonlinear, positive or negative, or it can be zero.

The formula for auto-correlation when used for a linearly auto-correlated relationship between a variable and a k-lagged version of itself is as follows:
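One widely used form of the lag-k sample auto-correlation, given here as an assumption since several variants exist, is:

r_k = Σ_{t=k+1..n} (Y_t - Ȳ)(Y_{t-k} - Ȳ) / Σ_{t=1..n} (Y_t - Ȳ)²

It is the Pearson idea applied to a series and its own past: the numerator is the covariance-like sum of lagged cross-products, and the denominator is the series’ total squared deviation from its mean. A minimal sketch, using a hypothetical helper named autocorr and a synthetic 12-month cycle rather than real temperature data:

```python
import numpy as np

def autocorr(y, k):
    """Lag-k sample auto-correlation of a 1-D series, per the formula above."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()                   # deviations from the overall mean
    if k == 0:
        return 1.0                     # a series is perfectly correlated with itself
    # d[k:] holds Y_t for t = k+1..n, d[:-k] holds the matching Y_(t-k)
    return np.sum(d[k:] * d[:-k]) / np.sum(d ** 2)

# Synthetic monthly series with a 12-month cycle (an assumption for illustration)
months = np.arange(240)                # 20 years of monthly values
temps = 15.0 + 10.0 * np.sin(2 * np.pi * months / 12)

r_12 = autocorr(temps, 12)             # same month, one year earlier
r_6 = autocorr(temps, 6)               # opposite season

print(f"lag 12: {r_12:.3f}")
print(f"lag 6:  {r_6:.3f}")
```

For a purely seasonal series like this one, the lag 12 value is strongly positive and the lag 6 value is strongly negative. Because the numerator has fewer terms than the denominator, this form of the coefficient shrinks slightly toward zero as k grows.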

Let’s develop our understanding of auto-correlation a little further by looking at another data set:

The above plot shows the monthly average maximum temperature of the city of Boston. It is calculated by averaging over each month, the daily maximum temperature recorded by a weather station in that month, taken over a period that stretches from January 1998 through June 2019.

Let’s plot the temperature against a time lagged version of itself for various lags.

The LAG 12 plot shows a strong positive linear relationship between the average maximum temperature for a month and the average maximum of the same month one year ago.

There is also a strong negative auto-correlation between data points that are six months apart i.e. at LAG 6.

Overall there is a strong seasonal signal in this data as one might expect to find in weather data of this kind.

Following is the auto-correlation heat map showing the correlation between every combination of T and T-k. For us the column of interest is outlined in blue.

Within the first column, the square of interest is the one at (Monthly Average Maximum, TMINUS12) and maybe the one at (Monthly Average Maximum, TMINUS6). Now if you refer back to the scatter plot collage, you will notice that the relationship for all other combinations of lags is nonlinear. So in any linear seasonal model we will attempt to build for this data, the utility of the correlation coefficient values that were generated for these nonlinear relationships (i.e. for the remaining squares in the heat map) is severely limited and they should not be used even if some of them have large values.
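A correlation matrix of this kind can be sketched with pandas by placing the series next to shifted copies of itself. The example below is an assumption-laden illustration: it uses a synthetic noisy seasonal series rather than the Boston data, and borrows the column names seen in the heat map (Monthly Average Maximum, TMINUS1 through TMINUS12).

```python
import numpy as np
import pandas as pd

# Synthetic noisy seasonal series (an assumption, not the Boston data)
rng = np.random.default_rng(1)
months = np.arange(240)
values = 15.0 + 10.0 * np.sin(2 * np.pi * months / 12) + rng.normal(0.0, 1.0, 240)
series = pd.Series(values)

# Place the series next to its lagged copies, one column per lag
lagged = pd.DataFrame({'Monthly Average Maximum': series})
for k in range(1, 13):
    lagged[f'TMINUS{k}'] = series.shift(k)

# The first k rows of each TMINUSk column have no past value, so drop them
lagged = lagged.dropna()

# Pearson correlation between every pair of columns
corr_matrix = lagged.corr()

# The column of interest: the current value against each of its lags
first_column = corr_matrix['Monthly Average Maximum']
print(first_column.round(2))
```

The resulting matrix is what a heat map would color in. As in the discussion above, only the entries whose underlying scatter plots look linear should be read as meaningful measures of linear association.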

Remember that (auto)correlation coefficients, when calculated using the formulae that were mentioned earlier, are useful only when the relationship is linear. If the relationship is nonlinear we need a different method to quantify the strength of the nonlinear relationship. For example, Spearman’s rank correlation coefficient can be used to quantify the strength of the relationship between variables that have a nonlinear, monotonic relationship.
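Spearman’s coefficient is Pearson’s r computed on the ranks of the values rather than on the values themselves. The sketch below shows this on a synthetic monotonic curve (invented for illustration). The ranks helper is hypothetical and assumes there are no tied values. SciPy’s scipy.stats.spearmanr computes the same quantity and also handles ties.

```python
import numpy as np

def ranks(values):
    """Return the rank (1 = smallest) of each value. Assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

# Synthetic data, for illustration: y always rises with x, but not at a constant rate
x = np.linspace(0.0, 5.0, 30)
y = np.exp(2.0 * x)

pearson_r = np.corrcoef(x, y)[0, 1]                    # measures linear association
spearman_rho = np.corrcoef(ranks(x), ranks(y))[0, 1]   # measures monotonic association

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Because y never decreases as x increases, the ranks line up perfectly and Spearman’s coefficient reaches its maximum, while Pearson’s r is noticeably lower because the points do not lie along a straight line.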

Here is the Python code for plotting the temperature time series, the scatter plot collage and the heat map:

And here is the data set.

A word of caution

Finally, a word of caution. A correlation between two variables X and Y, whether it is linear or nonlinear, does not automatically imply a cause-effect relationship between X and Y (whereas a cause-effect relationship will usually show up as some form of correlation). Even when there is a large correlation seen between X and Y, X may not be directly influencing Y or vice versa. Maybe there is a hidden variable, called a confounding variable, that is simultaneously influencing both X and Y so that they rise and fall in sync with each other. For illustration, consider the following graph that shows two data sets plotted against each other.

Here X is a time series that ranges from 1990 to 2016 and contains the fraction of the world’s population that had access to electricity in each of those years. The variable Y is also a time series that ranges from 1990 to 2016 and contains the strength of the world-wide labor force in each of those years.

The two data sets are obviously highly correlated. You be the judge of whether there is any cause and effect!
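The effect of a confounding variable can be simulated. In the sketch below, all data is synthetic and the setup is an assumption for illustration: a hidden variable z drives both x and y, which otherwise have nothing to do with each other. The hypothetical helper remove_linear_effect subtracts the part of a series that a straight-line fit on z can explain.

```python
import numpy as np

def remove_linear_effect(values, driver):
    """Return what is left of `values` after subtracting a straight-line fit on `driver`."""
    coefficients = np.polyfit(driver, values, 1)
    return values - np.polyval(coefficients, driver)

# Synthetic data: z is the hidden (confounding) variable
rng = np.random.default_rng(42)
z = rng.normal(size=500)
x = z + rng.normal(scale=0.5, size=500)        # x depends on z, not on y
y = 2 * z + rng.normal(scale=0.5, size=500)    # y depends on z, not on x

# x and y look strongly correlated...
r_xy = np.corrcoef(x, y)[0, 1]

# ...but once z's influence is removed from both, little correlation remains
r_after = np.corrcoef(remove_linear_effect(x, z), remove_linear_effect(y, z))[0, 1]

print(f"r between x and y:           {r_xy:.3f}")
print(f"r after accounting for z:    {r_after:.3f}")
```

In real data the confounder is often unobserved, so this kind of check is only possible when one has a candidate for the hidden variable and measurements of it.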
