Correlation and Auto-correlation - An Easy Primer on 2 Key Concepts

We’ll develop an intuition for what it means for two variables to be correlated with each other, or for a variable to be correlated with itself. And we’ll learn about the Pearson’s correlation coefficient and how it is calculated.

In the most general sense, a correlation between two variables is some kind of a relationship between them. When one variable’s value changes, the other one’s value is observed to change in a predictable manner.

A trivially obvious example of a correlation between two variables is an exact mathematical equation between them. Consider the following equation of a straight line between two properties of a motor vehicle: its efficiency on the highway and its efficiency on city roads:

Highway MPG = 1.2 × City MPG

The above equation states that the vehicle’s fuel efficiency on the highway is exactly 20% higher than on city roads. In the above equation, the variables City MPG and Highway MPG are perfectly correlated.

Now I don’t have to tell you that such exact equations between two quantities are chimeras: they are practically impossible. One of the many difficulties with such an exact relationship is that for any vehicle, City MPG and Highway MPG act as random variables. Even for the same vehicle, each time you measure its fuel efficiency on the highway and in the city, you’ll get a somewhat different pair of readings for (Highway MPG, City MPG). Thus, Highway MPG will never be exactly 1.2 times, or any other constant factor times, City MPG for every single value of City MPG.
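To see this concretely, here is a small simulation sketch. The data is synthetic: the underlying ratio of 1.2, the noise level, and the variable names are all assumptions made for illustration, not values taken from the automobile data set.

```python
import numpy as np

# Hypothetical readings: a "true" ratio of 1.2 plus measurement noise (both assumed)
rng = np.random.default_rng(0)
city_mpg = rng.uniform(15, 35, size=200)
highway_mpg = 1.2 * city_mpg + rng.normal(0, 1.5, size=200)

# The observed ratio differs from reading to reading
ratios = highway_mpg / city_mpg
print("smallest ratio:", ratios.min())
print("largest ratio: ", ratios.max())
print("average ratio: ", ratios.mean())
```

The individual ratios scatter on both sides of 1.2, while their average stays close to it.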

But that doesn’t mean Highway MPG and City MPG are not correlated in some way. If you plot them, you might see the following type of chart indicating a strong, but patently inexact, linear relationship between them:

So in this case, we assume that Highway MPG and City MPG are linearly correlated.

For such linearly correlated variables, a Linear Regression Model of the following sort will be able to adequately explain the variance in Highway MPG with respect to City MPG:

Highway MPG = β(City MPG) + e

In the above model, the regression coefficient β is the constant rate at which Highway MPG is presumed to change with respect to City MPG. ‘e’ is the regression error. It’s the difference between the observed value of Highway MPG and the value (β times City MPG) estimated by the model.

The black line shows the model fitted on this data set:

A Linear Regression Model fitted to 80% of the data points in the City versus Highway MPG data set (Image by Author)

Once the linear model is fitted (trained) on your data set, you can use the fitted model to predict Highway MPG values that the model has not seen in the training data set. Here’s what the fitted model’s equation looks like:

Highway MPG = β_cap(City MPG) + ϵ

In the fitted model, β_cap is the estimated value of the regression coefficient β, and ‘ϵ’ is the residual. ‘ϵ’ is the difference between the observed value of Highway MPG and the value (β_cap times City MPG) estimated by the fitted model.
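As a rough sketch of how β_cap and the residuals could be computed for a model with no intercept term, the least-squares estimate is the sum of x·y divided by the sum of x². The data below is synthetic and the names city_mpg and highway_mpg are illustrative assumptions. Note that scikit-learn’s LinearRegression also fits an intercept by default, so its numbers will differ somewhat from this simplified form.

```python
import numpy as np

# Synthetic data (assumed): Highway MPG is roughly 1.2 times City MPG plus noise
rng = np.random.default_rng(42)
city_mpg = rng.uniform(15, 40, size=200)
highway_mpg = 1.2 * city_mpg + rng.normal(0, 2.0, size=200)

# Least-squares estimate of beta for the no-intercept model y = beta * x + e
beta_cap = np.sum(city_mpg * highway_mpg) / np.sum(city_mpg ** 2)

# Residuals: observed value minus the value estimated by the fitted model
residuals = highway_mpg - beta_cap * city_mpg

print("beta_cap:", beta_cap)
print("mean absolute residual:", np.mean(np.abs(residuals)))
```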

Here is the plot of the predictions of the fitted Linear Model on a hold-out data set. The hold-out data contains 20% of the data from the original data set and this 20% was left out from the training data set.

Actual versus predicted Highway MPG on the 20% hold-out set (Image by Author)

The following Python code produced these results:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

#Read the data set file
df = pd.read_csv('uciml_auto_city_highway_mpg.csv', header=0)

#Plot the original data set
df.plot.scatter(x='City MPG', y='Highway MPG')
plt.show()

# Create the Train and Test datasets for the Linear Regression Model
X = df.iloc[:, 0:1].values
y = df.iloc[:, 1:2].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Use all the default params while creating the linear regressor
lin_reg = LinearRegression()

#Train the regressor on the training data set
lin_reg.fit(X_train, y_train)

# Print the coefficient of determination (R-squared) for the training dataset
print('R-squared='+str(lin_reg.score(X_train, y_train)))

# Plot the regression line superimposed on the training dataset
plt.xlabel('City MPG')
plt.ylabel('Highway MPG')
plt.scatter(X_train, y_train, color = 'blue')
plt.plot(X_train, lin_reg.predict(X_train), color = 'black')
plt.show()

# Plot the predicted and actual values for the holdout dataset
plt.xlabel('City MPG')
plt.ylabel('Highway MPG')
actuals = plt.scatter(X_test, y_test, marker='o', color = 'lightblue', label='Actual values')
predicted = plt.scatter(X_test, lin_reg.predict(X_test), marker='+', color = 'black', label='Predicted values')
plt.legend(handles=[predicted, actuals])
plt.show()

You can get the data used in the example from here. If you use this data in your work be sure to do a shout-out to the folks at the UC Irvine ML repository.
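A note on the number printed by lin_reg.score: scikit-learn’s LinearRegression.score returns the coefficient of determination R², not the correlation coefficient r. For a straight-line fit with an intercept, R² on the training data equals the square of Pearson’s r. The sketch below checks this on synthetic data (the data and names are assumptions, and NumPy’s polyfit stands in for the regressor).

```python
import numpy as np

# Synthetic stand-in for the (City MPG, Highway MPG) pairs
rng = np.random.default_rng(0)
x = rng.uniform(15, 40, size=150)
y = 3.0 + 1.2 * x + rng.normal(0, 2.0, size=150)

# Fit a straight line with an intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]

print("R-squared:", r_squared)
print("r squared:", r ** 2)
```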

Let’s briefly look at nonlinear relationships. If the value of a variable changes, but not at a constant rate, with respect to that of another variable, the two variables are nonlinearly correlated with each other.

Here is an example of what looks like a case of nonlinear correlation.

Example of a nonlinear relationship (Image by Author)

Unless you can transform the dependent variable (Highway MPG) using a function such as log or square root to make the relation linear, you won’t be able to use a Linear Regression Model to model such a nonlinear relationship. Alternatively, you may use one of the many different types of nonlinear regression models such as the Nonlinear Least Squares model, or a suitable member of the family of Generalized Linear Models, or even a Neural Network model.
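Here is a minimal sketch of the transformation idea. It assumes a synthetic relationship in which y decays exponentially with x, so taking the log of y makes the relationship linear. The constants 50 and -0.03 and the noise level are invented for the example.

```python
import numpy as np

# Synthetic nonlinear data (assumed): exponential decay with multiplicative noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=200)
y = 50.0 * np.exp(-0.03 * x) * np.exp(rng.normal(0, 0.05, size=200))

# Linear correlation before and after log-transforming the dependent variable
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

# A straight line now fits log(y) well; its slope recovers the decay rate
slope, intercept = np.polyfit(x, np.log(y), deg=1)

print("r between x and y:     ", r_raw)
print("r between x and log(y):", r_log)
print("fitted slope:", slope)
```

Whether log, square root, or some other function is the right choice depends on the shape of the data, so it is worth plotting the transformed variable before fitting.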

Positive and negative correlation

If the values of two variables vary in the same direction they are positively correlated.

Here is some data that suggests a positive correlation between the length and the weight of a passenger vehicle:

Two variables that appear to be positively correlated. Data source: The Automobile Data Set, UC Irvine ML Repository (Image by Author)

If a linear model of the type Length = β(Curb Weight) + e were to be fitted on the above data, the regression coefficient β would be positive valued.

On the other hand, if the values of two variables vary in opposite directions, they are negatively correlated.

Here is an example of a negative correlation between the fuel efficiency and the weight of a vehicle. If a linear model of the type Highway MPG = β(Curb Weight) + e were to be fitted on this data, the regression coefficient β would be negative valued.

Two variables that appear to be negatively correlated. Data source: The Automobile Data Set, UC Irvine ML Repository (Image by Author)

How to measure correlation

Let’s look at the following two scatter plots.

Two examples of positively correlated variables (Image by Author)

Although both plots suggest a positive correlation between the respective variables, the correlation is clearly stronger in the first plot. Can we quantify the strength of this correlation?

Scientists have invented a number of measures to quantify the strength of the correlation between two random variables whose ranges are numerical (as opposed to categorical) and ordered. Chief among them are Pearson’s coefficient of correlation, Spearman’s rank coefficient, Kendall’s Tau, Distance correlation, Mutual Information, and Hoeffding’s D.
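To get a feel for how two of these measures differ, here is a small sketch that computes Pearson’s coefficient and Spearman’s rank coefficient with NumPy alone. Spearman’s coefficient is computed as Pearson’s coefficient applied to the ranks of the values. The helper names are illustrative and the rank helper assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    # Pearson's coefficient via NumPy's correlation matrix
    return np.corrcoef(x, y)[0, 1]

def ranks(values):
    # Rank of each value (0 = smallest); assumes no ties
    return np.argsort(np.argsort(values))

def spearman(x, y):
    # Spearman's coefficient is Pearson's coefficient of the ranks
    return pearson(ranks(x), ranks(y))

# A relationship that is perfectly monotonic but far from a straight line
x = np.linspace(0, 5, 50)
y = np.exp(x)

print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Because y always increases with x, the rank-based measure reports a perfect relationship, while Pearson’s coefficient is pulled below 1 by the curvature.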

The Pearson’s coefficient of correlation, also known as the Pearson’s ‘r’ or Pearson’s ‘ρ’ (rho) is the most widely used measure for quantifying linear relationships between two random variables with numerical and ordered ranges.

The formula for the Pearson’s coefficient

The formula for the Pearson’s coefficient of correlation between two variables that have a linear relationship is:

ρ(X, Y) = Cov(X, Y) / (σ_X × σ_Y)

The two sigmas in the denominator are the standard deviations of the respective variables. I’ll dissect the Covariance term in the numerator in a bit. The above formula represents the Pearson’s coefficient of correlation as applied to the population.

When the correlation coefficient is calculated using the data in a sample, the Pearson’s coefficient of correlation is represented by the symbol ‘r’. In this case, the population standard deviation σ is replaced with the sample standard deviation s. The formula for covariance is also adjusted to reflect sample covariance:

r(X, Y) = Cov(X, Y) / (s_X × s_Y), where Cov(X, Y) is the sample covariance

Formula for the Pearson’s coefficient of correlation r between random variables X and Y when calculated for a random sample (Image by Author)

When you use the above formula for calculating the Pearson’s r, you will get a value of r that lies between -1.0 and +1.0. When the two variables are negatively correlated, r will be negative, with -1.0 indicating a perfect negative linear correlation. When the variables are positively correlated, r will be positive, with +1.0 indicating a perfect positive linear correlation. r = 0 signifies no linear correlation between the two variables. It does not by itself imply that the two variables are independent, because they may still be related in a nonlinear way. The further away r is from 0 in either direction, the stronger the linear correlation, positive or negative, between the two variables.
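The following sketch computes the sample Pearson’s r directly from the sample covariance and the sample standard deviations, then tries it on three made-up relationships. The function name pearson_r and the data are assumptions for illustration. The last case is a reminder that r measures only linear association: a variable that depends entirely on another can still give r close to 0.

```python
import numpy as np

def pearson_r(x, y):
    # Sample covariance divided by the product of the sample standard deviations
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))

x = np.linspace(-3, 3, 61)

r_pos = pearson_r(x, 2.0 * x + 1.0)   # exact increasing straight line
r_neg = pearson_r(x, -0.5 * x)        # exact decreasing straight line
r_quad = pearson_r(x, x ** 2)         # fully dependent on x, but not linearly

print("increasing line:", r_pos)
print("decreasing line:", r_neg)
print("parabola:       ", r_quad)
```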

Intuition for the Pearson’s coefficient formula

To really understand what’s going on inside the Pearson’s formula, one must first understand covariance. Just like correlation, the covariance between two variables measures how tightly coupled the values of the two variables are.

When used for measuring the tightness of a linear relationship, covariance is calculated using the following formulae:

Population covariance: Cov(X, Y) = (1/n) × Σ (x_i – μ_X)(y_i – μ_Y)
Sample covariance: Cov(X, Y) = (1/(n-1)) × Σ (x_i – x_bar)(y_i – y_bar)

In the above formulae, x_bar and y_bar are the sample means of X and Y respectively, while μ_X and μ_Y are the population means of X and Y respectively. The sum runs over i = 1 to n, where n is the number of data points.

Let’s deconstruct these formulae term by term.

Covariance measures how synchronously the values of random variables change w.r.t. each other. Since we want to measure the change in value, we must anchor the change with respect to a fixed value. That fixed value is the mean of that random variable. For the sample covariance, we use the sample mean, and for the population covariance, we use the population mean as the anchor. The mean as the goal post allows us to center each observed value of the random variable around this value. Thus, (x_i – x_bar) and (y_i – y_bar) are the mean-centered versions of the data point (x_i, y_i).

The multiplication of the mean-centered values in the numerator ensures that the product is positive when X and Y both rise or both fall at the same time with respect to their respective means. If X rises but Y falls below the respective mean or vice-versa the product (x_i – x_bar)(y_i – y_bar) is negative.

If the positive valued product terms more or less balance off the negative valued products, their summation in the numerator will be a tiny number close to 0. That would imply that there is no dominant positive or negative pattern in the way the two variables are moving w.r.t. each other. In that case the covariance between the two random variables will be small. On the other hand if the positive products dominate over the negative products or vice-versa, then the sum will be respectively a significant positive or negative number signifying a net positive or negative pattern of movement between the two variables.

Finally, the n or the (n-1) in the denominator averages things out over the available degrees of freedom. When the covariance is calculated for a random sample, one degree of freedom is used up by the sample mean, so we divide by (n-1).
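The steps above can be sketched in a few lines. The covariance helper and the five data points are made up for illustration, and the results are cross-checked against NumPy’s np.cov, whose ddof argument selects the divisor (n - ddof).

```python
import numpy as np

def covariance(x, y, sample=True):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Mean-center each variable, multiply pairwise, and sum the products
    total = np.sum((x - x.mean()) * (y - y.mean()))
    # Divide by (n - 1) for a sample, or by n for a population
    denom = len(x) - 1 if sample else len(x)
    return total / denom

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_sample = covariance(x, y, sample=True)       # 6 / 4 = 1.5
cov_population = covariance(x, y, sample=False)  # 6 / 5 = 1.2

print("sample covariance:    ", cov_sample)
print("population covariance:", cov_population)
```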

Covariance is wonderful, but…

Covariance is a wonderful technique to quantify the movement of variables with respect to each other but it has some problems.

Difficulty with interpreting units of covariance: Covariance is difficult to interpret when the units of the two variables are different. For instance, if X is in dollars and Y is in pounds sterling, the unit of Covariance(X, Y) is dollar times pound sterling. How do you even begin to interpret that? Even when both X and Y have the same unit, say dollar, the unit of covariance becomes…dollar squared. Dollar² is certainly not any easier to interpret than dollar times pound sterling.

The problem of scale: When X and Y vary over a small interval, say [0,1] you will get a deceptively tiny covariance reading even if X and Y move together very tightly.

Difficulty with comparing two covariances: Because X and Y can have different units and a different range, it is often impossible to objectively compare the covariance between one pair of variables with that of another pair of variables. For instance, consider the following question:

Is the vehicle’s fuel economy more strongly correlated with its length than with its curb weight?

To try to answer this question, you would calculate the covariance of fuel economy and length and get a covariance value with units MPG-meters. Next, you’d calculate the covariance of fuel economy with curb weight, which will yield a value with units MPG-KG. Moreover, each covariance reading will vary over its own range of values. How do you compare such covariance values with different ranges and units? It will be problematic, to say the least.

If only we could re-scale the covariance to standardize the range and also somehow solve the differing ‘units’ problem.

Enter ‘standard deviation’. In simple terms, standard deviation measures the average departure of the data from its mean. Standard deviation also has the nice property of having the same unit as the original variable. For example, the standard deviation of fuel economy has the unit MPG and that of curb weight has the unit Kilograms. Isn’t that neat?

So let’s divide the covariance by the product of the standard deviations of the two variables. Doing so will do two things:

It will re-scale the covariance so that it is now expressed in multiples of standard deviation.
Dividing the covariance by the product of the standard deviations will also cancel out the units of measurement, so that the net result becomes a dimensionless quantity. For example, the units MPG-KG, which are the units of the covariance of fuel economy and curb weight, are canceled out when divided by the product of the standard deviations of fuel economy and curb weight, which also has units MPG-KG.

In a single mathematical masterstroke, we’ve eliminated both troubles we had with covariance. Here is the resulting formula:

ρ(X, Y) = Cov(X, Y) / (σ_X × σ_Y)

Now where have we seen this formula before? It is the Pearson’s correlation coefficient!
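A quick way to convince yourself that the division really removes the units problem is to change the unit of one variable and watch what happens. In the sketch below the data is synthetic and the names are assumptions. Curb weight is expressed first in kilograms and then in pounds.

```python
import numpy as np

# Synthetic data (assumed): heavier vehicles get fewer miles per gallon
rng = np.random.default_rng(3)
weight_kg = rng.uniform(900, 2000, size=100)
mpg = 50.0 - 0.015 * weight_kg + rng.normal(0, 2.0, size=100)

# The same weights expressed in pounds
KG_TO_LB = 2.20462
weight_lb = weight_kg * KG_TO_LB

# Covariance changes with the unit of measurement...
cov_kg = np.cov(mpg, weight_kg)[0, 1]
cov_lb = np.cov(mpg, weight_lb)[0, 1]

# ...but the correlation coefficient does not
r_kg = np.corrcoef(mpg, weight_kg)[0, 1]
r_lb = np.corrcoef(mpg, weight_lb)[0, 1]

print("covariance (MPG-KG):", cov_kg)
print("covariance (MPG-LB):", cov_lb)
print("r with weight in kg:", r_kg)
print("r with weight in lb:", r_lb)
```

The covariance grows by the conversion factor, while r is the same in both cases.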

Auto-correlation

Auto or self correlation is the correlation of a variable with a lagged version of itself. Auto-correlation is meaningful for variables whose measurement scale is numerical and ordered. Time indexed variables such as air temperature, stock prices or interest rates fall in this category. For example, the air temperature of a place might be (auto)correlated with the air temperature of the same place during the same month of the previous year. Contrast this with a categorical variable such as Cuisine with values American, Chinese, Indian, Italian and Mexican. It’s not meaningful to calculate the autocorrelation of Cuisine.

Just like correlation, auto-correlation can be linear or nonlinear. It can be positive or negative. Its value ranges from -1.0 to +1.0, with 0 meaning not auto-correlated and 1.0 meaning perfectly auto-correlated.

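The formula for auto-correlation of a variable Y_i with a k-lagged version of itself is as follows:

r_k = Σ (Y_i – Y_bar)(Y_(i-k) – Y_bar) / Σ (Y_i – Y_bar)²

Here Y_bar is the mean of the series. The sum in the numerator runs over i = k+1 to n, and the sum in the denominator runs over i = 1 to n.

Notice how similar the structure of the above formula is to that of Pearson’s r.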
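Here is a minimal sketch of a lag-k auto-correlation calculation. It uses the commonly used sample estimator: the sum of products of the mean-centered values that are k steps apart, divided by the total sum of squared mean-centered values. The function name autocorr and the idealized monthly temperature series are assumptions made for this example.

```python
import numpy as np

def autocorr(y, k):
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    # Products of mean-centered values that are k steps apart
    numerator = np.sum((y[k:] - y_bar) * (y[:len(y) - k] - y_bar))
    # Total sum of squares of the mean-centered series
    denominator = np.sum((y - y_bar) ** 2)
    return numerator / denominator

# Idealized monthly temperatures (synthetic): 20 years with a 12-month cycle
months = np.arange(240)
temps = 15.0 + 12.0 * np.sin(2 * np.pi * months / 12)

for lag in (0, 6, 12):
    print("lag", lag, "->", autocorr(temps, lag))
```

For a series with a clean yearly cycle, the value at lag 12 is strongly positive and the value at lag 6 is strongly negative.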

Let’s develop our understanding of auto-correlation a little further by looking at another data set:

Monthly average maximum temperature of Boston, MA from Jan 1998 to Jun 2019. Weather data source: National Centers for Environmental Information (Image by Author)

The above plot shows the average maximum temperature for each month from January 1998 through June 2019 in the city of Boston, Massachusetts.

Let’s plot the temperature against a time lagged version of itself for various lags. Each lag represents one month in the past.

Monthly average maximum temperature of Boston, MA plotted against a lagged version of itself. Data source: National Centers for Environmental Information (Image by Author)

The LAG 12 plot shows a strong positive linear relationship between the average maximum temperature for a month and the average maximum of the same month one year ago.

There is also a strong negative auto-correlation between data points that are six months apart i.e. at LAG 6. This is undoubtedly due to the change of the season.

Overall there is a strong seasonal signal in this data as one might expect to find in weather data of this kind.

In the above plot, we should focus only on boxes where the correlation is linear i.e. roughly a straight line with a positive or negative slope. For cases where the correlation is nonlinear such as at lags 2, 3, 4, 8, 9, and 10, even a large positive or negative value for the auto-correlation coefficient carries no interpretive value.

For the temperatures data, the following auto-correlation heat map shows the correlation between every combination of T and T-k. The column of interest to us is outlined in blue.

Within the first column, the box of interest is the one at (Monthly Average Maximum, TMINUS12) indicating the correlation of temperature with its value from 12 months ago. We see that the correlation coefficient for this box is in excess of +0.8, indicating a strong positive correlation. Similarly, the box for (Monthly Average Maximum, TMINUS6) shows a strong negative correlation, with a coefficient below -0.8.

Here is the Python code for plotting the temperature time series, the scatter plot collage and the heat map:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the monthly temperature series, using the first column as a datetime index
df = pd.read_csv('boston_monthly_tmax_1998_2019.csv', header=0, parse_dates=[0], index_col=[0])
df.plot(marker='.')
plt.show()

# Add one column per lag: TMINUS1 through TMINUS12
for i in range(1, 12+1):
    df['TMINUS' + str(i)] = df['Monthly Average Maximum'].shift(i)

# Scatter plot of the series against each of its 12 lagged versions
for i in range(1, 12+1):
    ax = plt.subplot(4, 3, i)
    ax.set_title('LAG ' + str(i), fontdict={'fontsize': 10})
    plt.scatter(x=df['Monthly Average Maximum'].values, y=df['TMINUS' + str(i)].values, marker='.')
plt.show()

# Heat map of the correlations between the series and its lagged versions
corr = df.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)
plt.show()

Python code for plotting the temperature series, auto-correlation scatter plots and correlation heat map

And here is the link to the temperature data set.
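If you only need the numbers and not the plots, pandas offers Series.autocorr, which computes the Pearson correlation between a series and a shifted copy of itself. The sketch below uses a synthetic stand-in for the temperature series, with the same date range and column name. The values themselves are invented, so the printed coefficients will not match the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 258 months from Jan 1998 to Jun 2019 with a yearly cycle plus noise
rng = np.random.default_rng(5)
index = pd.date_range('1998-01-01', periods=258, freq='MS')
values = 60.0 + 20.0 * np.sin(2 * np.pi * np.arange(258) / 12) + rng.normal(0, 3.0, size=258)
tmax = pd.Series(values, index=index, name='Monthly Average Maximum')

# Auto-correlation at lags of 6 and 12 months
r_lag6 = tmax.autocorr(lag=6)
r_lag12 = tmax.autocorr(lag=12)

# The same number obtained by correlating the series with a shifted copy of itself
r_lag12_manual = tmax.corr(tmax.shift(12))

print("lag 6: ", r_lag6)
print("lag 12:", r_lag12)
```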

Finally, a word of caution

A correlation between two variables X and Y, whether it is linear or nonlinear, does not automatically imply a cause-effect relationship between X and Y. Even when there is a large correlation observed between X and Y, X may not be directly influencing Y or vice versa. There could be a hidden variable Z, called a confounding variable, that is simultaneously influencing both X and Y. With Z playing the role of the behind-the-curtain puppet master, X and Y would appear to rise and fall in sync with each other but what they are really rising and falling in sync with is Z.
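The puppet master effect is easy to reproduce with made-up numbers. In the sketch below, X and Y never influence each other: both are generated from a hidden variable Z plus independent noise. All coefficients are assumptions chosen for illustration. Once the part explained by Z is removed from both variables, the correlation between what is left all but disappears.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Z is the hidden confounder; X and Y each depend on Z but not on each other
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

# X and Y look strongly correlated
r_xy = np.corrcoef(x, y)[0, 1]

def residuals(target, predictor):
    # What remains of target after removing a straight-line fit on predictor
    slope, intercept = np.polyfit(predictor, target, deg=1)
    return target - (intercept + slope * predictor)

# Correlation between X and Y after accounting for Z
r_xy_given_z = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print("correlation of X and Y:                ", r_xy)
print("correlation after removing Z's effect: ", r_xy_given_z)
```

With real data the confounder is usually not observed, which is exactly what makes this check hard to carry out in practice.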

To illustrate how difficult it can be to determine whether real world data has any cause-effect relationships, consider the following graph that shows two data sets plotted against each other.

Plot of total labor force against percentage of people with access to electricity (Data source: World Bank) (Image by Author)

The random variable X is a time series that ranges from 1990 to 2016 and contains the fraction of the world’s population that had access to electricity in each of those years. The random variable Y is also a time series that ranges from 1990 to 2016 and contains the size of the world-wide labor force in each of those years.

The two variables are obviously highly correlated. The pertinent question is this: In any given region of the world, if the percentage of people having access to electricity increases, does it directly cause an increase in the size of the labor force in that region? Alternatively, if the size of the labor force in some region of the world increases, will that directly lead to an increase in the access to electricity for people within that region? Or are confounding economic variables, hidden behind the curtain, acting as the real causes? You be the judge.

Citations and Copyrights

Data set

Automobile MPG Data set: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.