We’ll look at how to model noise, and how to find out whether your data is, for all practical purposes, just noise.
White noise is the variation in your data that cannot be explained by any regression model.
And yet, there happens to be a statistical model for white noise. For time series data, it goes like this:
Y_i = L_i + N_i
The observed value Y_i at time step i is the sum of the current level L_i and a random component N_i around the current level.
If the extent of random variation is proportional to the current level, then we have the following multiplicative version of the same model:
Y_i = L_i * N_i
If the current level L_i is constant for all i, i.e. L_i = L for all i, then the noise will be seen to fluctuate around a fixed level.
It’s easy to generate a white noise data set. Here’s how to do it in Excel:
And here is the output plot of noise that is fluctuating around a constant level of 100:
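For readers who prefer code to spreadsheets, the same kind of series can be sketched in Python. This is a minimal illustration rather than the Excel recipe: the function name `generate_white_noise`, the Gaussian shape of the noise, the standard deviation of 1 and the seed are all assumptions made here. Only the constant level of 100 comes from the text.

```python
import random
import statistics

def generate_white_noise(n, level=100.0, sigma=1.0, seed=42):
    """Return n values that fluctuate randomly around a constant level.

    Assumption: the random component N_i is Gaussian with mean 0.
    """
    rng = random.Random(seed)
    return [level + rng.gauss(0.0, sigma) for _ in range(n)]

noise = generate_white_noise(1000)

# The sample mean should sit close to the level, the spread close to sigma.
print("mean:", round(statistics.mean(noise), 2))
print("std: ", round(statistics.stdev(noise), 2))
```

Plotting `noise` against its index should give a band of points scattered around 100 with no visible drift.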
The current level L_i often changes in response to real world factors. For example, if L_i changes linearly in response to a set of regression variables X, then we get the following linear regression model:
Y_i = β · X_i + N_i
In the above equation, β is the vector of regression coefficients and X_i is a vector of regression variables.
Why is it important to study the white noise model?
There are three reasons why:
- If you discover, using techniques which I will describe soon, that your data is basically white noise around a fixed level, then the best that you can do is fit a model around that fixed level. It would be a waste of time to try to do anything better than that.
- Suppose you have already fitted a regression model to a data set. If you are able to show that the residual errors of the fitted model are white noise, it means your model has done a great job of explaining the variance in the dependent variable. There is nothing left to extract in the way of information and whatever is left is noise. You can pat yourself on the back for a job well done!
- Thirdly, the white noise model happens to be a stepping stone to another important and famous model in statistics called the Random Walk model which I will explain in the next section.
The Random Walk Model
Let’s again look at the White Noise Model’s equation:
Y_i = L_i + N_i
If we make the level L_i at time step i be the output value of the model from the previous time step (i-1), we get the Random Walk model, made famous in the popular literature by Burton Malkiel’s A Random Walk Down Wall Street. In equation form:
Y_i = Y_(i-1) + N_i
The Random Walk model is like the mirage of the Data Science desert. It has lured many profit-thirsty investors into betting (and losing) their shirts on illusions of trends in stock price movements, movements that were in reality little more than a random walk.
Here’s a plot of data that was generated using the Random Walk model:
Just tell me you don’t see any trends in this plot!
If you are not completely convinced that the above data can be generated by a purely random process, let’s puff away any remaining illusions by showing how to generate this data in Excel:
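The same construction can also be sketched in Python: each value is simply the previous value plus a fresh random shock. The function name `generate_random_walk`, the starting value, the Gaussian shocks and the seed are assumptions chosen for illustration, not values taken from the spreadsheet.

```python
import random

def generate_random_walk(n, start=100.0, sigma=1.0, seed=7):
    """Return n values where each value = previous value + random noise.

    Assumption: the noise terms are Gaussian with mean 0.
    """
    rng = random.Random(seed)
    walk = [start]
    for _ in range(n - 1):
        walk.append(walk[-1] + rng.gauss(0.0, sigma))
    return walk

walk = generate_random_walk(500)

# The step-to-step changes are the raw noise terms that built the walk.
steps = [later - earlier for earlier, later in zip(walk, walk[1:])]
print("first values:", [round(v, 2) for v in walk[:5]])
print("largest step:", round(max(abs(s) for s in steps), 2))
```

Any apparent trend in a plot of `walk` comes purely from accumulating these independent steps.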
Let’s look at how we can make use of our knowledge of white noise and random walks to try to detect their presence in time series data.
How to detect white noise in a time series data set
We’ll look at three tests to determine whether your time series is, in reality, just white noise:
- Auto-correlation plots
- The Box-Pierce test
- The Ljung-Box test
Testing for white noise using auto-correlation plots
When two variables tend to move up or down in unison, they are said to be positively correlated. When one tends to go down as the other goes up, they are said to be negatively correlated. The correlation coefficient can be used to measure the degree of linear correlation between two such variables:
r = E[(X - E(X)) (Y - E(Y))] / (σ_X σ_Y)
In the above formula, E(X) and E(Y) are the expected (i.e. mean) values of X and Y. σ_X and σ_Y are the standard deviations of X and Y.
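As a small worked example, here is a sketch of that calculation in plain Python. It assumes the usual Pearson form of the coefficient (the covariance of X and Y divided by the product of their standard deviations), which is what the description of E(X), E(Y), σ_X and σ_Y suggests. The function name and the toy data are made up for illustration.

```python
import math

def pearson_correlation(x, y):
    """Sample version of E[(X - E(X))(Y - E(Y))] / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return covariance / (sigma_x * sigma_y)

x = [1, 2, 3, 4, 5]
print(pearson_correlation(x, [2, 4, 6, 8, 10]))  # moves in unison with x
print(pearson_correlation(x, [10, 8, 6, 4, 2]))  # moves opposite to x
```

The first pair should give a coefficient at (or within rounding error of) +1 and the second at -1, the two extremes of perfect linear correlation.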
In time series data, correlations often exist between the current value and values that are 1 time step or more older than the current value, i.e. between Y_i and Y_(i-1), between Y_i and Y_(i-2) and so on. Stock price changes often show such patterns of positive and negative correlations (and beware, so do data containing random walks!).
Because the values are correlated with past versions of themselves, we call them auto-correlated, meaning self-correlated.
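Here is the formula for calculating the auto-correlation coefficient r_k between Y_i and Y_(i-k), where Ȳ is the mean of all n values in the series:
r_k = Σ_{i=k+1..n} (Y_i - Ȳ)(Y_(i-k) - Ȳ) / Σ_{i=1..n} (Y_i - Ȳ)²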
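The coefficient can be computed directly in a few lines of Python. This sketch assumes the usual sample estimator, in which deviations are taken from the overall mean of the series and the lag-k cross products are divided by the total sum of squared deviations. The function name and the two toy series are illustrative.

```python
def autocorrelation(y, k):
    """Lag-k sample auto-correlation coefficient r_k of the sequence y."""
    n = len(y)
    mean = sum(y) / n
    # Total sum of squared deviations from the mean (the lag-0 term).
    denominator = sum((value - mean) ** 2 for value in y)
    # Cross products between each value and the value k steps earlier.
    numerator = sum((y[i] - mean) * (y[i - k] - mean) for i in range(k, n))
    return numerator / denominator

trend = [1, 2, 3, 4, 5]
zigzag = [1, -1, 1, -1, 1, -1]

print(autocorrelation(trend, 1))   # positive: neighbours tend to sit on the same side of the mean
print(autocorrelation(zigzag, 1))  # negative: neighbours always sit on opposite sides of the mean
```

With k = 0 the numerator and denominator coincide, which is why the lag-0 coefficient is always exactly 1.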
Before we can show how this auto-correlation coefficient r_k can be used to detect white noise, we need to take a short and pleasant side-trip into the land of random variables. I’ll explain why r_k is a normally distributed random variable and how this property of r_k can be used to detect white noise.
Distribution of the LAG-k auto-correlation coefficient r_k
For any lag k, r_k is (approximately, for reasonably large samples) a normally distributed random variable with some mean µ_k and variance σ²_k.
To understand why, consider this thought experiment:
- Take a time series data set containing 100,000 time points.
- Draw 5000 randomly selected samples from this data set. Suppose each sample is of length 100 continuous time points.
- For each sample, calculate the LAG-1 auto-correlation coefficient r_1 using the above formula for r_k.
- Each time, r_1 will come out to be some value between -1 and 1 for that sample of 100 time points. So we end up with 5000 values of r_1, each a number between -1 and 1. Thus r_1 is a random variable for which we have measured 5000 values.
- By appealing to the limit theorems of statistics, it can be shown that r_1 is approximately normally distributed around some population mean, which we’ll call µ_1, with some variance, which we’ll call σ²_1. In practice, µ_1 and σ²_1 can be estimated by the mean and variance of the 5000 values of r_1 which we measured.
- By repeating the above experiment for all lags k, it can be shown that auto-correlation coefficients for all lags are normally distributed random variables with mean µ_k and variance σ²_k.
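The thought experiment is cheap to run for real. The sketch below assumes the underlying series is Gaussian white noise (an assumption made here so that the outcome is easy to interpret; the thought experiment itself applies to any series) and uses a fixed seed and a helper name chosen for illustration.

```python
import random
import statistics

def lag1_autocorrelation(y):
    """Lag-1 sample auto-correlation coefficient r_1 of the sequence y."""
    n = len(y)
    mean = sum(y) / n
    denominator = sum((v - mean) ** 2 for v in y)
    numerator = sum((y[i] - mean) * (y[i - 1] - mean) for i in range(1, n))
    return numerator / denominator

rng = random.Random(0)
# Assumed data: 100,000 points of Gaussian white noise.
series = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

sample_length = 100
r1_values = []
for _ in range(5000):
    # Pick a random window of 100 contiguous time points.
    start = rng.randrange(len(series) - sample_length + 1)
    window = series[start:start + sample_length]
    r1_values.append(lag1_autocorrelation(window))

print("mean of r_1:", round(statistics.mean(r1_values), 3))
print("std of r_1: ", round(statistics.stdev(r1_values), 3))
print("min, max:   ", round(min(r1_values), 3), round(max(r1_values), 3))
```

A histogram of `r1_values` should look roughly bell-shaped, and the printed minimum shows that individual samples do produce negative coefficients.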
Implications for detecting white noise
If the time series is white noise, then in theory, its current value Y_i ought not to be correlated at all with past values Y_(i-1), Y_(i-2) etc., and the corresponding auto-correlation coefficients r_1, r_2, … will be zero or close to zero.
i.e. when the time series is white noise, the expected value of r_k is 0 for all k = 1, 2, 3, …
But we have just seen that r_k is a N(µ_k, σ²_k) random variable.
Putting the above two facts together, we arrive at the following first important implication:
If the time series is white noise, then the auto-correlation coefficient r_k for all lags k will have a zero mean and some variance σ²_k.
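But what about the variance σ²_k of the coefficients r_k? It can be shown (see the references at the end of this article) that for white noise, the standard deviation of r_k is approximately:
σ_k = 1/sqrt(n)
where n is the sample size. Recall that in our thought experiment, n was 100.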
Thus, we know that r_k under white noise conditions has (approximately) the following distribution:
r_k ~ N(0, 1/n)
An important property of the normal distribution is that approximately 95% of it lies within 1.96 standard deviations of the mean. In our case, the mean is 0 and the standard deviation is 1/sqrt(n), so we get the following 95% confidence interval for the auto-correlation coefficients:
[-1.96/sqrt(n), +1.96/sqrt(n)]
These results yield the following procedure for conducting the white noise test using the auto-correlation coefficients r_k:
- Calculate the auto-correlation coefficients r_1, r_2, …, r_k for the first k lags. k can be set to some high enough value depending on the length n of the time series data set.
- Calculate the 95% confidence interval [-1.96/sqrt(n), +1.96/sqrt(n)].
- If r_k lies within the above confidence interval for all k, conclude at a 95% confidence level that the time series is, in reality, possibly just white noise. We say possibly because if we experiment with larger sample sizes, i.e. larger n, the size of the confidence interval will shrink, and values of r_k that were previously inside the 95% bounds may now lie outside the 95% bounds.
- If any of the r_k lie outside the confidence interval, then the time series possibly has information in it.
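The procedure above can be sketched in plain Python as follows. The helper names, the series length of 1000, the 40 lags and the seed are illustrative assumptions. The random walk is included only as a contrasting series that is clearly not white noise.

```python
import math
import random

def autocorrelation(y, k):
    """Lag-k sample auto-correlation coefficient r_k of the sequence y."""
    n = len(y)
    mean = sum(y) / n
    denominator = sum((v - mean) ** 2 for v in y)
    numerator = sum((y[i] - mean) * (y[i - k] - mean) for i in range(k, n))
    return numerator / denominator

def lags_outside_bounds(y, max_lag, z=1.96):
    """Return the lags 1..max_lag whose r_k falls outside +/- z/sqrt(n)."""
    bound = z / math.sqrt(len(y))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(y, k)) > bound]

rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# A random walk built by accumulating the same noise, for contrast.
walk = []
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)

outside_noise = lags_outside_bounds(noise, 40)
outside_walk = lags_outside_bounds(walk, 40)
print("white noise, lags outside the bounds:", outside_noise)
print("random walk, lags outside the bounds:", len(outside_walk), "of 40")
```

One practical caveat: with 40 lags and a 95% interval, about 2 lags (5% of 40) can be expected to fall outside the bounds purely by chance even for genuine white noise, so a stray lag or two is weak evidence on its own.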
Example: White noise detection using Python
Let’s illustrate the above procedure using a real world time series of 5000 decibel level measurements taken at a restaurant using the Google Science Journal app.
The restaurant decibel levels data set can be downloaded from here.
We’ll use the pandas library to load the data set from the csv file and plot it:
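import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# The index_col value was lost in the original listing; using the first column as the index is an assumption
df = pd.read_csv('restaurant_decibel_level.csv', header=0, index_col=0)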
Let’s print the top 10 rows:
Let’s plot all 5000 values in the series:
Let’s fetch and plot the auto-correlation coefficients for the first 40 lags. We’ll use the statsmodels library to do that.
import statsmodels.graphics.tsaplots as tsa

tsa.plot_acf(df['Decibel'], lags=40, alpha=0.05, title='Auto-correlation coefficients for lags 1 through 40')
plt.show()
The alpha=0.05 tells statsmodels to also plot the 95% confidence interval region. We get the following plot:
As we can see, the time series contains significant auto-correlations up through lag 17. Incidentally, the auto-correlation at lag 0 is always 1.0, as a value is always perfectly correlated with itself.
There is a wave-like pattern in the auto-correlation plot, which indicates that there could be some seasonality contained in the data. We can try to identify and isolate the seasonality by decomposing the time series into the trend, seasonality and noise components.
For now we’ll focus on the noise portion. The bottom line is that this time series, in its current form, does not appear to be pure white noise.
Next, we’ll run two more tests on the time series to confirm this.
The Chi-squared test for white noise detection
The Chi-squared test is based on this powerful result in statistics: the sum of squares of k independent standard normal random variables is a Chi-squared distributed random variable with k degrees of freedom.
The actual test is called the Box-Pierce test and its test statistic is called the Q statistic. Its formula is as follows:
Q = n * Σ_{i=1..k} (r_i)²
Here, n is the number of data points in the time series and r_i is the auto-correlation coefficient at lag i. Under white noise, each r_i is approximately N(0, 1/n), so sqrt(n) * r_i is approximately standard normal and Q is a sum of k squared standard normal variables.
It can be shown that if the underlying data set is white noise, Q is approximately Chi-squared distributed with k degrees of freedom, so its value will be small: its expected value is about k.
For any given time series, one can check whether the value of Q is too large to be explained by chance by looking up the p-value of the test statistic in the Chi-square tables for k degrees of freedom. Usually, a p-value of less than 0.05 indicates a significant auto-correlation that cannot be attributed to chance.
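To make the test concrete, here is a sketch that computes the Box-Pierce statistic as n times the sum of the first k squared auto-correlation coefficients, and then looks up its p-value. The helper names are made up for illustration. To stay within the standard library, the p-value uses a closed-form expression for the Chi-squared tail that is exact only when k is even. That restriction is a shortcut of this sketch, not part of the test.

```python
import math
import random

def autocorrelation(y, k):
    """Lag-k sample auto-correlation coefficient r_k of the sequence y."""
    n = len(y)
    mean = sum(y) / n
    denominator = sum((v - mean) ** 2 for v in y)
    numerator = sum((y[i] - mean) * (y[i - k] - mean) for i in range(k, n))
    return numerator / denominator

def box_pierce_q(y, k):
    """Q = n * (r_1^2 + ... + r_k^2)."""
    return len(y) * sum(autocorrelation(y, lag) ** 2 for lag in range(1, k + 1))

def chi2_sf_even(x, k):
    """P(Chi-square(k) > x), using a closed form valid only for even k."""
    if k % 2 != 0:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k // 2))

rng = random.Random(3)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Accumulate the noise into a random walk to get a strongly auto-correlated series.
walk, level = [], 0.0
for shock in noise:
    level += shock
    walk.append(level)

k = 10
q_noise, q_walk = box_pierce_q(noise, k), box_pierce_q(walk, k)
p_noise, p_walk = chi2_sf_even(q_noise, k), chi2_sf_even(q_walk, k)
print("white noise: Q =", round(q_noise, 2), " p-value =", round(p_noise, 3))
print("random walk: Q =", round(q_walk, 2), " p-value =", p_walk)
```

For the white noise series Q should be of the same order as k, while for the random walk it should be in the thousands, with a p-value that is zero to machine precision.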
The Ljung-Box test for white noise detection
The Ljung-Box test improves upon the Box-Pierce test to obtain a test statistic having a distribution that is closer to the Chi-square distribution than the Q statistic. The test statistic of the Ljung-Box test is calculated as follows, and it is also Chi-square(k) distributed:
Q* = n (n + 2) * Σ_{i=1..k} (r_i)² / (n - i)
Here, n is the number of data points in the time series and k is the number of time lags to be considered. As with the Box-Pierce test, if the underlying data set is white noise, this statistic is approximately Chi-square(k) distributed and its expected value is about k. Again, a p-value of less than 0.05 indicates a significant auto-correlation that cannot be attributed to chance.
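The relationship between the two statistics is easiest to see side by side. The sketch below uses the standard Ljung-Box weighting, in which the squared coefficient at lag i is divided by (n - i) and the sum is scaled by n(n + 2). The auto-correlation values in `r` are hypothetical numbers made up for illustration, not taken from any data set in this article.

```python
def box_pierce(r, n):
    """Box-Pierce statistic: n * sum of squared auto-correlations r_1..r_k."""
    return n * sum(r_i ** 2 for r_i in r)

def ljung_box(r, n):
    """Ljung-Box statistic: n (n + 2) * sum of r_i^2 / (n - i) for i = 1..k."""
    return n * (n + 2) * sum(r_i ** 2 / (n - i) for i, r_i in enumerate(r, start=1))

# Hypothetical auto-correlations for lags 1..5 (illustrative values only).
r = [0.12, -0.08, 0.05, 0.10, -0.03]

for n in (50, 500, 5000):
    q_bp = box_pierce(r, n)
    q_lb = ljung_box(r, n)
    print(n, round(q_bp, 3), round(q_lb, 3), "ratio:", round(q_lb / q_bp, 4))
```

Because (n + 2)/(n - i) is always greater than 1, the Ljung-Box statistic is slightly larger than the Box-Pierce statistic. The gap matters for short series and all but vanishes as n grows.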
Example: Testing for white noise using the Ljung-Box test in Python
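Let’s run the Ljung-Box test on the restaurant decibel level data set. We will test up to 40 lags and we’ll ask the test to also run the Box-Pierce test.
import statsmodels.stats.diagnostic as diag

# lags=[40] uses all lags up to 40 and reports one statistic, matching the output shown below.
# The lags value was lost in the original listing and is inferred from the text.
# return_df=None from the original call is left out so that the library default applies:
# older statsmodels releases return a tuple of arrays, newer ones return a DataFrame.
diag.acorr_ljungbox(df['Decibel'], lags=[40], boxpierce=True, model_df=0, period=None)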
We get the following output:
(array([13172.80554476]), array([0.]), array([13156.42074648]), array([0.]))
The value 13172.80554476 is the value of the test statistic for the Ljung-Box test and 0.0 is its p-value as per the Chi-square(k=40) table.
The value 13156.42074648 is the test statistic of the Box-Pierce test and 0.0 is its p-value as per the Chi-square(k=40) tables.
As we can see, both p-values are less than 0.01 and so we can say with 99% confidence that the restaurant decibel level time series is not pure white noise.
Earlier on, we introduced Random Walks as a special case of the White Noise model and pointed out how easy it is to mistake them for a pattern or trend that can be predicted.
We’ll look at how to avoid making this mistake by applying a technique that will bring out the true random nature of the Random Walk.
Detecting Random Walks
Random walks are often highly auto-correlated. In fact, they are accumulated, and therefore auto-correlated, white noise!
The white noise detection tests presented above will latch onto these auto-correlations, causing them to conclude that the time series is not white noise.
The remedy is to take the first difference of the time series that is suspected to be a random walk, and run the white noise tests on the differenced series.
If the original time series is a random walk, its first difference is pure white noise.
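The reason follows directly from the Random Walk model described earlier, in which each value is the previous value plus a noise term N_i. The short derivation below uses Y_0 for the starting value of the walk, a symbol introduced here for convenience.

```latex
\begin{aligned}
Y_i &= Y_{i-1} + N_i \\
Y_i - Y_{i-1} &= N_i \\
Y_i &= Y_0 + \sum_{j=1}^{i} N_j
\end{aligned}
```

The second line says that the differenced series is exactly the noise sequence. The third line shows why the undifferenced walk is so strongly auto-correlated: Y_i and Y_(i-1) share every noise term except the latest one.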
Let’s illustrate this:
We’ll start by loading a data set that is suspected to be a random walk. The data set can be downloaded from here.
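# The index_col value was lost in the original listing; using the first column as the index is an assumption
df = pd.read_csv('random_walk.csv', header=0, index_col=0)

# Let’s plot it to see what the data looks like:
df.plot()
plt.show()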
Let’s run the Ljung-Box white noise test on this data:
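# lags=[40] is inferred from the Chi-square(k=40) tables mentioned in the discussion of the results
diag.acorr_ljungbox(df['Y_i'], lags=[40], boxpierce=True)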
We get the following result:
(array([393833.91252517]), array([0.]), array([392952.07675659]), array([0.]))
The p-value of 0.0 indicates that we must strongly reject the null hypothesis that the data is white noise. Both the Ljung-Box and Box-Pierce tests conclude that this data set has not been generated by a pure random process.
This is misleading: the tests are right that the series is not white noise, but the series was in fact generated by a purely random process.
Let’s see if things change after we take the first difference of the data, i.e. we create a new data set with values Y_i - Y_(i-1):
diff_Y_i = df['Y_i'].diff()

# Drop the NaN in the first row
diff_Y_i = diff_Y_i.dropna()

# Let’s plot the differenced data set
diff_Y_i.plot()
plt.show()
A very different picture emerges:
Here is the zoomed in view:
Let’s run the Ljung-Box test on the differenced data set:
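# lags=[40] is inferred from the Chi-square(k=40) tables mentioned in the discussion of the results
diag.acorr_ljungbox(diff_Y_i, lags=[40], boxpierce=True)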
We get the following output:
(array([32.93405364]), array([0.77822417]), array([32.85051846]), array([0.78137548]))
Notice that this time the test statistic values, 32.934 reported by the Ljung-Box test and 32.850 reported by the Box-Pierce test, are much smaller. The corresponding p-values from the Chi-square(k=40) tables are 0.778 and 0.781 respectively, which are well above 0.05. We therefore cannot reject the null hypothesis that the data (i.e. the differenced time series) is pure white noise.
The conclusion to be drawn from this exercise is that one should not fit anything except the White Noise model on this data.
- The white noise model can be used to represent the nature of noise in a data set.
- Testing for white noise is one of the first things that a data scientist should do, so as to avoid spending time fitting models to data sets that offer no meaningfully extractable information.
- If a data set is not white noise, then after fitting a model to the data, one should run a white noise test on the residual errors to get a sense for how much information the model has been able to extract from the data.
- For time series data, auto-correlation plots and the Ljung-Box test offer two useful techniques for determining whether the time series is, in reality, just white noise.
References, Citations and Copyrights
Anderson, R. L., Distribution of the Serial Correlation Coefficient, Annals of Mathematical Statistics, Volume 13, Number 1 (1942), 1–13.
Bartlett, M. S. On the Theoretical Specification and Sampling Properties of Autocorrelated Time-Series. Supplement to the Journal of the Royal Statistical Society, vol. 8, no. 1, 1946, pp. 27–41. JSTOR, http://www.jstor.org/stable/2983611.
Quenouille, M. H., The Joint Distribution of Serial Correlation Coefficients, The Annals of Mathematical Statistics, Vol. 20, No. 4 (Dec., 1949), pp. 561–571
Hyndman, R. J., Athanasopoulos, G., Forecasting: Principles and Practice, OTexts