###### We’ll look at how to model noise, and how to find out whether your data is, for all practical purposes, just noise

White noise is the variation in your data that cannot be explained by any regression model.

And yet, there happens to be a statistical model for white noise. It goes like this for time series data:

*Y_i = L_i + N_i*

Here, the observed value *Y_i* at time step *i* is the sum of the current level *L_i* and a random component *N_i* around the current level.

If the extent of random variation is proportional to the current level, then we have the following multiplicative version of the same model:

*Y_i = L_i × N_i*

If the current level *L_i* is constant for all *i*, i.e. *L_i = L* for all *i*, then the noise will be seen to fluctuate around a fixed level.

It’s easy to generate a white noise data set. Here’s how to do it in Excel:

And here is the output plot of noise that is fluctuating around a constant level of 100:
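If you would rather not use a spreadsheet, the same kind of data can be generated in Python. The sketch below is only an illustration: it assumes the noise *N_i* is normally distributed with mean 0 and standard deviation 10. The distribution, the seed and the series length are my assumptions, not values taken from the Excel example. It adds that noise to a constant level of 100.

```python
import numpy as np

# Assumed parameters, chosen for illustration only
LEVEL = 100.0      # the constant level L
NOISE_SD = 10.0    # standard deviation of the random component N_i
N_POINTS = 1000    # length of the generated series

rng = np.random.default_rng(seed=42)

# N_i: independent zero-mean draws (Gaussian is an assumption)
noise = rng.normal(loc=0.0, scale=NOISE_SD, size=N_POINTS)

# Additive white noise model with a constant level: Y_i = L + N_i
white_noise = LEVEL + noise

print(f"mean = {white_noise.mean():.2f}, std = {white_noise.std():.2f}")
```

The printed mean should land close to 100 and the standard deviation close to 10, which is what fluctuating around a fixed level looks like numerically.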

The current level *L_i* often changes in response to real world factors. For example, if *L_i* changes linearly in response to a set of regression variables ***X***, then we get the following linear regression model:

*Y_i = β · X_i + N_i*

In the above equation, ***β*** is the vector of regression coefficients and ***X_i*** is a vector of regression variables.

## Why is it important to study the white noise model?

There are three reasons why:

- If you discover, using techniques that I will describe soon, that your data is basically white noise around a fixed level, then the best that you can do is fit a model around that fixed level. It will be a waste of time to try to do anything better than that.
- Suppose you have already fitted a regression model to a data set. If you are able to show that the residual errors of the fitted model are white noise, it means your model has done a great job of explaining the variance in the dependent variable. There is nothing left to extract in the way of information and whatever is left is noise. You can pat yourself on the back for a job well done!
- Thirdly, the white noise model happens to be a stepping stone to another important and famous model in statistics called the Random Walk model which I will explain in the next section.

## The Random Walk Model

Let’s again look at the White Noise Model’s equation:

*Y_i = L_i + N_i*

If we make the level *L_i* at time step *i* be the output value of the model from the previous time step *(i-1)*, we get the **Random Walk model**, made famous in the popular literature by Burton Malkiel’s *A Random Walk Down Wall Street*:

*Y_i = Y_(i-1) + N_i*

The Random Walk model is like the mirage of the Data Science desert. It has lured many profit-thirsty investors into betting (and losing) their shirts on illusions of trends in stock price movements, movements that were in reality little more than a random walk.

Here’s a plot of data that was generated using the Random Walk model:

Just tell me you don’t see any trends in this plot!

If you are not completely convinced that the above data can be generated by a purely random process, let’s puff away any remaining illusions by showing how to generate this data in Excel:
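For readers who prefer Python, the same process can be sketched in a few lines. Each value is the previous value plus a fresh random step, which is just a cumulative sum of white noise. The Gaussian steps, the starting level of 100 and the seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

N_POINTS = 1000
START_LEVEL = 100.0   # assumed starting level

# N_i: white noise steps (Gaussian with unit variance is an assumption)
noise = rng.normal(loc=0.0, scale=1.0, size=N_POINTS)

# Random walk: Y_i = Y_(i-1) + N_i, i.e. a running sum of the noise
random_walk = START_LEVEL + np.cumsum(noise)

# Sanity check: successive differences recover the noise terms
steps_recovered = np.diff(random_walk)
print(np.allclose(steps_recovered, noise[1:]))
```

Plotting `random_walk` with different seeds tends to produce curves with convincing-looking "trends", even though nothing but random steps went into them.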

Let’s look at how we can make use of our knowledge of white noise and random walks to try to detect their presence in time series data.

## How to detect white noise in a time series data set

We’ll look at 3 tests to determine whether your time series is, in reality, just white noise:

- Auto-correlation plots
- The Box-Pierce test
- The Ljung-Box test

## Testing for white noise using auto-correlation plots

When two variables move up or down in unison, they are said to be positively correlated; when one goes up as the other goes down, they are negatively correlated. The correlation coefficient can be used to measure the degree of *linear* correlation between two such variables:

*r_XY = E[(X − E(X)) × (Y − E(Y))] / (σ_X × σ_Y)*

In the above formula, *E(**X**)* and *E(**Y**)* are the expected (i.e. mean) values of ***X*** and ***Y***. *σ_X* and *σ_Y* are the standard deviations of ***X*** and ***Y***.

In time series data, correlations often exist between the current value and values that are one or more time steps older than the current value, i.e. between *Y_i* and *Y_(i-1)*, between *Y_i* and *Y_(i-2)* and so on. Stock price changes often show such patterns of positive and negative correlations (and beware, so do data containing random walks!).

Because the values are correlated with past versions of themselves, we call them auto-correlated, meaning self-correlated.

Here is the formula for calculating the auto-correlation coefficient between *Y_i* and *Y_(i-k)*:
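The coefficient *r_k* can also be computed directly in a few lines of Python. The sketch below uses one common sample estimator: the lag-*k* sum of cross-products of deviations from the mean, divided by the total sum of squared deviations. I am assuming this is the form intended here, and the function name `autocorr` is my own.

```python
import numpy as np

def autocorr(y, k):
    """Sample auto-correlation between Y_i and Y_(i-k).

    Uses the estimator
        r_k = sum_{i=k+1..n} (Y_i - mean)(Y_(i-k) - mean) / sum_{i=1..n} (Y_i - mean)^2
    which is assumed to match the formula referred to in the text.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()                      # deviations from the overall mean
    numerator = np.sum(d[k:] * d[:n - k]) # pairs that are k steps apart
    denominator = np.sum(d * d)
    return float(numerator / denominator)

# Tiny worked example: deviations are [-2, -1, 0, 1, 2]
r_1 = autocorr([1, 2, 3, 4, 5], k=1)
print(r_1)  # (2 + 0 + 0 + 2) / 10 = 0.4
```

With `k=0` the numerator and denominator coincide, so the function returns 1.0.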

Before we can show how this auto-correlation coefficient *r_k* can be used to detect white noise, we need to take a short and pleasant side-trip into the land of random variables. I’ll explain why *r_k* is a normally distributed random variable and how this property of *r_k* can be used to detect white noise.

## Distribution of the LAG-k auto-correlation coefficient *r_k*

For any lag *k*, *r_k* is a normally distributed random variable with some mean *µ_k* and variance *σ²_k*.

To understand why, consider this thought experiment:

- Take a time series data set containing 100,000 time points.
- Draw 5000 randomly selected samples from this data set. Suppose each sample is made up of 100 *contiguous* time points.
- For each sample, calculate the LAG-1 auto-correlation coefficient *r_1* using the above formula for *r_k*.
- Each time, *r_1* will come out to be some value between −1 and 1. So we end up with 5000 values of *r_1*, each a number between −1 and 1. Thus *r_1* is a random variable for which we have measured 5000 values.
- By appealing to the Limit Theorems of statistics, *it can be shown that r_1 is a normally distributed random variable*, centered at some population mean, which we’ll call *µ_1*, with some variance, which we’ll call *σ²_1*. In practice, the mean and variance of the 5000 measured values of *r_1* serve as estimates of *µ_1* and *σ²_1*.
- By repeating the above experiment for all lags *k*, it can be shown that *auto-correlation coefficients for all lags are normally distributed random variables* with mean *µ_k* and variance *σ²_k*.

*Symbolically:*

*r_k ~ N(µ_k, σ²_k)*

## Implications for detecting white noise

If the time series is white noise, then in theory, its current value *Y_i* ought not to be correlated at all with past values *Y_(i-1), Y_(i-2)* etc., and the corresponding auto-correlation coefficients *r_1, r_2,…* etc. will be zero or close to zero.

i.e. when the time series is white noise, *r_k* is 0 in theory for all *k = 1, 2, 3,…*

But we have just seen that *r_k* is a *N(µ_k, σ²_k) *random variable.

Putting the above two facts together, we arrive at the following first important implication:

If the time series is white noise, then the auto-correlation coefficient *r_k* for all lags k will have a **zero mean** and some variance *σ²_k.*

*Symbolically:*

*r_k ~ N(0, σ²_k)*

But what about the variance *σ²_k *of the coefficients *r_k*?

Anderson, Bartlett and Quenouille have shown that under white noise conditions, the standard deviation *σ_k *is as follows:

*σ_k = 1/sqrt(n)*

Where *n* is the sample size. Recollect that in our thought experiment, *n* was 100.
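As a quick sanity check, the thought experiment from earlier can be simulated. The sketch below assumes that the 100,000-point series is itself Gaussian white noise (my assumption, so that the white noise result applies), draws 5000 contiguous samples of 100 points each, and looks at the mean and standard deviation of the 5000 values of *r_1*.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed data: 100,000 points of Gaussian white noise
series = rng.normal(loc=0.0, scale=1.0, size=100_000)

N_SAMPLES = 5000
SAMPLE_LEN = 100   # n in the text

def lag1_autocorr(y):
    """LAG-1 sample auto-correlation coefficient r_1 of one sample."""
    d = y - y.mean()
    return np.sum(d[1:] * d[:-1]) / np.sum(d * d)

# Random start positions of contiguous windows of length SAMPLE_LEN
starts = rng.integers(0, len(series) - SAMPLE_LEN + 1, size=N_SAMPLES)
r1_values = np.array(
    [lag1_autocorr(series[s:s + SAMPLE_LEN]) for s in starts]
)

print("mean of r_1:", r1_values.mean())
print("std  of r_1:", r1_values.std())
print("1/sqrt(n)  :", 1 / np.sqrt(SAMPLE_LEN))
```

The mean of the *r_1* values should come out near 0 and their standard deviation near *1/sqrt(100) = 0.1*. A histogram of `r1_values` gives a visual feel for the bell shape.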

Thus, we know that *r_k* under white noise conditions has the following distribution:

*r_k ~ N(0, 1/n)*

An important property of the normal distribution is that approximately 95% of it lies within 1.96 standard deviations of the mean. In our case, the mean is 0 and standard deviation is *1/sqrt(n)*, so we get the following 95% confidence interval for the auto-correlation coefficients:

*[−1.96/sqrt(n), +1.96/sqrt(n)]*

These results yield the following procedure for conducting the white noise test using the auto-correlation coefficients *r_k*:

- Calculate the first *k* auto-correlation coefficients *r_k*. *k* can be set to some high enough value depending on the length *n* of the time series data set.
- Calculate the 95% confidence interval *[−1.96/sqrt(n), +1.96/sqrt(n)]*.
- If *r_k* lies within the above confidence interval for all *k*, conclude at a 95% confidence level that the time series is, in reality, *possibly* just white noise. We say *possibly* because if we experiment with larger sample sizes, i.e. larger *n*, the size of the confidence interval will shrink, and values of *r_k* that were previously inside the 95% bounds may now lie outside them.
- If any of the *r_k* lie outside the confidence interval, then the time series *possibly* has information in it.
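The procedure above can be sketched in plain NumPy as follows. The helper names (`acf_values`, `lags_outside_band`), the choice of 40 lags and the two synthetic test series are my own assumptions for illustration; the auto-correlation estimator is the common sample version.

```python
import numpy as np

def acf_values(y, max_lag):
    """Sample auto-correlation coefficients r_1 ... r_max_lag."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    denom = np.sum(d * d)
    return np.array(
        [np.sum(d[k:] * d[:n - k]) / denom for k in range(1, max_lag + 1)]
    )

def lags_outside_band(y, max_lag=40, z=1.96):
    """Return the lags whose r_k falls outside [-z/sqrt(n), +z/sqrt(n)]."""
    bound = z / np.sqrt(len(y))
    r = acf_values(y, max_lag)
    return [k for k, r_k in enumerate(r, start=1) if abs(r_k) > bound]

# Two synthetic series to try the check on (assumed data)
rng = np.random.default_rng(seed=3)
noise = rng.normal(size=5000)   # white noise
walk = np.cumsum(noise)         # random walk built from that noise

outside_noise = lags_outside_band(noise)
outside_walk = lags_outside_band(walk)
print("white noise, lags outside the band:", outside_noise)
print("random walk, number of lags outside:", len(outside_walk))
```

One caution when reading the result: at a 95% level, roughly 1 in 20 coefficients can be expected to fall outside the band purely by chance, so with 40 lags a couple of stray values do not by themselves rule out white noise.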

## Example: White noise detection using Python

Let’s illustrate the above procedure using a real world time series of 5000 decibel level measurements taken at a restaurant using the Google Science Journal app.

The restaurant decibel levels data set can be downloaded from here.

We’ll use the pandas library to load the data set from the csv file and plot it:

```
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv('restaurant_decibel_level.csv', header=0, index_col=[0])
```

Let’s print the top 10 rows:

```
df.head(10)
```

| TimeIndex | Decibel |
|---|---|
| 0 | 55.931323 |
| 40 | 57.779260 |
| 80 | 62.956952 |
| 140 | 65.158100 |
| 180 | 60.325242 |
| 220 | 45.411725 |
| 262 | 55.958807 |
| 300 | 62.021807 |
| 340 | 62.222563 |
| 380 | 56.156684 |

Let’s plot all 5000 values in the series:

Let’s fetch and plot the auto-correlation coefficients for the first 40 lags. We’ll use the statsmodels library to do that.

```
import statsmodels.graphics.tsaplots as tsa
tsa.plot_acf(df['Decibel'], lags=40, alpha=0.05, title='Auto-correlation coefficients for lags 1 through 40')
```

The alpha=0.05 tells statsmodels to also plot the 95% confidence interval region. We get the following plot:

As we can see, the time series contains significant auto-correlations up through lag 17. Incidentally, the auto-correlation at lag 0 is always 1.0 as a value is always perfectly correlated with itself.

There is a wave-like pattern in the auto-correlation plot that indicates that there could be some seasonality contained in the data. We can try to identify and isolate the seasonality by decomposing the time series into the trend, seasonality and noise components.

For now we’ll focus on the noise portion. The bottom line is that this time series, in its current form, does not appear to be pure white noise.

Next, we’ll run two more tests on the time series to confirm this.

## The Chi-squared test for white noise detection

The Chi-squared test is based on this powerful result in statistics: the sum of squares of *k* independent standard normal random variables is a Chi-squared distributed random variable with *k* degrees of freedom.

The actual test is called the **Box-Pierce** test and its test statistic is called the Q statistic. Its formula is as follows:

*Q = n × (r_1² + r_2² + … + r_k²)*

It can be shown that if the underlying data set is white noise, the Q statistic approximately follows a Chi-square distribution with *k* degrees of freedom, so its value stays small (around *k* on average).

For any given time series, one can check whether the value of Q is larger than white noise would plausibly produce by looking up the p-value of the test statistic in the Chi-square tables for *k* degrees of freedom. Usually, a p-value of less than 0.05 indicates a significant auto-correlation that cannot be attributed to chance.
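To make the Q statistic concrete, here is a minimal NumPy sketch that computes it by hand as *n* times the sum of the first *k* squared auto-correlation coefficients. The function names and the synthetic series are assumptions of mine. Instead of a p-value, it compares Q with 55.76, the 5% critical value for 40 degrees of freedom as read from a standard Chi-square table.

```python
import numpy as np

def autocorr(y, k):
    """Sample auto-correlation coefficient r_k (common estimator)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    return np.sum(d[k:] * d[:n - k]) / np.sum(d * d)

def box_pierce_q(y, max_lag):
    """Box-Pierce statistic: Q = n * (r_1^2 + ... + r_max_lag^2)."""
    n = len(y)
    r = np.array([autocorr(y, k) for k in range(1, max_lag + 1)])
    return float(n * np.sum(r ** 2))

# 5% critical value of Chi-square with 40 degrees of freedom (table value)
CHI2_CRIT_40_DF_5PCT = 55.76

rng = np.random.default_rng(seed=11)
noise = rng.normal(size=5000)   # synthetic white noise (assumed data)
walk = np.cumsum(noise)         # synthetic random walk (assumed data)

q_noise = box_pierce_q(noise, max_lag=40)
q_walk = box_pierce_q(walk, max_lag=40)

print("Q for white noise:", q_noise)
print("Q for random walk:", q_walk)
print("random walk rejected as white noise:", q_walk > CHI2_CRIT_40_DF_5PCT)
```

For white noise, Q is typically of the same order as the number of lags, while for a strongly auto-correlated series it is larger by orders of magnitude.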

## The Ljung-Box test for white noise detection

The Ljung-Box test improves upon the Box-Pierce test to obtain a test statistic having a distribution that is closer to the Chi-square distribution than that of the Q statistic. The test statistic of the Ljung-Box test is calculated as follows, and it is also Chi-square(k) distributed:

*Q_LB = n × (n + 2) × [r_1²/(n − 1) + r_2²/(n − 2) + … + r_k²/(n − k)]*

Here, *n* is the number of data points in the time series and *k* is the number of time lags to be considered. As with the Box-Pierce test, if the underlying data set is white noise, the value of this Chi-square distributed random variable stays small (around *k* on average). Again, a p-value of less than 0.05 indicates a significant auto-correlation that cannot be attributed to chance.
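The relationship between the two statistics is easy to see in code. The sketch below computes both from the same auto-correlation coefficients; the function name `q_statistics` and the synthetic data are my own assumptions. The Ljung-Box version weights each squared coefficient by *(n + 2)/(n − lag)*, so it is always slightly larger than the Box-Pierce Q, and the gap shrinks as *n* grows.

```python
import numpy as np

def q_statistics(y, max_lag):
    """Return (Box-Pierce Q, Ljung-Box Q) for lags 1..max_lag."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    denom = np.sum(d * d)
    lags = np.arange(1, max_lag + 1)
    # Sample auto-correlation coefficients r_1 ... r_max_lag
    r = np.array([np.sum(d[k:] * d[:n - k]) / denom for k in lags])

    box_pierce = n * np.sum(r ** 2)
    ljung_box = n * (n + 2) * np.sum(r ** 2 / (n - lags))
    return float(box_pierce), float(ljung_box)

rng = np.random.default_rng(seed=5)
noise = rng.normal(size=5000)   # synthetic white noise (assumed data)

bp, lb = q_statistics(noise, max_lag=40)
print("Box-Pierce Q:", bp)
print("Ljung-Box Q :", lb)
print("ratio       :", lb / bp)
```

For long series the two values are nearly identical, which matches the closeness of the two test statistics reported for the restaurant data set.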

#### Example: Testing for white noise using the Ljung-Box test in Python

Let’s run the Ljung-Box test on the restaurant decibel level data set. We will test up to 40 lags and we’ll ask the test to also run the Box-Pierce test.

```
import statsmodels.stats.diagnostic as diag
diag.acorr_ljungbox(df['Decibel'], lags=[40], boxpierce=True, model_df=0, period=None, return_df=None)
```

We get the following output:

(array([13172.80554476]), array([0.]), array([13156.42074648]), array([0.]))

The value **13172.80554476** is the value of the test statistic for the Ljung-Box test and **0.0** is its p-value as per the Chi-square(k=40) table.

The value **13156.42074648** is the test statistic of the Box-Pierce test and **0.0** is its p-value as per the Chi-square(k=40) table.

As we can see, both p-values are less than 0.01 and so we can say with 99% confidence that the restaurant decibel level time series is not pure white noise.

Earlier on, we introduced Random Walks as a special case of the White Noise model and pointed out how easy it is to mistake them for a pattern or trend that can be predicted.

We’ll look at how to avoid making this mistake by applying a technique that will bring out the true random nature of the Random Walk.

## Detecting Random Walks

Random walks are often highly correlated. In fact, they are auto-correlated white noise!

The white noise detection tests presented above will latch on to these auto-correlations, causing them to conclude that the time series is not white noise.

The remedy is to take the first difference of the time series that is suspected to be a random walk, and run the white noise tests on the differenced series.

If the original time series is a random walk, its first difference is pure white noise.

Let’s illustrate this:

We’ll start by loading a data set that is suspected to be a random walk. The data set can be downloaded from here.

```
df = pd.read_csv('random_walk.csv', header=0, index_col=[0])
# Let’s plot it to see what the data looks like:
df.plot()
plt.show()
```

Let’s run the Ljung-Box white noise test on this data:

```
diag.acorr_ljungbox(df['Y_i'], lags=[40], boxpierce=True)
```

We get the following result:

(array([393833.91252517]), array([0.]), array([392952.07675659]), array([0.]))

The p-value of 0.0 indicates that we must strongly reject the null hypothesis that the data is white noise. *Both the Ljung-Box and Box-Pierce tests think that this data set has **not** been generated by a pure random process.*

This is obviously a false result.

Let’s see if things change after we take the first difference of the data, i.e. we create a new data set with *diff_Y_i = Y_i − Y_(i-1)*:

```
diff_Y_i = df['Y_i'].diff()
#drop the NAN in the first row
diff_Y_i = diff_Y_i.dropna()
#Let’s plot the diff-ed data set
diff_Y_i.plot()
plt.show()
```

A very different picture emerges:

Here is the zoomed in view:

Let’s run the Ljung-Box test on the differenced data set:

```
diag.acorr_ljungbox(diff_Y_i, lags=[40], boxpierce=True)
```

We get the following output:

(array([32.93405364]), array([0.77822417]), array([32.85051846]), array([0.78137548]))

Notice that this time the test statistic’s value, **32.934** as reported by the Ljung-Box test and **32.850** as reported by the Box-Pierce test, is much smaller. The corresponding p-values from the Chi-square(k=40) tables are **0.778** and **0.781** respectively, which are well above 0.05. So we cannot reject the null hypothesis that the data (i.e. the differenced time series) is pure white noise.

*The conclusion to be drawn from this exercise is that one should not fit anything except the White Noise model on this data.*
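The same effect can be reproduced without the CSV file by simulating a random walk and differencing it. In the sketch below the Gaussian steps, the series length and the seed are my assumptions. It compares the LAG-1 auto-correlation before and after differencing against the 95% bound of *1.96/sqrt(n)*.

```python
import numpy as np

def lag1_autocorr(y):
    """LAG-1 sample auto-correlation coefficient r_1."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

rng = np.random.default_rng(seed=1)
steps = rng.normal(size=5000)   # white noise steps (assumed Gaussian)
walk = np.cumsum(steps)         # simulated random walk

# First difference: walk_i - walk_(i-1), which recovers the steps
diffed = np.diff(walk)

r1_walk = lag1_autocorr(walk)
r1_diff = lag1_autocorr(diffed)
bound = 1.96 / np.sqrt(len(diffed))

print("r_1 of the random walk     :", r1_walk)
print("r_1 of the differenced data:", r1_diff)
print("95% bound                  :", bound)
```

The random walk shows an *r_1* close to 1, while the differenced series shows an *r_1* that is tiny by comparison, in line with what the Ljung-Box results above indicate.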

## Summary

- The white noise model can be used to represent the nature of noise in a data set.
- Testing for white noise is one of the first things that a data scientist should do so as to avoid spending time on fitting models on data sets that offer no meaningfully extractable information.
- If a data set is not white noise, then after fitting a model to the data, one should run a white noise test on the residual errors to get a sense for how much information the model has been able to extract from the data.
- For time series data, auto-correlation plots and the Ljung-Box test offer two useful techniques for determining if the time series is, in reality, just white noise.

## Related

How To Isolate Trend, Seasonality And Noise From A Time Series

## References, Citations and Copyrights

### Data sets

Restaurant decibel levels data is copyright Sachin Date under CC-BY-NC-SA. **Data set download link**.

Amgen stock price chart is from stockcharts.com under these terms of use.

### Papers

Anderson, R. L., Distribution of the Serial Correlation Coefficient, *Annals of Mathematical Statistics, Volume 13, Number 1 (1942), 1–13*.

Bartlett, M. S. On the Theoretical Specification and Sampling Properties of Autocorrelated Time-Series. *Supplement to the Journal of the Royal Statistical Society*, vol. 8, no. 1, 1946, pp. 27–41. *JSTOR*, http://www.jstor.org/stable/2983611.

Quenouille, M. H., The Joint Distribution of Serial Correlation Coefficients, *The Annals of Mathematical Statistics, Vol. 20, No. 4 (Dec., 1949), pp. 561–571*.

### Book

Hyndman, R. J., Athanasopoulos, G., Forecasting: Principles and Practice, *OTexts*

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
