# 3.2 Diagnostics

## Analyzing possible statistical significance of autocorrelation values

The Ljung-Box statistic, also called the modified Box-Pierce statistic, is a function of the accumulated sample autocorrelations, $$r_j$$, up to any specified time lag $$m$$. As a function of $$m$$, it is determined as:

$$Q(m) = n(n+2)\sum_{j=1}^{m}\frac{r^2_j}{n-j},$$

where $$n$$ = the number of usable data points after any differencing operations. (Please visit forvo for the proper pronunciation of Ljung.)

As an example,

$$Q(3) = n(n+2)\left(\frac{r^2_1}{n-1}+\frac{r^2_2}{n-2}+\frac{r^2_3}{n-3}\right)$$
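As a check on the formula, here is a minimal Python sketch that computes the sample autocorrelations and $$Q(m)$$ directly from the definition (the function names `acf` and `ljung_box` are just illustrative, not from any particular package):

```python
import random

def acf(x, m):
    """Sample autocorrelations r_1, ..., r_m, using the usual
    sum-over-n denominator and the overall mean."""
    n = len(x)
    xbar = sum(x) / n
    c0 = sum((v - xbar) ** 2 for v in x)
    r = []
    for j in range(1, m + 1):
        cj = sum((x[t] - xbar) * (x[t + j] - xbar) for t in range(n - j))
        r.append(cj / c0)
    return r

def ljung_box(x, m):
    """Q(m) = n(n+2) * sum_{j=1..m} r_j^2 / (n - j)."""
    n = len(x)
    r = acf(x, m)
    return n * (n + 2) * sum(rj ** 2 / (n - j) for j, rj in enumerate(r, start=1))

random.seed(1)
wn = [random.gauss(0, 1) for _ in range(200)]  # white noise: Q(m) should be small
print(round(ljung_box(wn, 3), 3))
```

For white noise, the printed $$Q(3)$$ should be unremarkable relative to a $$\chi^2$$ distribution; a large value would flag non-zero autocorrelation somewhere in the first three lags.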

#### Use of the Statistic

This statistic can be used to examine residuals from a time series model in order to see if all underlying population autocorrelations for the errors may be 0 (up to a specified point).

For nearly all models that we consider in this course, the residuals are assumed to be “white noise,” meaning that they are independently and identically distributed. Thus, as we saw last week, the ideal ACF for residuals is that all autocorrelations are 0, which means that $$Q(m)$$ should be near 0 for any lag $$m$$. A significant $$Q(m)$$ for residuals indicates a possible problem with the model.

(Remember $$Q(m)$$ measures accumulated autocorrelation up to lag $$m$$.)

#### Distribution of $$Q(m)$$

There are two cases:

1. When the $$r_j$$ are sample autocorrelations for residuals from a time series model, the null hypothesis distribution of $$Q(m)$$ is approximately a $$\chi^2$$ distribution with df = $$m-p-q-1$$, where $$p+q+1$$ is the number of coefficients in the model, including a constant.
Note!
$$m$$ is the lag to which we’re accumulating, so in essence, the statistic is not defined until $$m > p+q+1$$.
2. When no model has been used, so that the ACF is for raw data, the null distribution of $$Q(m)$$ is approximately a $$\chi^2$$ distribution with df = $$m$$.

#### p-Value Determination

In both cases, a p-value is calculated as the probability past $$Q(m)$$ in the relevant distribution. A small p-value (for instance, p-value < .05) indicates the possibility of non-zero autocorrelation within the first $$m$$ lags.

## Example 3-3

Below is Minitab output for the Lake Erie level data that were used for homework 1 and in Lesson 3.1. A useful model is an AR(1) with a constant, so $$p+q+1=1+0+1=2.$$

Final Estimates of Parameters

| Type | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| AR 1 | 0.7078 | 0.1161 | 6.10 | 0.000 |
| Constant | 4.2761 | 0.1953 | 21.89 | 0.000 |
| Mean | 14.6349 | 0.6684 | | |

Modified Box-Pierce (Ljung-Box) Chi-Square statistic

| Lag | Chi-Square | DF | P-Value |
| --- | --- | --- | --- |
| 12 | 9.4 | 10 | 0.493 |
| 24 | 23.2 | 22 | 0.39 |
| 36 | 30 | 34 | 0.662 |
| 48 | * | * | * |

Minitab gives p-values for accumulated lags that are multiples of 12. The R sarima command will give a graph that shows p-values of the Ljung-Box-Pierce tests for each lag (in steps of 1) up to a certain lag, usually up to lag 20 for nonseasonal models.

#### Interpretation of the Box-Pierce Results

Notice that the p-values for the modified Box-Pierce are all well above .05, indicating “non-significance.” This is a desirable result. Remember that there are only 40 data values, so there’s not much data contributing to correlations at high lags. Thus, the results for $$m = 24$$ and $$m = 36$$ may not be meaningful.
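The lag-12 entry can be verified by hand. When the degrees of freedom are even, the chi-square survival function has a simple closed form, so no statistical library is needed. The sketch below (function name illustrative) reproduces the Minitab p-value, up to the rounding of $$Q(12)$$ to 9.4 in the output:

```python
import math

def chisq_sf_even_df(x, df):
    """Chi-square survival function P(X > x), valid when df is EVEN:
    P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    if df % 2 != 0 or df <= 0:
        raise ValueError("closed form requires even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# AR(1) with constant fit to the Lake Erie data: p + q + 1 = 2,
# so at lag m = 12 the df is 12 - 1 - 0 - 1 = 10.
p_value = chisq_sf_even_df(9.4, 10)  # Minitab reported Q(12) = 9.4
print(round(p_value, 3))
```

The result, about 0.495, agrees with Minitab’s 0.493 once you account for $$Q(12)$$ being rounded in the printed output.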

## Graphs of ACF values

When you request a graph of the ACF values, "significance" limits are shown by R and by Minitab. In general, the limits for the autocorrelation are placed at $$0 \pm 2$$ standard errors of $$r_k$$. The formula used for the standard error depends upon the situation.

- Within the ACF of residuals as part of the ARIMA routine, the standard errors are determined assuming the residuals are white noise. The approximate formula for any lag is that the s.e. of $$r_k = 1/\sqrt{n}$$.
- For the ACF of raw data (the ACF command), the standard error at lag $$k$$ is found as if the right model were an MA($$k-1$$). This allows the possible interpretation that if all autocorrelations past a certain lag are within the limits, the model might be an MA of order defined by the last significant autocorrelation.
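Both conventions can be sketched in a few lines. The white-noise limit is $$\pm 2/\sqrt{n}$$; for the raw-data case the standard large-sample result is Bartlett's formula, s.e.$$(r_k) = \sqrt{(1 + 2\sum_{j=1}^{k-1} r_j^2)/n}$$ (the exact formula a given package uses should be checked against its documentation):

```python
import math

n = 40  # length of the Lake Erie series

# Case 1: residual ACF under the white-noise assumption.
# s.e.(r_k) ~ 1/sqrt(n) at every lag, so the limits sit at +/- 2/sqrt(n).
white_noise_limit = 2 / math.sqrt(n)
print(round(white_noise_limit, 3))  # 0.316

# Case 2: raw-data ACF, standard error at lag k computed as if the
# true model were MA(k-1) (Bartlett's large-sample formula).
def bartlett_se(r, k, n):
    """s.e. of r_k given lower-lag autocorrelations r = [r_1, ..., r_{k-1}]."""
    return math.sqrt((1 + 2 * sum(rj ** 2 for rj in r[:k - 1])) / n)

print(round(bartlett_se([0.5], 2, n), 3))  # wider than the white-noise limit implies
```

Note that Bartlett's formula reduces to $$1/\sqrt{n}$$ at lag 1, and widens as significant lower-lag autocorrelations accumulate.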

## Appendix: Standardized Residuals

What are standardized residuals in a time series framework? One of the things that we need to look at when we examine the diagnostics from a regression fit is a graph of the standardized residuals. Let's review what this is for ordinary regression, where the error standard deviation is $$\sigma$$. The standardized residual at observation $$i$$,

$$\dfrac{y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}}{\sigma},$$

should be N(0, 1). We hope to see normality when we look at the diagnostic plots. Another way to think about this is:

$$\dfrac{y_i - \beta_0 -\sum_{j=1}^{p}\beta_j x_{ij}}{\sigma} \approx \dfrac{y_i - \widehat{y}_i}{\sqrt{Var(y_i - \widehat{y}_i)}}.$$

Now, with time series things are very similar:

$$\dfrac{x_t-\tilde{x_t}}{\sqrt{P^{t-1}_t}},$$

where

$$\tilde{x_t} = E(x_t|x_{t-1}, x_{t-2}, \dots) \text{ and } P^{t-1}_t = E\left[(x_t-\tilde{x_t})^2\right].$$

This is where the standardized residuals come from. This is also essentially how a time series is fit using R. We want to minimize the sums of these squared values:

$$\sum_{t=1}^{n}\left(\dfrac{x_t - \tilde{x_t}}{\sqrt{P^{t-1}_t}}\right)^2$$

(In reality, it is slightly more complicated. The negative log-likelihood function is minimized, and this sum of squares is one term of that function.)
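To make the time-series version concrete, here is a small simulation sketch for an AR(1) with a constant, using assumed-known parameters (the values of $$\phi$$, $$\delta$$, and $$\sigma$$ below are illustrative, not estimates from any data): the one-step prediction is $$\tilde{x}_t = \delta + \phi x_{t-1}$$ with $$P^{t-1}_t = \sigma^2$$, so the standardized residuals should behave like N(0, 1) draws.

```python
import random

# AR(1) with constant: x_t = delta + phi * x_{t-1} + w_t,  w_t ~ N(0, sigma^2).
# One-step prediction: x~_t = delta + phi * x_{t-1}, and P^{t-1}_t = sigma^2.
random.seed(42)
phi, delta, sigma = 0.7, 4.3, 1.0  # illustrative values, not fitted estimates

x = [delta / (1 - phi)]  # start the simulation at the process mean
for _ in range(199):
    x.append(delta + phi * x[-1] + random.gauss(0, sigma))

# Standardized one-step residuals: with the true parameters these are
# exactly the N(0, 1) shocks, so their sample mean should be near 0.
std_resid = [(x[t] - (delta + phi * x[t - 1])) / sigma for t in range(1, len(x))]
print(round(sum(std_resid) / len(std_resid), 2))
```

In practice $$\phi$$, $$\delta$$, and $$\sigma$$ come from the fit, so the standardized residuals are only approximately N(0, 1); that approximation is what the diagnostic plots assess.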
