14.5 - Advanced Methods

14.5 - Advanced Methods
Note that the material in Sections 14.5.1-14.5.4 is considerably more technical than preceding Lessons. It is offered as an introduction to this advanced topic and, given the technical nature of the material, it could be considered optional in the context of this course.

This section provides a few more advanced techniques used in time series analysis. While they are more peripheral to the autoregressive error structures that we have discussed, they are germane to this lesson since these models are constructed in a regression framework. Moreover, a course in regression is generally a prerequisite for any time series course, so it is helpful to at least provide exposure to some of the more commonly studied time series techniques.


14.5.1 - ARIMA Models

14.5.1 - ARIMA Models

Since many of the time series models have a regression structure, it is beneficial to introduce a general class of time series models called autoregressive integrated moving average or ARIMA models. They are also referred to as Box-Jenkins models, due to the systematic methodology of identifying, fitting, checking, and utilizing ARIMA models, which was popularized by George Box and Gwilym Jenkins in 1976. Before we write down a general ARIMA model, we need to introduce a few additional concepts.

Suppose we have the time series \(Y_{1},Y_{2},\ldots,Y_{t}\). If the value for each Y is determined exactly by a mathematical formula, then the series is said to be deterministic. If the future values of Y can only be described through their probability distribution, then the series is said to be a stochastic process.1 A special class of stochastic processes is a stationary stochastic process, which occurs when the probability distribution for the process is the same for all starting values of t. Such a process is completely defined by its mean, variance, and autocorrelation function. When a time series exhibits nonstationary behavior, then part of our objective will be to transform it into a stationary process.

When stationarity is not an issue, then we can define an autoregressive moving average or ARMA model as follows:

\(\begin{equation*} Y_{t}=\sum_{i=1}^{p}\phi_{i}Y_{t-i}+a_{t}-\sum_{j=1}^{q}\theta_{j}a_{t-j}, \end{equation*} \)

where \(\phi_{1},\ldots,\phi_{p}\) are the autoregressive parameters to be estimated, \(\theta_{1},\ldots,\theta_{q}\) are the moving average parameters to be estimated, and \(a_{1},\ldots,a_{t}\) are a series of unknown random errors (or residuals) that are assumed to follow a normal distribution. This is also referred to as an ARMA(p,q) model. The model can be simplified by introducing the Box-Jenkins backshift operator, which is defined by the following relationship:

\(\begin{equation*} B^{p}X_{t}=X_{t-p}, \end{equation*}\)

such that \(X_{1},\ldots,X_{t}\) is any time series and \(p<t\). Using the backshift notation yields the following:

\(\begin{equation*} \biggl(1-\sum_{i=1}^{p}\phi_{i}B^{i}\biggr)Y_{t}=\biggl(1-\sum_{j=1}^{q}\theta_{j}B^{j}\biggr)a_{t}, \end{equation*}\)

which is often reduced further to

\(\begin{equation*} \phi_{p}(B)Y_{t}=\theta_{q}(B)a_{t}, \end{equation*}\)

where \(\phi_{p}(B)=(1-\sum_{i=1}^{p}\phi_{i}B^{i})\) and \(\theta_{q}(B)=(1-\sum_{j=1}^{q}\theta_{j}B^{j})\).

When time series exhibit nonstationary behavior (which commonly occurs in practice), then the ARMA model presented above can be extended and written using differences, which are defined as follows:

\(\begin{align*} W_{t}&=Y_{t}-Y_{t-1}=(1-B)Y_{t}\\ W_{t}-W_{t-1}&=Y_{t}-2Y_{t-1}+Y_{t-2}\\ &=(1-B)^{2}Y_{t}\\ & \ \ \vdots\\ W_{t}-\sum_{k=1}^{d}W_{t-k}&=(1-B)^{d}Y_{t}, \end{align*}\)

where d is the order of differencing. Replacing \(Y_{t}\) in the ARMA model with the differences defined above yields the formal ARIMA(p,d,q) model:

\(\begin{equation*} \phi_{p}(B)(1-B)^{d}Y_{t}=\theta_{q}(B)a_{t}. \end{equation*}\)

An alternative way to deal with nonstationary behavior is to simply fit a linear trend to the time series and then fit a Box-Jenkins model to the residuals from the linear fit. This will provide a different fit and a different interpretation, but is still a valid way to approach a process exhibiting nonstationarity.

Seasonal time series can also be incorporated into a Box-Jenkins framework. The following general model is usually recommended:

\(\begin{equation*} \phi_{p}(B)\Phi(P)(B^{s})(1-B)^{d}(1-B^{s})^{D}Y_{t}=\theta_{q}(B)\Theta_{Q}(B^{s})a_{t}, \end{equation*}\)

where p, d, and q are as defined earlier, s is a (known) number of seasons per timeframe (e.g., years), D is the order of the seasonal differencing, and P and Q are the autoregressive and moving average orders, respectively, when accounting for the seasonal shift. Moreover, the operator polynomials \(\phi_{p}(B)\) and \(\theta_{q}(B)\) are as defined earlier, while \(\Phi_{P}(B)=(1-\sum_{i=1}^{P}\Phi_{i}B^{s\times i})\) and \(\Theta_{Q}(B)=(1-\sum_{j=1}^{Q}\Theta_{j}B^{s\times j})\). Luckily, the maximum value of p, d, q, P, D, and Q is usually 2, so the resulting expression is relatively simple.

With all of the modeling discussion provided above, we will now provide a brief overview of how to implement the Box-Jenkins methodology in practice, which (fortunately) most statistical software packages will perform for you:

  1. Model Identification: Using plots of the data, autocorrelations, partial autocorrelations, and other information, a class of simple ARIMA models is selected. This amounts to estimating (or guesstimating) an appropriate value for d followed by estimates for p and q (as well as P, D, and Q in the seasonal time series setting).
  2. Model Estimation: The autoregressive and moving average parameters are found via an optimization method like maximum likelihood.
  3. Diagnostic Checking: The fitted model is checked for inadequacies by studying the autocorrelations of the residual series (i.e., the time-ordered residuals).

The above is then iterated until there appears to be minimal to no improvement in the fitted model.

1 It should be noted that stochastic processes is itself a heavily-studied and very important statistical subject.


14.5.2 - Exponential Smoothing

14.5.2 - Exponential Smoothing

The techniques of the previous section can all be used in the context of forecasting, which is the art of modeling patterns in the data that are usually visible in time series plots and then extrapolated into the future.  In this section, we discuss exponential smoothing methods that rely on smoothing parameters, which are parameters that determine how fast the weights of the series decay.  For each of the three methods we discuss below, the smoothing constants are found objectively by selecting those values which minimize one of the three error-size criterion below:

\[\begin{align*} \textrm{MSE}&=\frac{1}{n}\sum_{t=1}^{n}e_{t}^{2}\\ \textrm{MAE}&=\frac{1}{n}\sum_{t=1}^{n}|e_{t}|\\ \textrm{MAPE}&=\frac{100}{n}\sum_{t=1}^{n}\biggl|\frac{e_{t}}{Y_{t}}\biggr|, \end{align*}\]

where the error is the difference between the actual value of the time series at time t and the fitted value at time t, which is determined by the smoothing method employed.  Moreover, MSE is (as usual) the mean square error, MAE is the mean absolute error, and MAPE is the mean absolute percent error.

Exponential smoothing methods also require initialization since the forecast for period one requires the forecast at period zero, which we do not (by definition) have.  Several methods have been proposed for generating starting values.  Most commonly used is the backcasting method, which entails reversing the series so that we forecast into the past instead of into the future.  This produces the required starting value.  Once we have done this, we then switch the series back and apply the exponential smoothing algorithm in the regular manor.

Single exponential smoothing smoothes the data when no trend or seasonal components are present.  The equation for this method is:

\[\begin{equation*} \hat{Y}_{t}=\alpha(Y_{t}+\sum_{i=1}^{r}(1-\alpha)^{i}Y_{t-i}), \end{equation*}\]

where \(\hat{Y}_{t}\) is the forecasted value of the series at time t and \(\alpha\) is the smoothing constant.  Note that \(r<t\), but r does not have to equal \(t-1\).  From the above equation, we see that the method constructs a weighted average of the observations. The weight of each observation decreases exponentially as we move back in time.  Hence, since the weights decrease exponentially and averaging is a form of smoothing, the technique was named exponential smoothing.  An equivalent ARIMA(0,1,1) model can be constructed to represent the single exponential smoother.

Double exponential smoothing (also called Holt's method) smoothes the data when a trend is present.  The double exponential smoothing equations are:

\[\begin{align*} L_{t}&=\alpha Y_{t}+(1-\alpha)(L_{t-1}+T_{t-1})\\ T_{t}&=\beta(L_{t}-L_{t-1})+(1-\beta)T_{t-1}\\ \hat{Y}_{t}&=L_{t-1}+T_{t-1}, \end{align*}\]

where \(L_{t}\) is the level at time t, \(\alpha\) is the weight (or smoothing constant) for the level, \(T_{t}\) is the trend at time t, \(\beta\) is the weight (or smoothing constant) for the trend, and all other quantities are defined as earlier.  An equivalent ARIMA(0,2,2) model can be constructed to represent the double exponential smoother.

Finally, Holt-Winters exponential smoothing smoothes the data when trend and seasonality are present; however, these two components can be either additive or multiplicative.  For the additive model, the equations are:

\[\begin{align*} L_{t}&=\alpha(Y_{t}-S_{t-p})+(1-\alpha)(L_{t-1}+T_{t-1})\\ T_{t}&=\beta(L_{t}-L_{t-1})+(1-\beta)T_{t-1}\\ S_{t}&=\delta(Y_{t}-L_{t})+(1-\delta)S_{t-p}\\ \hat{Y}_{t}&=L_{t-1}+T_{t-1}+S_{t-p}. \end{align*}\]

For the multiplicative model, the equations are:

\[\begin{align*} L_{t}&=\alpha(Y_{t}/S_{t-p})+(1-\alpha)(L_{t-1}+T_{t-1})\\ T_{t}&=\beta(L_{t}-L_{t-1})+(1-\beta)T_{t-1}\\ S_{t}&=\delta(Y_{t}/L_{t})+(1-\delta)S_{t-p}\\ \hat{Y}_{t}&=(L_{t-1}+T_{t-1})S_{t-p}. \end{align*}\]

For both sets of equations, all quantities are the same as they were defined in the previous models, except now we also have that \(S_{t}\) is the seasonal component at time t, \(\delta\) is the weight (or smoothing constant) for the seasonal component, and p is the seasonal period.


14.5.3 - Spectral Analysis

14.5.3 - Spectral Analysis

Suppose we believe that a time series, \(Y_{t}\), contains a periodic (cyclic) component. Spectral analysis takes the approach of specifying a time series as a function of trigonometric components. A natural model of the periodic component would be

\(\begin{equation*} Y_{t}=R\cos(ft+d)+e_{t}, \end{equation*}\)

where R is the amplitude of the variation, f is the frequency of periodic variation1, and d is the phase. Using the trigonometric identity \(\cos(A+B)=\cos(A)cos(B)-sin(A)sin(B)\), we can rewrite the above model as

\(\begin{equation*} Y_{t}=a\cos(ft)+b\sin(ft)+e_{t}, \end{equation*}\)

where \(a=R\cos(d)\) and \(b=-R\sin(d)\). Thus, the above is a multiple regression model with two predictors.

Variation in time series may be modeled as the sum of several different individual waves occurring at different frequencies. The generalization of the above model as a sum of k frequencies is

\(\begin{equation*} Y_{t}=\sum_{j=1}^{k}R_{j}\cos(f_{j}t+d_{j})+e_{t}, \end{equation*}\)

which can be rewritten as

\(\begin{equation*} Y_{t}=\sum_{j=1}^{k}a_{j}\cos(f_{j}t)+\sum_{j=1}^{k}b_{j}\sin(f_{j}t)+e_{t}. \end{equation*}\)

Note that if the \(f_{j}\) values were known constants and we let \(X_{t,r}=\cos(f_{r}t)\) and \(Z_{t,r}=\sin(f_{r}t)\), then the above could be rewritten as the multiple regression model

\(\begin{equation*} Y_{t}=\sum_{j=1}^{k}a_{j}X_{t,j}+\sum_{j=1}^{k}b_{j}Z_{t,j}+e_{t}. \end{equation*}\)

The above is an example of a harmonic regression.

Fourier analysis is the study of approximating functions using the sum of sine and cosine terms. Such a sum is called the Fourier series representation of the function. Spectral analysis is identical to Fourier analysis, except that instead of approximating a function, the sum of sine and cosine terms approximates a time series that includes a random component. Moreover, the coefficients in the harmonic regression (i.e., the a's and b's) may be estimated using multiple regression techniques.

Figures can also be constructed to help in the spectral analysis of a time series. While we do not develop the details here, the basic methodology consists of partitioning the total sum of squares into quantities associated with each frequency (like an ANOVA). From these quantities, histograms of the frequency (or wavelength) can be constructed, which are called periodograms. A smoothed version of the periodogram, called a spectral density, can also be constructed and is generally preferred to the periodogram.


14.5.4 - Generalized Least Squares

14.5.4 - Generalized Least Squares

Weighted least squares can also be used to reduce autocorrelation by choosing an appropriate weighting matrix. In fact, the method used is more general than weighted least squares. The method of weighted least squares uses a diagonal matrix to help correct for non-constant variance. However, in a model with correlated errors, the errors have a more complicated variance-covariance structure (such as \(\Sigma(1)\) given earlier for the AR(1) model). Thus the weighting matrix for the more complicated variance-covariance structure is non-diagonal and utilizes the method of generalized least squares, of which weighted least squares is a special case. In particular, when \(\mbox{Var}(\textbf{Y})=\mbox{Var}(\pmb{\epsilon})=\Omega\), the objective is to find a matrix \(\Lambda\) such that:

\(\begin{equation*} \mbox{Var}(\Lambda\textbf{Y})=\Lambda\Omega\Lambda^{\textrm{T}}=\sigma^{2}\textbf{I}_{n\times n}, \end{equation*}\)

The generalized least squares estimator (sometimes called the Aitken estimator) take\(s \Lambda=\sigma\Omega^{1/2}\) and is given by

\(\begin{align*} \hat{\beta}_{\textrm{GLS}}&=\arg\min_{\beta}\|\Lambda(\textbf{Y}-\textbf{X}\beta)\|^{2} \\ &=(\textbf{X}^{\textrm{T}}\Omega^{-1}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}\Omega\textbf{Y}. \end{align*}\)

There is no optimal way of choosing such a weighting matrix \(\Omega\). \(\Omega\) contains \(n(n+1)/2\) parameters, which is too many parameters to estimate, so restrictions must be imposed. The estimate will heavily depend on these assumed restrictions and thus makes the method of generalized least squares difficult to use unless you are savvy (and lucky) enough to choose helpful restrictions.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility