11.5 - Information Criteria and PRESS

To compare regression models, some statistical software may also give values of statistics referred to as information criterion statistics. For regression models, these statistics combine information about the SSE, number of parameters in the model, and the sample size. A low value, compared to values for other possible models, is good. Some data analysts feel that these statistics give a more realistic comparison of models than the Cp statistic because Cp tends to make models seem more different than they actually are.

The three information criteria that we present are Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), which is sometimes called Schwarz’s Bayesian Criterion (SBC), and Amemiya’s Prediction Criterion (APC). The respective formulas are as follows:

$AIC_{k} = n\,\ln(SSE) - n\,\ln(n) + 2(k+1)$

$BIC_{k} = n\,\ln(SSE) - n\,\ln(n) + (k+1)\,\ln(n)$

$APC_{k} = \frac{n+k+1}{n(n-k-1)}\,SSE$

In the formulas, n = sample size and k = number of predictor terms (so k+1 = number of regression parameters in the model being evaluated, including the intercept). Notice that the only difference between AIC and BIC is the multiplier of (k+1), the number of parameters: 2 for AIC versus ln(n) for BIC. Each of the information criteria is used in a similar way: when comparing two models fit to the same data, the model with the lower value is preferred.
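As a minimal sketch, the three formulas can be evaluated directly from SSE, n, and k. The function name `information_criteria` and the SSE values below are hypothetical and chosen only for illustration; they do not come from any data set in this lesson.

```python
import math

def information_criteria(sse, n, k):
    """Return (AIC, BIC, APC) for a model with k predictor terms plus an intercept."""
    p = k + 1  # number of regression parameters, including the intercept
    base = n * math.log(sse) - n * math.log(n)  # term shared by AIC and BIC
    aic = base + 2 * p
    bic = base + p * math.log(n)
    apc = (n + p) / (n * (n - p)) * sse
    return aic, bic, apc

# Hypothetical SSE values for two candidate models fit to the same n = 50 observations.
small = information_criteria(sse=120.0, n=50, k=2)
large = information_criteria(sse=115.0, n=50, k=5)

for name, (aic, bic, apc) in [("k = 2", small), ("k = 5", large)]:
    print(f"{name}: AIC = {aic:.3f}, BIC = {bic:.3f}, APC = {apc:.3f}")
```

Statistical software may report AIC and BIC on a different scale (for example, based on the full log-likelihood), so values computed this way should only be compared with each other, not with output from another source.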

The BIC places a higher penalty on the number of parameters in the model, so it tends to reward more parsimonious (smaller) models. This addresses one criticism of AIC, namely that it tends to favor models that overfit.
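To see how large a sample is needed for the BIC penalty to exceed the AIC penalty, subtract the two formulas. The terms involving SSE cancel, leaving

$BIC_{k} - AIC_{k} = (k+1)\,\ln(n) - 2(k+1) = (k+1)\,(\ln(n) - 2)$

This difference is positive whenever $\ln(n) > 2$, that is, whenever $n > e^{2} \approx 7.39$. So for any sample with n ≥ 8, BIC charges more per parameter than AIC, and the gap grows with the sample size.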

The prediction sum of squares (or PRESS) is a model validation method used to assess a model's predictive ability, and it can also be used to compare regression models. For a data set of size n, PRESS is calculated by omitting each observation in turn, fitting the regression equation to the remaining n - 1 observations, and using that equation to predict the omitted response value (a prediction which, recall, we denote by $\hat{y}_{i(i)}$). We then calculate the $i^{\textrm{th}}$ PRESS residual as the difference $y_{i}-\hat{y}_{i(i)}$. The formula for PRESS is then given by

$\begin{equation*} \textrm{PRESS}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i(i)})^{2}. \end{equation*}$

In general, the smaller the PRESS value, the better the model's predictive ability.
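The leave-one-out procedure described above can be sketched as an explicit loop. This is an illustrative implementation using NumPy's least-squares solver; the function name `press_statistic` and the simulated data are assumptions made for the example.

```python
import numpy as np

def press_statistic(X, y):
    """PRESS by explicit leave-one-out refitting.

    X holds the predictor columns only; an intercept column is added here.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # intercept plus predictors
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # drop observation i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        y_hat_i = design[i] @ beta  # prediction for the omitted observation
        press += (y[i] - y_hat_i) ** 2  # squared PRESS residual
    return press

# Simulated (hypothetical) data: 30 observations, two predictors.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.3, size=30)

press = press_statistic(x, y)
print(f"PRESS = {press:.4f}")
```

Refitting n times is fine for small data sets and keeps the code close to the definition; it is not meant as an efficient implementation.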

PRESS can also be used to calculate the predicted $R^{2}$ (denoted by $R^{2}_{pred}$) which is generally more intuitive to interpret than PRESS itself. It is defined as

$\begin{equation*} R^{2}_{pred}=1-\frac{\textrm{PRESS}}{\textrm{SSTO}} \end{equation*}$

and is a helpful way to validate the predictive ability of your model without selecting another sample or splitting the data into training and validation sets (see Section 11.7). Together, PRESS and $R^{2}_{pred}$ can help prevent overfitting because both are calculated using observations not included in the model estimation. Overfitting refers to models that appear to provide a good fit for the data set at hand, but fail to provide valid predictions for new observations.

You may notice that $R^{2}$ and $R^{2}_{pred}$ are similar in form. They will generally not be equal to each other, and an $R^{2}$ that is quite high relative to $R^{2}_{pred}$ suggests that the fitted model is overfitting the sample data. Unlike $R^{2}$, however, $R^{2}_{pred}$ can take values below 0. $R^{2}_{pred}<0$ occurs when the underlying PRESS is inflated beyond the level of the SSTO. In such a case, we can simply truncate $R^{2}_{pred}$ at 0.
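The following sketch computes $R^{2}_{pred}$ from the observed responses and their leave-one-out predictions, with optional truncation at 0. The function name `predicted_r2` and both sets of predictions are hypothetical; the second set is deliberately poor so that PRESS exceeds SSTO.

```python
import numpy as np

def predicted_r2(y, loo_predictions, truncate=False):
    """Predicted R^2 = 1 - PRESS / SSTO, optionally truncated at 0."""
    y = np.asarray(y, dtype=float)
    loo = np.asarray(loo_predictions, dtype=float)
    press = np.sum((y - loo) ** 2)      # sum of squared PRESS residuals
    ssto = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2_pred = 1.0 - press / ssto
    return max(r2_pred, 0.0) if truncate else r2_pred

y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical leave-one-out predictions that track the responses closely.
good = predicted_r2(y_obs, [1.5, 1.8, 3.1, 4.4, 4.6])

# Hypothetical predictions so poor that PRESS exceeds SSTO.
bad_raw = predicted_r2(y_obs, [5.0, 4.0, 3.0, 2.0, 1.0])
bad_truncated = predicted_r2(y_obs, [5.0, 4.0, 3.0, 2.0, 1.0], truncate=True)

print(good, bad_raw, bad_truncated)
```

Leaving truncation off by default keeps the raw value visible, which can be informative when diagnosing a badly overfit model.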

Finally, if the PRESS value appears to be large due to a few outliers, then a variation on PRESS (using the absolute value as a measure of distance) may also be calculated:

$\begin{equation*} \textrm{PRESS}^{*}=\sum_{i=1}^{n}|y_{i}-\hat{y}_{i(i)}|, \end{equation*}$

$\begin{equation*} R^{2*}_{pred}=1-\frac{\textrm{PRESS}^{*}}{\textrm{SSTO}}. \end{equation*}$
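A small numerical sketch shows why the absolute-value version is less sensitive to outliers. The PRESS residuals below are hypothetical values chosen so that one observation is far from its leave-one-out prediction.

```python
import numpy as np

# Hypothetical PRESS residuals y_i - yhat_{i(i)}; the last one is an outlier.
press_residuals = np.array([0.5, -0.3, 0.2, -0.4, 6.0])

press = np.sum(press_residuals ** 2)           # squared-distance version
press_star = np.sum(np.abs(press_residuals))   # absolute-distance version

# Fraction of each statistic contributed by the outlying observation alone.
share_squared = press_residuals[-1] ** 2 / press
share_absolute = abs(press_residuals[-1]) / press_star

print(f"PRESS  = {press:.2f}, outlier share = {share_squared:.3f}")
print(f"PRESS* = {press_star:.2f}, outlier share = {share_absolute:.3f}")
```

Squaring magnifies the large residual, so it accounts for a larger fraction of PRESS than of PRESS*.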