10.5 - Information Criteria and PRESS

To compare regression models, some statistical software also reports statistics referred to as information criteria. For regression models, these statistics combine information about the SSE, the number of parameters in the model, and the sample size. A lower value, relative to the values for other candidate models, indicates a better model. Some data analysts feel that these statistics give a more realistic comparison of models than the \(C_p\) statistic because \(C_p\) tends to make models seem more different than they actually are.

The three information criteria that we present are Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), which is sometimes called Schwarz's Bayesian Criterion (SBC), and Amemiya's Prediction Criterion (APC). The respective formulas are as follows:

\(AIC_{p} = n\,\ln(SSE) - n\,\ln(n) + 2p\)

\(BIC_{p} = n\,\ln(SSE) - n\,\ln(n) + p\,\ln(n)\)

\(APC_{p} = \dfrac{(n + p)}{n(n - p)}SSE\)

In the formulas, \(n\) is the sample size and \(p\) is the number of regression coefficients in the model being evaluated (including the intercept). Notice that the only difference between AIC and BIC is the multiplier of \(p\), the number of parameters: 2 for AIC and \(\ln(n)\) for BIC. Each of the information criteria is used in a similar way: when comparing two models, the model with the lower value is preferred.
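As a quick illustration of how the three formulas are applied, the following Python sketch evaluates them directly from SSE, \(n\), and \(p\). The function name `information_criteria` and the SSE values in the usage lines are hypothetical, chosen only for illustration; they do not come from any example in this lesson.

```python
import math


def information_criteria(sse, n, p):
    """Return (AIC, BIC, APC) using the formulas given in this section.

    sse: error sum of squares of the fitted model (must be positive)
    n:   sample size
    p:   number of regression coefficients, including the intercept
    """
    if sse <= 0 or n <= p:
        raise ValueError("need sse > 0 and n > p")
    common = n * math.log(sse) - n * math.log(n)  # term shared by AIC and BIC
    aic = common + 2 * p
    bic = common + p * math.log(n)
    apc = (n + p) / (n * (n - p)) * sse
    return aic, bic, apc


# Hypothetical comparison: two candidate models fit to the same n = 20 observations.
small = information_criteria(sse=50.0, n=20, p=3)
large = information_criteria(sse=48.0, n=20, p=4)
for name, values in (("p = 3", small), ("p = 4", large)):
    print(name, [round(v, 3) for v in values])
```

With these made-up numbers, the extra predictor lowers SSE only slightly, so all three criteria come out lower for the smaller model.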

The BIC places a heavier penalty on the number of parameters than the AIC whenever \(\ln(n) > 2\) (that is, for \(n \geq 8\)), so it tends to favor more parsimonious (smaller) models. This addresses one criticism of the AIC, namely that it tends to overfit.

The prediction sum of squares (or PRESS) is a model validation method used to assess a model's predictive ability; it can also be used to compare regression models. For a data set of size \(n\), PRESS is calculated by omitting each observation in turn, fitting the regression equation to the remaining \(n - 1\) observations, and using that equation to predict the omitted response. Recall that we denote this predicted value by \(\hat{y}_{i(i)}\). We then calculate the \(i^{\textrm{th}}\) PRESS residual as the difference \(y_{i}-\hat{y}_{i(i)}\). The formula for PRESS is then

\(\begin{equation*} \textrm{PRESS}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i(i)})^{2}. \end{equation*}\)

In general, the smaller the PRESS value, the better the model's predictive ability.
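The leave-one-out procedure described above can be sketched in Python as follows. This is a minimal illustration that assumes NumPy is available and refits the model \(n\) times with ordinary least squares; the function names and the three-point data set are hypothetical and are not taken from the lesson's examples.

```python
import numpy as np


def press_residuals(X, y):
    """Leave-one-out PRESS residuals y_i - yhat_{i(i)}, found by explicit refitting.

    X: predictor values, shape (n,) or (n, k), without an intercept column
    y: response values, shape (n,)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    residuals = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # drop observation i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        residuals[i] = y[i] - design[i] @ beta  # predict the omitted response
    return residuals


def press(X, y):
    """Sum of squared PRESS residuals."""
    return float(np.sum(press_residuals(X, y) ** 2))


# Tiny made-up data set, small enough to verify by hand.
x = [0.0, 1.0, 2.0]
y = [0.0, 0.0, 3.0]
print(np.round(press_residuals(x, y), 6), round(press(x, y), 6))
```

For this toy data the PRESS residuals work out by hand to 3, -1.5, and 3, so PRESS is 20.25. Refitting \(n\) times is the most literal reading of the definition; it is not necessarily how statistical software computes PRESS.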

PRESS can also be used to calculate the predicted \(R^{2}\) (denoted by \(R^{2}_{pred}\)) which is generally more intuitive to interpret than PRESS itself. It is defined as

\(\begin{equation*} R^{2}_{pred}=1-\dfrac{\textrm{PRESS}}{\textrm{SSTO}} \end{equation*}\)

and is a helpful way to validate the predictive ability of your model without collecting another sample or splitting the data into training and validation sets (see Section 10.6). Together, PRESS and \(R^{2}_{pred}\) can help prevent overfitting because both are calculated using observations not included in the model estimation. Overfitting refers to models that appear to provide a good fit for the data set at hand, but fail to provide valid predictions for new observations.

You may notice that \(R^{2}\) and \(R^{2}_{pred}\) are similar in form. The two will not be equal, and when \(R^{2}\) is quite high relative to \(R^{2}_{pred}\), this suggests that the fitted model is overfitting the sample data. However, unlike \(R^{2}\), \(R^{2}_{pred}\) can take values below 0 as well as values up to 1. \(R^{2}_{pred}<0\) occurs when PRESS exceeds SSTO. In such a case, we can simply truncate \(R^{2}_{pred}\) at 0.
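To see \(R^{2}\) and \(R^{2}_{pred}\) side by side, including a case where \(R^{2}_{pred}\) falls below 0, here is a short Python sketch. It assumes NumPy and, to avoid refitting the model \(n\) times, relies on the standard least squares identity \(y_{i}-\hat{y}_{i(i)} = e_{i}/(1-h_{ii})\), where \(e_{i}\) is the ordinary residual and \(h_{ii}\) is the leverage; that identity is not stated in this section, so treat it as an added assumption. The function name and data are hypothetical.

```python
import numpy as np


def r_squared_and_predicted(X, y):
    """Return (R^2, R^2_pred) for a least squares fit with an intercept.

    Assumes the design matrix has full column rank and no leverage equals 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    q, _ = np.linalg.qr(design)            # orthonormal basis of the column space
    leverage = np.sum(q ** 2, axis=1)      # diagonal of the hat matrix, h_ii
    residuals = y - q @ (q.T @ y)          # ordinary residuals e_i
    press = np.sum((residuals / (1.0 - leverage)) ** 2)
    sse = np.sum(residuals ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    return float(1.0 - sse / ssto), float(1.0 - press / ssto)


# Tiny made-up data set: three points, so each point has a large influence.
r2, r2_pred = r_squared_and_predicted([0.0, 1.0, 2.0], [0.0, 0.0, 3.0])
print(round(r2, 4), round(r2_pred, 4), max(0.0, r2_pred))
```

For this toy data, SSE is 1.5, PRESS is 20.25, and SSTO is 6, giving \(R^{2} = 0.75\) but \(R^{2}_{pred} = -2.375\), which would be reported as 0 after truncation.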

Finally, if the PRESS value appears to be large because of a few outliers, then a variation on PRESS (using the absolute value as a measure of distance) may also be calculated:

\(\begin{equation*} \textrm{PRESS}^{*}=\sum_{i=1}^{n}|y_{i}-\hat{y}_{i(i)}|, \end{equation*}\)

which also leads to

\(\begin{equation*} R^{2*}_{pred}=1-\dfrac{\textrm{PRESS}^{*}}{\textrm{SSTO}}. \end{equation*}\)
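The following Python sketch illustrates why the absolute-value version is less sensitive to a few outliers. It takes a list of PRESS residuals as given; the function name and the residual values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def press_and_press_star(press_residuals):
    """Return (PRESS, PRESS*) from a sequence of PRESS residuals y_i - yhat_{i(i)}."""
    press = sum(r * r for r in press_residuals)        # squared distances
    press_star = sum(abs(r) for r in press_residuals)  # absolute distances
    return press, press_star


# Hypothetical PRESS residuals; the last one plays the role of an outlier.
residuals = [1.0, -1.0, 2.0, -10.0]
press, press_star = press_and_press_star(residuals)

# Share of each total that is due to the single outlying residual.
share_squared = residuals[-1] ** 2 / press
share_absolute = abs(residuals[-1]) / press_star
print(press, press_star, round(share_squared, 3), round(share_absolute, 3))
```

Here PRESS is 106 and \(\textrm{PRESS}^{*}\) is 14. The outlying residual accounts for about 94% of PRESS but only about 71% of \(\textrm{PRESS}^{*}\).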

Try It!

Predicted \(R^{2}\)

Review the examples in Section 10.4 to see how the various models compare with respect to the Predicted \(R^{2}\) criterion.

The Best Subsets output for Example 10-4

Response is y

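| Vars | R-Sq | R-Sq (adj) | R-Sq (pred) | Mallows Cp | S | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 67.5 | 64.5 | 56.0 | 138.7 | 8.9639 |  |  |  | X |
| 1 | 66.6 | 63.6 | 55.7 | 142.5 | 9.0771 |  | X |  |  |
| 2 | 97.9 | 97.4 | 96.5 | 2.7 | 2.4063 | X | X |  |  |
| 2 | 97.2 | 96.7 | 95.5 | 5.5 | 2.7343 | X |  |  | X |
| 3 | 98.2 | 97.6 | 96.9 | 3.0 | 2.3087 | X | X |  | X |
| 3 | 98.2 | 97.6 | 96.7 | 3.0 | 2.3232 | X | X | X |  |
| 4 | 98.2 | 97.4 | 95.9 | 5.0 | 2.4460 | X | X | X | X |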

The highest value of R-Sq (pred) is 96.9, which occurs for the model with \(x_1\), \(x_2\), and \(x_4\). However, recall that this model exhibits substantial multicollinearity, so the simpler model with just \(x_1\) and \(x_2\), which has an R-Sq (pred) of 96.5, is preferable.

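The Best Subsets output for Example 10-5

Best Subsets Regressions: PIQ versus Brain, Height, Weight

Response is PIQ

| Vars | R-Sq | R-Sq (adj) | R-Sq (pred) | Mallows Cp | S | Brain | Height | Weight |
|---|---|---|---|---|---|---|---|---|
| 1 | 14.3 | 11.9 | 4.66 | 7.3 | 21.212 | X |  |  |
| 1 | 0.9 | 0.0 | 0.0 | 13.8 | 22.810 |  | X |  |
| 2 | 29.5 | 25.5 | 17.6 | 2.0 | 19.510 | X | X |  |
| 2 | 19.3 | 14.6 | 5.9 | 6.9 | 20.878 | X |  | X |
| 3 | 29.5 | 23.3 | 12.8 | 4.0 | 19.794 | X | X | X |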

Based on R-Sq (pred), the best model is the one containing Brain and Height, which has the highest value (17.6).

Best Subsets output from Example 10-6

Best Subsets Regressions: BP versus Age, Weight, BSA, Dur, Pulse, Stress

Response is BP

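| Vars | R-Sq | R-Sq (adj) | R-Sq (pred) | Mallows Cp | S | Age | Weight | BSA | Dur | Pulse | Stress |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 90.3 | 89.7 | 88.5 | 312.8 | 1.7405 |  | X |  |  |  |  |
| 1 | 75.0 | 73.6 | 69.5 | 829.1 | 2.7903 |  |  | X |  |  |  |
| 2 | 99.1 | 99.0 | 98.9 | 15.1 | 0.53269 | X | X |  |  |  |  |
| 2 | 92.0 | 91.0 | 89.3 | 256.6 | 1.6246 |  | X |  |  |  | X |
| 3 | 99.5 | 99.4 | 99.2 | 6.4 | 0.43705 | X | X | X |  |  |  |
| 3 | 99.2 | 99.1 | 98.8 | 14.1 | 0.52012 | X | X |  |  | X |  |
| 4 | 99.5 | 99.4 | 99.2 | 6.4 | 0.42591 | X | X | X | X |  |  |
| 4 | 99.5 | 99.4 | 99.1 | 7.1 | 0.43500 | X | X | X |  |  | X |
| 5 | 99.6 | 99.4 | 99.1 | 7.0 | 0.42142 | X | X | X |  | X | X |
| 5 | 99.5 | 99.4 | 99.2 | 7.7 | 0.43078 | X | X | X | X | X |  |
| 6 | 99.6 | 99.4 | 99.1 | 7.0 | 0.40723 | X | X | X | X | X | X |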

Based on R-Sq (pred), the best models are the three that share the highest value (99.2): the one containing Age, Weight, and BSA; the one containing Age, Weight, BSA, and Duration; and the one containing Age, Weight, BSA, Duration, and Pulse.
