3a.3 - Dummy Variable Regression

The GLM can be viewed from the regression perspective as an ordinary multiple linear regression (MLR) with ‘dummy’ coding (actually indicator coding) for the categorical treatment levels. Typically software performing the MLR will automatically include an intercept, which complicates the interpretation of the regression coefficients.

In IML, we now replace the design matrix with:

/* Dummy Variable Regression Model */
/* Columns: intercept, indicator for level1, indicator for level2 */
x = {
1 1 0,
1 1 0,
1 0 1,
1 0 1,
1 0 0,
1 0 0};

We can see something strange with this design matrix. Although we have three treatment levels in this example, we have a column of 1s for the intercept and only two columns that have been indicator coded for treatment levels. The reason for this is that the complete matrix

\(\begin{bmatrix}
1 & 1 & 0 & 0\\
1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 1
\end{bmatrix}\)

has the property that columns 2-4 sum to the first column (the intercept). This linear dependence makes \(X'X\) singular, so it cannot be inverted and the matrix computations will not run. One of the treatment levels is therefore omitted from the coding in our design matrix above for IML. It doesn’t matter which one we eliminate, although some refer to the eliminated level as a ‘reference’ level. Unlike in logistic regression, the treatment level that we eliminate from the coding here is not meant to act as a reference level for comparing the other treatment levels; dropping it is simply an algebraic manipulation, and the fitted treatment means and the ANOVA table are the same whichever level is dropped. In our design matrix for IML we have eliminated the indicator coding for treatment level3.
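The singularity can be checked numerically. The following is a minimal Python sketch (not the course's IML code) that forms \(X'X\) for the complete four-column matrix and for the three-column matrix used in IML, then compares their determinants. The helper names `gram` and `determinant` are illustrative only.

```python
from fractions import Fraction

def gram(x):
    """Return X'X for a matrix stored as a list of rows."""
    cols = list(zip(*x))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def determinant(m):
    """Determinant by Gaussian elimination with exact fractions."""
    a = [[Fraction(v) for v in row] for row in m]
    n = len(a)
    det = Fraction(1)
    for k in range(n):
        # Find a row at or below k with a nonzero entry in column k.
        pivot = next((r for r in range(k, n) if a[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)          # no pivot: the matrix is singular
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det                  # a row swap flips the sign
        det *= a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            a[r] = [u - factor * v for u, v in zip(a[r], a[k])]
    return det

# Intercept plus one indicator column per treatment level (two observations each).
x_full = [
    [1, 1, 0, 0], [1, 1, 0, 0],
    [1, 0, 1, 0], [1, 0, 1, 0],
    [1, 0, 0, 1], [1, 0, 0, 1],
]
# Same matrix with the level3 indicator dropped, as in the IML design matrix.
x_reduced = [row[:3] for row in x_full]

det_full = determinant(gram(x_full))
det_reduced = determinant(gram(x_reduced))
print(det_full)     # 0: X'X cannot be inverted
print(det_reduced)  # 8: X'X can be inverted
```

Exact fractions are used instead of floating point so that the zero determinant is exactly zero rather than a small rounding residue.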

Re-running IML, we now get the following output:

Regression Coefficients
Beta_0 5.5
Beta_1 -4
Beta_2 -2

The coefficient for \(\beta_0\) is the mean for treatment level3. The mean for treatment level1 is then calculated from \(\beta_0+\beta_1=1.5\). Likewise, the mean for treatment level2 is calculated as \(\beta_0+\beta_2=3.5\).
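As a rough check on these coefficients, the sketch below solves the normal equations \(X'X\beta = X'y\) in plain Python. The response values are an assumption: they are not listed in this section and were chosen because they reproduce the \(X'y\) vector (21, 3, 7) shown in the intermediate calculations. The `solve` helper is illustrative and is not part of the IML program.

```python
from fractions import Fraction

# ASSUMPTION: these responses are not given in this section; they are chosen to
# reproduce the X'y vector (21, 3, 7) from the intermediate calculations.
y = [1, 2, 3, 4, 5, 6]
x = [
    [1, 1, 0], [1, 1, 0],   # level1
    [1, 0, 1], [1, 0, 1],   # level2
    [1, 0, 0], [1, 0, 0],   # level3 (indicator omitted)
]

def solve(a, b):
    """Solve a * beta = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        pivot = next(r for r in range(k, n) if aug[r][k] != 0)
        aug[k], aug[pivot] = aug[pivot], aug[k]
        aug[k] = [v / aug[k][k] for v in aug[k]]      # scale pivot row to 1
        for r in range(n):
            if r != k:
                factor = aug[r][k]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[k])]
    return [row[n] for row in aug]

cols = list(zip(*x))
xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]

beta = solve(xtx, xty)      # normal equations: X'X beta = X'y
means = {
    "level1": beta[0] + beta[1],
    "level2": beta[0] + beta[2],
    "level3": beta[0],
}
print([float(b) for b in beta])                   # [5.5, -4.0, -2.0]
print({k: float(v) for k, v in means.items()})    # {'level1': 1.5, 'level2': 3.5, 'level3': 5.5}
```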

Notice that the F statistic calculated from this model is the same as that produced from the Cell Means model.

ANOVA
Source      df    SS      MS     F
Treatment    2    16      8      16
Error        3    1.5     0.5
Total        5    17.5
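The agreement with the Cell Means model can be illustrated by computing the same sums of squares directly from the treatment means, with no regression coefficients involved. This is a minimal Python sketch; the response values are an assumption chosen to match the treatment means (1.5, 3.5, 5.5) and the error sum of squares reported in this section.

```python
# ASSUMPTION: responses are not listed in this section; these values match the
# treatment means and sums of squares reported here.
groups = {"level1": [1, 2], "level2": [3, 4], "level3": [5, 6]}

all_y = [v for ys in groups.values() for v in ys]
grand_mean = sum(all_y) / len(all_y)
group_means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Between-treatment, within-treatment, and total sums of squares.
ss_trt = sum(len(ys) * (group_means[g] - grand_mean) ** 2 for g, ys in groups.items())
ss_err = sum((v - group_means[g]) ** 2 for g, ys in groups.items() for v in ys)
ss_tot = sum((v - grand_mean) ** 2 for v in all_y)

df_trt = len(groups) - 1
df_err = len(all_y) - len(groups)
ms_trt = ss_trt / df_trt
ms_err = ss_err / df_err
f_stat = ms_trt / ms_err

print(ss_trt, ss_err, ss_tot)   # 16.0 1.5 17.5
print(ms_trt, ms_err, f_stat)   # 8.0 0.5 16.0
```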

 Using Minitab

We can now confirm our ANOVA table by running the analysis in ordinary software such as Minitab, provided that we can set the coding that the software uses. In Minitab, under Stat > ANOVA > General Linear Model, we control this by specifying Indicator (1,0) coding:

[Figure: Minitab dialog box for the General Linear Model]

This produces the regular ANOVA output:

Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value
trt      2  16.000  8.0000    16.00    0.025
Error    3   1.500  0.5000
Total    5  17.500

And also the regression equation:

Regression Equation

y = 5.500 - 4.000 trt_level1 - 2.000 trt_level2 + 0.0 trt_level3

 Using SAS

In SAS, the default coding is indicator coding, so when you specify the solution option in the model statement

model y=trt / solution;

you get the regression coefficients:

Solution for Fixed Effects
Effect     trt     Estimate  Standard Error  DF  t Value  Pr > |t|
Intercept            5.5000          0.5000   3    11.00    0.0016
trt        level1   -4.0000          0.7071   3    -5.66    0.0109
trt        level2   -2.0000          0.7071   3    -2.83    0.0663
trt        level3    0
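The standard errors and t values in this table can be checked by hand, assuming the usual least-squares formula for the standard error of a coefficient. Here MSE = 0.5 is the error mean square from the ANOVA table, and the diagonal entries 0.5, 1, 1 are taken from the xprimexinv matrix listed with the intermediate calculations at the end of this section.

\(\mathrm{SE}(\hat{\beta}_j)=\sqrt{\mathrm{MSE}\,\left[(X'X)^{-1}\right]_{jj}}\)

\(\mathrm{SE}(\hat{\beta}_0)=\sqrt{0.5\times 0.5}=0.5,\qquad \mathrm{SE}(\hat{\beta}_1)=\mathrm{SE}(\hat{\beta}_2)=\sqrt{0.5\times 1}\approx 0.7071\)

\(t_0=\dfrac{5.5}{0.5}=11.00,\qquad t_1=\dfrac{-4}{0.7071}\approx -5.66,\qquad t_2=\dfrac{-2}{0.7071}\approx -2.83\)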

And the same ANOVA table:

Type 3 Analysis of Variance
Source    DF  Sum of Squares  Mean Square  Expected Mean Square  Error Term    Error DF  F Value  Pr > F
trt        2       16.000000     8.000000  Var(Residual)+Q(trt)  MS(Residual)         3    16.00  0.0251
Residual   3        1.500000     0.500000  Var(Residual)

The intermediate calculations for this model are:

xprimex
6 2 2
2 2 0
2 0 2
check
1 -2.22E-16 0
3.331E-16 1 0
0 0 1
xprimey
21
3
7
SumY2
89.5
CF
73.5
xprimexinv
0.5 -0.5 -0.5
-0.5 1 0.5
-0.5 0.5 1
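These intermediate quantities can be tied back to the reported coefficients and treatment sum of squares with a short worked calculation. It assumes that CF denotes the usual correction factor \((\sum y)^2/n\), with \(\sum y = 21\) (the first entry of xprimey) and \(n = 6\) (the first entry of xprimex).

\(\hat{\beta}=(X'X)^{-1}X'y=
\begin{bmatrix}
0.5 & -0.5 & -0.5\\
-0.5 & 1 & 0.5\\
-0.5 & 0.5 & 1
\end{bmatrix}
\begin{bmatrix}
21\\
3\\
7
\end{bmatrix}
=
\begin{bmatrix}
10.5-1.5-3.5\\
-10.5+3+3.5\\
-10.5+1.5+7
\end{bmatrix}
=
\begin{bmatrix}
5.5\\
-4\\
-2
\end{bmatrix}\)

The value 89.5 printed as SumY2 coincides with \(\hat{\beta}'X'y\), so the treatment sum of squares in the ANOVA table follows by subtracting the correction factor:

\(\hat{\beta}'X'y=5.5(21)-4(3)-2(7)=89.5,\qquad CF=\dfrac{21^2}{6}=73.5,\qquad SS_{Treatment}=89.5-73.5=16\)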