##
Example: Vitamin C Study - Revisited
Section* *

Recall the Vitamin C study, a \(2 \times 2\) example from Lesson 3. From the earlier analysis, we learned that the treatment and response for this data are not independent:

\(G^2 = 4.87\), df = 1, p−value = 0.023,

And from the estimated odds ratio of 0.49, we noted that the odds of getting a cold were about twice as high for those who took the placebo, compared with those who took vitamin C. Next, we will consider the same question by fitting the log-linear model of independence.

There are (at least) two ways of fitting these models in SAS and R. First, we will see the specification for two procedures in SAS (PROC CATMOD and PROC GENMOD); see VitaminCLoglin.sas and VitaminCLoglin.R.

#### The PROC CATMOD procedure

##### INPUT:

```
proc catmod data=ski order=data;
weight count;
model treatment*response=_response_;
loglin treatment response;
run;
```

The MODEL statement (line) specifies that we are fitting the model on two variables treatment and response specified on the left-hand side of the MODEL statement. The MODEL statement only takes independent variables on its right-hand side. In the case of fitting a log-linear model, we need to use a keyword _response_ on the right-hand side.

The LOGLIN statement specifies that it's a model of independence because we are stating that the model has only main effects, one for the "treatment" variable and the other for the "response" variable, in this case.

##### OUTPUT:

Population Profiles and Response profiles

The responses we're modeling are four cell counts:

Population Profiles | |
---|---|

Sample | Sample Size |

1 | 279 |

Response Profiles | ||
---|---|---|

Response | treatment | response |

1 | placebo | cold |

2 | placebo | nocold |

3 | absorbic | cold |

4 | absorbic | nocold |

Maximum Likelihood Analysis of Variance

As with ANOVA, we are breaking down the variability across factors. More specifically, the main effect TREATMENT tests if the subjects are evenly distributed over the two levels of this variable. The null hypothesis assumes that they are, and the alternative assumes that they are not. Based on the output below, are the subjects evenly distributed across the placebo and the vitamin C conditions?

Maximum Likelihood Analysis of Variance | |||
---|---|---|---|

Source | DF | Chi-Square | Pr > ChiSq |

treatment | 1 | 0.00 | 0.9523 |

response | 1 | 98.12 | <.0001 |

Likelihood Ratio | 1 | 4.87 | 0.0273 |

The main effect RESPONSE tests if the subjects are evenly distributed over the two levels of this variable. The Wald statistic \(X^2=98.12\) with df = 1 indicates highly uneven distribution.

The LIKELIHOOD RATIO (or the deviance statistic) tells us about the overall fit of the model. Does this value look familiar? Does the model of independence fit well?

Analysis of Maximum Likelihood Estimates

Gives the estimates of the parameters. We focus on the estimation of odds.

Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|

Parameter | Estimate | Standard | Chi- | Pr > ChiSq | |

treatment | placebo | 0.00358 | 0.0599 | 0.00 | 0.9523 |

response | cold | -0.7856 | 0.0793 | 98.12 | <.0001 |

For example,

\(\lambda_1^A=\lambda_{placebo}^{TREATMENT}=0.00358=-\lambda_{ascobic}^{TREATMENT}=-\lambda_2^A\)

What are the odds of getting cold? Recall that PROC CATMOD, by default, uses zero-sum constraints. Based on this output the log odds are

\begin{align}

\text{log}(\mu_{i1}/\mu_{i2}) &=\text{log}(\mu_{i1})-\text{log}(\mu_{i2})\\

&=[\lambda+\lambda_{placebo}^{TREATMENT}+\lambda_{cold}^{RESPONSE}]-[\lambda+\lambda_{placebo}^{TREATMENT}+\lambda_{nocold}^{RESPONSE}]\\

&=\lambda_{cold}^{RESPONSE}-\lambda_{nocold}^{RESPONSE}\\

&=\lambda_{cold}^{RESPONSE}+\lambda_{cold}^{RESPONSE}\\

&=-0.7856-0.7856=-1.5712\\

\end{align}So the odds are \(\exp(-1.5712) = 0.208\). Notice that the odds are the same for all rows.

#### Think about it!

What are the odds of being in the placebo group? Note that this question is not really relevant here, but it's a good exercise to figure out the model parameterization and how we come up with estimates of the odds. Hint: log(odds)=0.00716.

#### The PROC GENMOD procedure

Fits a log-linear model as a Generalized Linear Model (GLM) and by default uses indicator variable coding. We will mostly use PROC GENMOD in this course.

##### INPUT:

```
proc genmod data=ski order=data;
class treatment response;
model count = response*treatment /link=log dist=poisson lrci type3 obstats;
run;
```

The CLASS statement (line) specifies that we are dealing with two categorical variables: treatment and response.

The MODEL statement (line) specifies that our responses are the cell counts, and link=log specifies that we are modeling the log of counts. The random distribution is Poisson, and the remaining specifications ask for different parts of the output to be printed.

##### OUTPUT:

Model Information

Gives assumptions about the model. In our case, tells us that we are modeling the log, i.e., the LINK FUNCTION, of the responses, i.e. DEPENDENT VARIABLE, that is counts which have a Poisson distribution.

Model Information | |
---|---|

Data Set | WORK.SKI |

Distribution | Poisson |

Link Function | Log |

Dependent Variable | count |

Number of Observations Read | 4 |
---|---|

Number of Observations Used | 4 |

Class Level Information

Information on the categorical variables in our model

Class Level Information | ||
---|---|---|

Class | Levels | Values |

treatment | 2 | placebo absorbic |

response | 2 | cold nocold |

Criteria For Assessing Goodness of Fit

Gives the information on the convergence of the algorithm for MLE for the parameters, and how well the model fits. What is the value of the deviance statistic \(G^2\)? What about the Pearson \(X^2\)? Note that they are the same as what we got when we did the chi-square test of independence; we will discuss "scaled" options later with the concept of overdispersion. Note also the value of the log-likelihood for the fitted model, which in this case is independence.

Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|

Criterion | DF | Value | Value/DF |

Deviance | 1 | 4.8717 | 4.8717 |

Scaled Deviance | 1 | 4.8717 | 4.8717 |

Pearson Chi-Square | 1 | 4.8114 | 4.8114 |

Scaled Pearson X2 | 1 | 4.8114 | 4.8114 |

Log Likelihood | 970.6299 |

Analysis of Parameter Estimates

Gives individual parameter estimates. Recall that PROC GENMOD does not use zero-sum constraints but fixes typically the last level of each variable to be equal to zero:

\(\hat{\lambda}=4.75,\ \hat{\lambda}^A_1=0.0072,\ \hat{\lambda}^A_2=0,\ \hat{\lambda}^B_1=-1.5712,\ \hat{\lambda}^B_2=0\)

Analysis Of Maximum Likelihood Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|

Parameter | DF | Estimate | Standard | Likelihood Ratio 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | ||

Intercept | 1 | 4.7457 | 0.0891 | 4.5663 | 4.9158 | 2836.81 | <.0001 | |

treatment | placebo | 1 | 0.0072 | 0.1197 | -0.2277 | 0.2422 | 0.00 | 0.9523 |

treatment | absorbic | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |

response | cold | 1 | -1.5712 | 0.1586 | -1.8934 | -1.2702 | 98.11 | <.0001 |

response | nocold | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |

Scale | 0 | 1.0000 | 0.0000 | 1.0000 | 1.0000 |

The scale parameter was held fixed.

So, we can compute the estimated cell counts for having cold given vitamin C based on this model:

\(\mu_{21}=\exp(\lambda+\lambda_2^A+\lambda_1^B)=\exp(4.745+0-1.5712)=23.91 \)

What about the other cells: \(\mu_{11}\), \(\mu_{12}\), and \(\mu_{22}\)?

What are the estimated odds of getting a cold?

\(\exp(\lambda_1^B-\lambda_2^B)=\exp(-1.571-0)=0.208\)

What about the odds ratio?

LR Statistics for Type 3 Analysis

Testing the main effects is similar to the Maximum Likelihood Analysis of Variance in the CATMOD, except this is the likelihood ratio (LR) statistic and not the Wald statistic. When the sample size is moderate or small, the likelihood ratio statistic is the better choice. From the CATMOD output, the chi-square Wald statistic was 98.12, with df=1, and we reject the null hypothesis. Here, we reach the same conclusion, except, that LR statistic is 130.59 with df=1.

LR Statistics For Type 3 Analysis | |||
---|---|---|---|

Source | DF | Chi-Square | Pr > ChiSq |

treatment | 1 | 0.00 | 0.9523 |

response | 1 | 130.59 | <.0001 |

Observation Statistics

Gives information on expected cell counts and five different residuals for further assessment of the model fit (i.e., "Resraw"=observed count - predicted (expected) count, "Reschi"=pearson residual, "Resdev"= deviance residuals, "StReschi"=standardized pearson residuals, etc.)

Observation Statistics | |||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Observation | count | treatment | response | Predicted Value | Linear Predictor | Standard Error of the Linear Predictor | HessWgt | Lower | Upper | Raw Residual | Pearson Residual | Deviance Residual | Std Deviance Residual | Std Pearson Residual | Likelihood Residual | Leverage | CookD | DFBETA_Intercept | DFBETA_treatmentplacebo | DFBETA_responsecold | DFBETAS_Intercept | DFBETAS_treatmentplacebo | DFBETAS_responsecold |

1 | 31 | placebo | cold | 24.086031 | 3.1816321 | 0.1561792 | 24.086031 | 17.734757 | 32.711861 | 6.9139686 | 1.4087852 | 1.3483626 | 2.0994112 | 2.1934897 | 2.1551805 | 0.5875053 | 2.2842489 | -0.060077 | 0.1197239 | 0.3491947 | -0.674251 | 0.9998856 | 2.2013658 |

2 | 109 | placebo | nocold | 115.91398 | 4.7528483 | 0.0888123 | 115.91398 | 97.395437 | 137.95359 | -6.913978 | -0.642185 | -0.648733 | -2.215859 | -2.193493 | -2.195419 | 0.9142868 | 17.107471 | -0.060077 | -0.576172 | 0.3491952 | -0.674252 | -4.811954 | 2.2013689 |

3 | 17 | absorbic | cold | 23.913989 | 3.1744636 | 0.1563437 | 23.913989 | 17.602407 | 32.488674 | -6.913989 | -1.413848 | -1.491801 | -2.314435 | -2.193496 | -2.244533 | 0.5845377 | 2.2564902 | -0.060077 | 0.1197243 | -0.346701 | -0.674253 | 0.9998885 | -2.185648 |

4 | 122 | absorbic | nocold | 115.08602 | 4.7456799 | 0.0891012 | 115.08602 | 96.645029 | 137.04577 | 6.913978 | 0.6444908 | 0.638194 | 2.172062 | 2.1934927 | 2.1916509 | 0.9136701 | 16.973819 | 0.6358195 | -0.576172 | -0.346701 | 7.1359271 | -4.811954 | -2.185645 |

#### The LOGLIN and GLM functions in R

To do the same analysis in R, the `loglin()`

* *function corresponds roughly to PROC CATMOD,* *and `glm()`

function corresponds roughly to PROC GENMOD. See VitaminCLoglin.R for the commands and to generate the output.

Here is a brief comparison to loglin() and glm() output with respect to model estimates.

The following command fits the log-linear model of independence, and from the part of the output that gives parameter estimates, we see that they correspond to the output from CATMOD. For example, the odds of getting a cold are estimated to be

\(\exp(\lambda_{cold}^{response} - \lambda_{no cold}^{response})=\exp(-0.7856-0.7856)=\exp(-1.5712)=0.208\)

```
loglin(ski, list(1, 2), fit=TRUE, param=TRUE)
$param$"(Intercept)"
[1] 3.963656
$param$Treatment
Placebo VitaminC
0.003584245 -0.003584245
$param$Cold
Cold NoCold
-0.7856083 0.7856083
```

The following command fits the log-linear model of independence using `glm()`

, and from the part of the output that gives parameter estimates, we see that they correspond to the output from GENMOD but with the signs reversed, i.e., odds of \(nocold=\exp(\lambda_{nocold}^{response})=\exp(1.5712)\). Thus, the odds of cold are \(\exp(-1.5712)\).

```
glm(ski.data$Freq~ski.data$Treatment+ski.data$Cold, family=poisson())
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.181632 0.156179 20.372 <2e-16 ***
ski.data$TreatmentVitaminC -0.007168 0.119738 -0.060 0.952
ski.data$ColdNoCold 1.571217 0.158626 9.905 <2e-16 ***
```

#### Assessing Goodness of Fit

The ANOVA function gives information on how well the model fits. The value of the deviance statistic is \(G^2=4.872\) with 1 degree of freedom, which is significant evidence to reject this model. Note that this is the same as what we got when we did the chi-square test of independence; we will discuss options for dealing with overdispersion later as well.

```
> ski.ind
Call: glm(formula = ski.data$Freq ~ ski.data$Treatment + ski.data$Cold, family = poisson())
Coefficients:
(Intercept) ski.data$TreatmentVitaminC ski.data$ColdNoCold
3.181632 -0.007168 1.571217
Degrees of Freedom: 3 Total (i.e. Null); 1 Residual
Null Deviance: 135.5
Residual Deviance: 4.872 AIC: 34
```