# 10.1.2 - Example: Therapeutic Value of Vitamin C

Recall the Vitamin C study, a 2 × 2 example from Lesson 3. From the earlier analysis, we learned that the treatment and response for this data are not independent, e.g.,

(*G*^{2} = 4.87, *df* = 1, *p*−value = 0.023),

and we estimated various strength of associations, e.g. odds ratios 0.49, indicating that skiers who took placebo were twice more likely to get cold than the ones who took vitamin C.

Next we will consider how we answer the same question but by fitting the log-linear model of independence.

There are (at least) two ways of fitting these models in SAS and R. First, we will see the specification for two procedures in SAS (PROC CATMOD and PROC GENMOD); see VitaminCLoglin.sas and different parts of output in VitaminCLoglin.lst. Furthermore below, you can see more on the corresponding R functions *loglin()* and *glm*(), by running the Inspect video, but I suggest that the R users also read sections on SAS; see VitaminCLoglin.R and different parts of output in VitaminC.Loglin.out.

#### The PROC CATMOD procedure

INPUT:

The MODEL statement (line) specifies that we are fitting the model on two variables * treatment* and * response* specified on the left-hand side of the MODEL statement. The MODEL statement only takes independent variables in its right-hand side. In the case of fitting a log-linear model, we need to use a keyword *_response_* on the right-hand side.

The LOGLIN statement specifies that it's a model of independence, because we are stating that the model has only main effects, one for "treatment" variable and the other for the "response" variable, in this case.

OUTPUT:

*Population Profiles and Response profiles*- Our responses, that is what we are modeling, are four cell counts:
*Response Matrix*- Gives the parameter constraint coding, e.g., dummy or effect, -1,1, contrasts, etc....

*Maximum Likelihood Analysis*- An iterative procedure is used to find MLE estimates for the model parameters.

*Maximum Likelihood Analysis of Variance*- Like in the ANOVA, we are breaking down the variability across factors. More specifically, the main effect TREATMENT test if the subject are evenly distributed over the two levels of this variable. The null hypothesis assumes that they are versus alternative that they are not. Based on the output below, are the subjects evenly distributed across the placebo and the vitamin C conditions?

- The main effect RESPONSE test if the subjects are evenly distributed over the two levels of this variable. Wald χ
^{2}= 98.12,*df*= 1 indicated highly uneven distribution. - The LIKELIHOOD RATIO (or the deviance statistic) tells us about the overall fit of the model. Does this value look familiar? Does the model of independence fit well?

*Analysis of Maximum Likelihood Estimates*- Gives the estimates of the parameters. We focus on estimation of odds.

- For example, \(\lambda_1^A=\lambda_{placebo}^{TREATMENT}=0.00358=-\lambda_{ascobic}^{TREATMENT}=-\lambda_2^A\)
- What are the odds of getting cold? Recall that PROC CATMOD, by default, uses zero-sum constraints, and the signs for the parameters for each cell are expressed in the
*Response Matrix*output. Based on that output (see the above output) the log odds are: - \begin{align}

\text{log}(\mu_{i1}/\mu_{i2}) &=\text{log}(\mu_{i1})-\text{log}(\mu_{i2})\\

&=[\lambda+\lambda_{placebo}^{TREATMENT}+\lambda_{cold}^{RESPONSE}]-[\lambda+\lambda_{placebo}^{TREATMENT}+\lambda_{nocold}^{RESPONSE}]\\

&=\lambda_{cold}^{RESPONSE}-\lambda_{nocold}^{RESPONSE}\\

&=\lambda_{cold}^{RESPONSE}+\lambda_{cold}^{RESPONSE}\\

&=-0.7856-0.7856=-1.5712\\

\end{align} - So the odds are
*exp*(-1.5712) = 0.208. Notice that the odds are the same for all rows!

What are the odds for being in the placebo group? Note that this question is not really relevant here, but it's a good exercise to figure out if you understand the matching between the response matrix contrasts and how do we come up with estimates of the odds. Hint: log(odds)=0.00716. |

#### The PROC GENMOD procedure

Fits a log-linear model as a Generalized Linear Model (GLM), and by default uses dummy coding. We will mostly use PROC GENMOD in this class.

INPUT:

The CLASS statement (line) specifies that we are dealing with two random variables,* treatment* and *response*.

The MODEL statement (line) specifies that our response are the cell counts, and *link=log* that we are modeling log of counts; two variables should form a model of independence, and error distribution, that is sampling distribution is Poisson; the remaining specifications ask for different parts of output to be printed on your screen or in the .lst document. For other options see SAS help.

OUTPUT:

*Model Information*- Gives assumptions about the model. In our case, tells us that we are modeling the log, i.e., the LINK FUNCTION, of the responses, i.e. DEPENDENT VARIABLE, that is counts which have Poisson distribution.

*Class Level Information*- Information on the categorical variables in our model

*Criteria For Assessing Goodness of Fit*- Gives the information on the convergence of the algorithm for MLE for the parameters, and how well the model fits. What is the value of the deviance statistic, i.e.,
*G*? How about Pearson^{2}*X*? Note that they are the same as we got when we did the chi-squared test of independence; we will discussed "scaled" options later in relation to the concept of overdispersion. Note also the value of the^{2}*log likelihood*for the fitted model, in this case the independence.

*Analysis of Parameter Estimates*- Gives individual parameter estimates. Recall that PROC GENMOD does not uses zero-sum constraints, but fixes typically the last level of each variable to be equal to zero.
- \(\hat{\lambda}=4.75,\ \hat{\lambda}^A_1=0.0072,\ \hat{\lambda}^A_2=0,\ \hat{\lambda}^B_1=-1.5712,\ \hat{\lambda}^B_2=0\)
- So what are the estimated cell counts for having cold given that you took vitamin C based on this model:
- \(\mu_{21}=\text{exp}(\lambda+\lambda_2^A+\lambda_1^B)=\text{exp}(4.745+0-1.5712)=23.91 \)
- How about the other cells: μ
_{11, }μ_{12, and }μ_{22}? - What are the
**estimated odds**of getting cold? - \(\text{exp}(\lambda_1^B-\lambda_2^B)=\text{exp}(-1.571-0)=0.208\)
- How about the
**odds ratio**?

*LR Statistics for Type 3 Analysis*- Testing the main effects, as in
*Maximum Likelihood Analysis of Variance*in the CATMOD, except this is the likelihood ratio (LR) statistic and not the Wald statistic. Do you recall the difference between the two from Lesson 2 in testing difference in proportions? Let the null hypothesis be that there is no main effect, or in this case that there is equal distribution of subjects across the levels of a particular categorical variable, e.g., response. Remember that the Wald statistic depends only on the behavior of log-likelihood at the point of ML estimate and its standard error; e.g., z=MLE/SE(MLE). The likelihood-ratio statistic depends on the behavior of log-likelihood at the point of the ML estimate but also at the point of the null hypothesis, e.g., -2(L_{0}-L_{1}). When the sample size is moderate or small, likelhood ratio statistic is better choice. From the CATMOD output, the chi-squared Wald statistic was 98.12, with df=1, and we reject the null hypothesis. Here, we reach the same conclusion, except, that LR statistic is 130.59, df=1.

*Observation Statistics*- Gives information on expected cell counts (i.e., PRED), log of the expected counts (i.e., XBETA), standard errors (i.e., Std), confidence interval (i.e., Lower, Upper) and five different residuals for further assessment of the model fit, e.g., "Resraw"=observed count - predicted (expected) count, "Reschi"=pearson residual, "Resdev"= deviance residuals, "StReschi"=standardized pearson residuals, etc.

#### The LOGLIN and GLM functions in R

To do the same analysis in R we can use loglin()* *function that corresponds to PROC CATMOD* * and glm() function that corresponds to PROC GENMOD. See, VitaminCLoglin.R and VitaminC.Loglin.out and the viewlet which runs step-by-step through the commands and the output.

(viewlet for VitaminCLoglin.R)

Here is a brief comparison to loglin() and glm() output with respect to model estimates.

The following command fits the log-linear model of independence, and from the part of the output that gives parameter estimates, we see that they correspond to the output from CATMOD, cold=exp(*Λ _{cold}^{response}+Λ_{nocold}*

^{response})=exp(-0.7856-0.7856)=exp(-1.5712).

loglin(ski, list(1, 2), fit=TRUE, param=TRUE)

$param$"(Intercept)"

[1] 3.963656

$param$Treatment

Placebo VitaminC

0.003584245 -0.003584245

$param$Cold

Cold NoCold

-0.7856083 0.7856083

The following command fits the log-linear model of independence using glm(), and from the part of the output that gives parameter estimates, we see that they correspond to the output from GENMOD, but with the signs reversed, e.g. odds of nocold=exp(*Λ _{nocold}^{response}*)=exp(1.5712), thus odds of cold exp(-1.5712).

`glm(ski.data$Freq~ski.data$Treatment+ski.data$Cold, family=poisson())`

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.181632 0.156179 20.372 <2e-16 ***

ski.data$TreatmentVitaminC -0.007168 0.119738 -0.060 0.952

ski.data$ColdNoCold 1.571217 0.158626 9.905 <2e-16 ***