10.1.2 - Example: Therapeutic Value of Vitamin C

Printer-friendly versionPrinter-friendly version

Recall the Vitamin C study, a 2 × 2 example from Lesson 3. From the earlier analysis, we learned that the treatment and response for this data are not independent, e.g.,

(G2 = 4.87, df = 1, p−value = 0.023),

and we estimated various strength of associations, e.g. odds ratios 0.49, indicating that skiers who took placebo were twice more likely to get cold than the ones who took vitamin C.

Next we will consider how we answer the same question but by fitting the log-linear model of independence.

There are (at least) two ways of fitting these models in SAS and R. First, we will see the specification for two procedures in SAS (PROC CATMOD and PROC GENMOD); see VitaminCLoglin.sas and different parts of output in VitaminCLoglin.lst.  Furthermore below, you can see more on the corresponding R functions loglin() and glm(), by running the Inspect video, but I suggest that the R users also read sections on SAS; see VitaminCLoglin.R and different parts of output in VitaminC.Loglin.out.

sas logoThe PROC CATMOD procedure


SAS program

The MODEL statement (line) specifies that we are fitting the model on two variables treatment and response specified on the left-hand side of the MODEL statement. The MODEL statement only takes independent variables in its right-hand side. In the case of fitting a log-linear model, we need to use a keyword _response_ on the right-hand side.

The LOGLIN statement specifies that it's a model of independence, because we are stating that the model has only main effects, one for "treatment" variable and the other for the "response" variable, in this case.


Population Profiles and Response profiles
Our responses, that is what we are modeling, are four cell counts:
SAS output
SAS 	output
Response Matrix
Gives the parameter constraint coding, e.g., dummy or effect, -1,1, contrasts, etc....

SAS 	ouput
Maximum Likelihood Analysis
An iterative procedure is used to find MLE estimates for the model parameters.
SAS output
Maximum Likelihood Analysis of Variance
Like in the ANOVA, we are breaking down the variability across factors. More specifically, the main effect TREATMENT test if the subject are evenly distributed over the two levels of this variable. The null hypothesis assumes that they are versus alternative that they are not. Based on the output below, are the subjects evenly distributed across the placebo and the vitamin C conditions?

SAS output

The main effect RESPONSE test if the subjects are evenly distributed over the two levels of this variable. Wald χ2 = 98.12, df = 1 indicated highly uneven distribution.

The LIKELIHOOD RATIO (or the deviance statistic) tells us about the overall fit of the model. Does this value look familiar? Does the model of independence fit well?

Analysis of Maximum Likelihood Estimates
Gives the estimates of the parameters. We focus on estimation of odds.

SAS output

For example, \(\lambda_1^A=\lambda_{placebo}^{TREATMENT}=0.00358=-\lambda_{ascobic}^{TREATMENT}=-\lambda_2^A\)
What are the odds of getting cold? Recall that PROC CATMOD, by default, uses zero-sum constraints, and the signs for the parameters for each cell are expressed in the Response Matrix output. Based on that output (see the above output) the log odds are:
\text{log}(\mu_{i1}/\mu_{i2}) &=\text{log}(\mu_{i1})-\text{log}(\mu_{i2})\\
So the odds are exp(-1.5712) = 0.208. Notice that the odds are the same for all rows!

Discuss     What are the odds for being in the placebo group? Note that this question is not really relevant here, but it's a good exercise to figure out if you understand the matching between the response matrix contrasts and how do we come up with estimates of the odds. Hint: log(odds)=0.00716.

sas logoThe PROC GENMOD procedure

Fits a log-linear model as a Generalized Linear Model (GLM), and by default uses dummy coding. We will mostly use PROC GENMOD in this class. 


SAS program

The CLASS statement (line) specifies that we are dealing with two random variables, treatment and response.

The MODEL statement (line) specifies that our response are the cell counts, and link=log that we are modeling log of counts; two variables should form a model of independence, and error distribution, that is sampling distribution is Poisson; the remaining specifications ask for different parts of output to be printed on your screen or in the .lst document. For other options see SAS help.


Model Information
Gives assumptions about the model. In our case, tells us that we are modeling the log, i.e., the LINK FUNCTION, of the responses, i.e. DEPENDENT VARIABLE, that is counts which have Poisson distribution.

SAS output
Class Level Information
Information on the categorical variables in our model

SAS output
Criteria For Assessing Goodness of Fit
Gives the information on the convergence of the algorithm for MLE for the parameters, and how well the model fits. What is the value of the deviance statistic, i.e., G2? How about Pearson X2? Note that they are the same as we got when we did the chi-squared test of independence; we will discussed "scaled" options later in relation to the concept of overdispersion. Note also the value of the log likelihood for the fitted model, in this case the independence.

SAS output
Analysis of Parameter Estimates
Gives individual parameter estimates. Recall that PROC GENMOD does not uses zero-sum constraints, but fixes typically the last level of each variable to be equal to zero.
\(\hat{\lambda}=4.75,\ \hat{\lambda}^A_1=0.0072,\ \hat{\lambda}^A_2=0,\ \hat{\lambda}^B_1=-1.5712,\ \hat{\lambda}^B_2=0\)
SAS output
So what are the estimated cell counts for having cold given that you took vitamin C based on this model:

\(\mu_{21}=\text{exp}(\lambda+\lambda_2^A+\lambda_1^B)=\text{exp}(4.745+0-1.5712)=23.91 \)
How about the other cells: μ11,  μ12, and μ22?

What are the estimated odds of getting cold?

How about the odds ratio?

LR Statistics for Type 3 Analysis
Testing the main effects, as in Maximum Likelihood Analysis of Variance in the CATMOD, except this is the likelihood ratio (LR) statistic and not the Wald statistic. Do you recall the difference between the two from Lesson 2 in testing difference in proportions? Let the null hypothesis be that there is no main effect, or in this case that there is equal distribution of subjects across the levels of a particular categorical variable, e.g., response. Remember that the Wald statistic depends only on the behavior of log-likelihood at the point of ML estimate and its standard error; e.g., z=MLE/SE(MLE). The likelihood-ratio statistic depends on the behavior of log-likelihood at the point of the ML estimate but also at the point of the null hypothesis, e.g., -2(L0-L1). When the sample size is moderate or small, likelhood ratio statistic is better choice. From the CATMOD output, the chi-squared Wald statistic was 98.12, with df=1, and we reject the null hypothesis. Here, we reach the same conclusion, except, that LR statistic is 130.59, df=1.

SAS output
Observation Statistics
Gives information on expected cell counts (i.e., PRED), log of the expected counts (i.e., XBETA), standard errors (i.e., Std), confidence interval (i.e., Lower, Upper) and five different residuals for further assessment of the model fit, e.g., "Resraw"=observed count - predicted (expected) count, "Reschi"=pearson residual, "Resdev"= deviance residuals, "StReschi"=standardized pearson residuals, etc.
SAS output

R LogoThe LOGLIN and GLM functions in R

To do the same analysis in R we can use loglin() function that corresponds to PROC CATMOD  and glm() function that corresponds to PROC GENMOD. See, VitaminCLoglin.R and VitaminC.Loglin.out and the viewlet which runs step-by-step through the commands and the output.

(viewlet for VitaminCLoglin.R)

Here is a brief comparison to loglin() and glm() output with respect to model estimates.

The following command fits the log-linear model of independence, and from the part of the output that gives parameter estimates, we see that they correspond to the output from CATMOD, cold=exp(Λcoldresponsenocoldresponse)=exp(-0.7856-0.7856)=exp(-1.5712).

loglin(ski, list(1, 2), fit=TRUE, param=TRUE)
[1] 3.963656

Placebo VitaminC
0.003584245 -0.003584245

Cold NoCold
-0.7856083 0.7856083

The following command fits the log-linear model of independence using glm(), and from the part of the output that gives parameter estimates, we see that they correspond to the output from GENMOD, but with the signs reversed, e.g. odds of nocold=exp(Λnocoldresponse)=exp(1.5712), thus odds of cold exp(-1.5712).

glm(ski.data$Freq~ski.data$Treatment+ski.data$Cold, family=poisson())

Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.181632 0.156179 20.372 <2e-16 ***
ski.data$TreatmentVitaminC -0.007168 0.119738 -0.060 0.952
ski.data$ColdNoCold 1.571217 0.158626 9.905 <2e-16 ***