6.2.4  Multilevel Predictor

The concepts discussed for binary predictors extend to predictors with multiple levels. In this lesson we consider \(Y_i\) a binary response and \(x_i\) a discrete explanatory variable (with \(k = 3\) levels), and make connections to the analysis of \(2\times3\) tables. The basic ideas extend to any \(2\times J\) table.
We begin by replicating the analysis of the original \(2\times3\) table with logistic regression.
How many parents smoke?   Student smokes?
                          Yes (Z = 1)   No (Z = 2)
Both (Y = 1)                  400          1380
One (Y = 2)                   416          1823
Neither (Y = 3)               188          1168
First, we re-express the data in terms of \(y_i=\) the number of smoking students and \(n_i=\) the total number of students for each of the three groups based on the parents' behavior (we can think of \(i\) as an index for the rows in the table).
How many parents smoke?   \(y_i\)   \(n_i\)
Both                        400      1780
One                         416      2239
Neither                     188      1356
Then we decide on a baseline level for the explanatory variable \(x\) and create \(k - 1\) indicators if \(x\) is a categorical variable with \(k\) levels. For our example, we set "Neither" as the baseline for the parent-smoking predictor and define a pair of indicators, each taking one of two values, specifically,
\(x_{1i}=1\) if parent smoking = "One" ,
\(x_{1i}=0\) otherwise
\(x_{2i}=1\) if parent smoking = "Both",
\(x_{2i}=0\) otherwise.
If these two indicator variables were added as columns in the table above, \(x_{1i}\) would have values \((0,1,0)\), and \(x_{2i}\) would have values \((1,0,0)\). Next, we let \(\pi_i\) denote the probability of student smoking for the \(i\)th row group so that finally, the model is
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)
for \(i=1,\ldots,3\). Thus, the log-odds of a student smoking is \(\beta_0\) for students whose parents don't smoke (the "neither" group), \(\beta_0+\beta_1\) for students with one smoking parent (the "one" group), and \(\beta_0+\beta_2\) for students with both parents smoking (the "both" group). In particular, \(\beta_1\) is the difference in log-odds or, equivalently, the log odds ratio for smoking when comparing students with one smoking parent against students with neither smoking parent. Similarly, \(\beta_2\) is the log odds ratio comparing students with both parents smoking against students with neither smoking parent.
Based on the tabulated data, where we saw \(1.42=(416\times 1168)/(188\times 1823)\) and \(1.80=(400\times 1168)/(188\times 1380)\), we're not surprised to see \(\hat{\beta}_1=\log(1.42)=0.351\) and \(\hat{\beta}_2=\log(1.80)=0.588\). The estimated intercept is \(\hat{\beta}_0=\log(188/1168)=-1.826\).
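As a quick numerical check, these coefficient estimates can be reproduced directly from the table counts. Here is a minimal Python sketch of the arithmetic (the counts are those in the table above; values are rounded to three decimals):

```python
import math

# Counts (smokers, non-smokers) for each parent-smoking group, from the table above
both    = (400, 1380)
one     = (416, 1823)
neither = (188, 1168)

def odds(yes, no):
    return yes / no

beta0 = math.log(odds(*neither))                  # baseline log-odds ("neither")
beta1 = math.log(odds(*one) / odds(*neither))     # log odds ratio: "one" vs "neither"
beta2 = math.log(odds(*both) / odds(*neither))    # log odds ratio: "both" vs "neither"

print(round(beta0, 3), round(beta1, 3), round(beta2, 3))  # -1.827 0.349 0.588
```

These match the fitted values \(\hat{\beta}_0=-1.826\), \(\hat{\beta}_1=0.349\), and \(\hat{\beta}_2=0.588\) up to rounding, illustrating that the saturated logistic model simply reparameterizes the table.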
Fitting A Multilevel Predictor Model in SAS and R
Several different models are fit in the smoke.sas SAS program. Here is one where '0' = neither parent smokes, '1' = one smokes, and '2' = both smoke, and we use PROC LOGISTIC; notice we could use PROC GENMOD too.
data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = s / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for 2x3 table with residuals';
run;
data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;
proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;
The option param=ref tells SAS to create a set of two dummy variables to distinguish among the three categories, where '0' = neither is the baseline because of the options descending and ref=first (see the previous section for details).
Another way of doing the same in smoke.sas, but using character labels (e.g., "both", "one", "neither") rather than numbers for the categories:
data smoke;
input s $ y n ;
cards;
both 400 1780
one 416 2239
neither 188 1356
;
proc logistic data=smoke descending;
class s (ref='neither') / order=data param=ref;
model y/n = s /scale=none lackfit;
output out=predict pred=prob;
title 'Logistic regression for 2x3 table';
run;
proc print data=predict;
run;
In the class statement, the option order=data tells SAS to sort the categories of s by the order in which they appear in the dataset rather than alphabetical order. The option ref='neither' makes "neither" the reference group (i.e., the group for which both dummy variables are zero). Let's look at some relevant portions of the output that differ from the analysis of the corresponding \(2\times2\) table in the previous section of the notes.
Model Information
Data Set                      WORK.SMOKE
Response Variable (Events)    y
Response Variable (Trials)    n
Model                         binary logit
Optimization Technique        Fisher's scoring

Number of Observations Read   3

Fisher scoring is a variant of the Newton-Raphson method for ML estimation. In logistic regression the two are equivalent. Notice that there are 3 observations, since we have 3 groupings by the levels of the explanatory variable.
Class Level Information
Class   Value   Design Variables
s       0        0   0
        1        1   0
        2        0   1
From an explanatory variable S with 3 levels (0,1,2), we created two indicator variables:
\(x_1=1\) if parent smoking = One,
\(x_1=0\) otherwise,
\(x_2=1\) if parent smoking=Both,
\(x_2=0\) otherwise.
Since parent smoking = Neither is equal to 0 for both indicator variables, it serves as the baseline.
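This dummy coding is easy to mimic by hand; here is a small Python sketch using the level names from the example (the list order follows the rows of the grouped data):

```python
# Rows of the grouped data, in the order they appear in the table
levels = ["both", "one", "neither"]   # "neither" serves as the baseline

x1 = [1 if s == "one" else 0 for s in levels]    # indicator for "one"
x2 = [1 if s == "both" else 0 for s in levels]   # indicator for "both"

print(x1)  # [0, 1, 0]
print(x2)  # [1, 0, 0]
```

The baseline group "neither" is the row where both indicators are zero, exactly as in the design-variable tables.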
Class Level Information
Class   Value     Design Variables
s       both       1   0
        one        0   1
        neither    0   0
Here, we specified "neither" to be the reference level. We used the option order=data so that "both" takes design values (1,0) and "one" takes (0,1). If we didn't use that option, the levels would be ordered alphabetically, but in this case the two orderings coincide.
The parameter estimates are reported in the output below.
Analysis of Maximum Likelihood Estimates
Parameter      DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept       1   -1.8266        0.0786          540.2949         <.0001
s         1     1    0.3491        0.0955           13.3481         0.0003
s         2     1    0.5882        0.0970           36.8105         <.0001
Here, \(s_1\) and \(s_2\) correspond to \(x_1\) and \(x_2\). The saturated model is
\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)
and the estimated probability of a child smoking, given the explanatory variable is
\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)
For example, the predicted probability of a student smoking, given that only one parent smokes, is
\(P(Y=1\mid x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)
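The fitted probabilities for all three groups follow from the inverse logit; here is a short Python check using the rounded estimates above:

```python
import math

def inv_logit(eta):
    """Map log-odds to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

b0, b1, b2 = -1.8266, 0.3491, 0.5882   # rounded estimates from the output

p_neither = inv_logit(b0)        # neither parent smokes
p_one     = inv_logit(b0 + b1)   # one parent smokes
p_both    = inv_logit(b0 + b2)   # both parents smoke

print(round(p_neither, 4), round(p_one, 4), round(p_both, 4))  # 0.1386 0.1858 0.2247
```

These agree with the sample proportions \(188/1356\), \(416/2239\), and \(400/1780\), as they must for a saturated model.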
Residuals
In SAS, if we include the statement
output out=predict pred=prob reschi=pearson resdev=deviance;
then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), and the Pearson and deviance residuals. We can add some code to calculate and print the estimated expected number of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) ("shat") and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\) ("fhat").
data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;
proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;
Running this program gives a new output section:
Obs   s   y     n      prob      shat      fhat      pearson      deviance
1     2   400   1780   0.22472   400.000   1380.00   .000000031   0
2     1   416   2239   0.18580   416.000   1823.00   3.0886E-15   0
3     0   188   1356   0.13864   188.000   1168.00   .000001291   .000001196
Here "s" holds the levels of the categorical predictor for the parents' smoking behavior, "y" is, as before, the number of smoking students at each level of the predictor, "n" is the marginal count for each level of the predictor, "prob" is the estimated probability of "success" (a student smoking, given the level of the predictor), "shat" and "fhat" are the expected numbers of successes and failures, respectively, and "pearson" and "deviance" are the Pearson and deviance residuals.
All of the "shat" and "fhat" values, that is, the predicted numbers of successes and failures, are greater than 5.0, so the chi-square approximation is trustworthy.
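The "shat" and "fhat" columns are simple products of the group sizes and fitted probabilities; a Python sketch of the same calculation (values copied from the output above):

```python
n    = [1780, 2239, 1356]             # group sizes
prob = [0.22472, 0.18580, 0.13864]    # fitted probabilities from the output

shat = [round(ni * p, 1) for ni, p in zip(n, prob)]        # expected successes
fhat = [round(ni * (1 - p), 1) for ni, p in zip(n, prob)]  # expected failures

print(shat)  # [400.0, 416.0, 188.0]
print(fhat)  # [1380.0, 1823.0, 1168.0]
```

For the saturated model the expected counts reproduce the observed counts exactly, which is why every residual in the output is essentially zero.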
Below is the R code (from smoke.R) that replicates the analysis of the original \(2\times3\) table with logistic regression.
#### Fitting logistic regression for a 2x3 table #####
#### Here is one way to read the data from the table and use glm()
#### Notice how we need to use family=binomial (link=logit)
#### while with loglinear models we used family=poisson(link=log)
parentsmoke <- as.factor(c(2,1,0))
response <- cbind(c(400,416,188), c(1380,1823,1168))
response
smoke.logistic <- glm(response ~ parentsmoke, family=binomial(link=logit))
First, let’s see the table we created for the analysis.
> response
[,1] [,2]
[1,] 400 1380
[2,] 416 1823
[3,] 188 1168
You should again notice the difference in data input. In SAS, we input y and n as data columns; here we bind the "yes" and "no" counts into a two-column response matrix.
Next, the model information is displayed.
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1]  0  0  0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.8366e+01  on 2  degrees of freedom
Residual deviance: 3.7215e-13  on 0  degrees of freedom
AIC: 28.165

Number of Fisher Scoring iterations: 2
If we don't specify the baseline, R sets the first level of the factor as the default; here, it's "0". The parentsmoke1 and parentsmoke2 terms correspond to the \(x_1\) and \(x_2\) indicators. The saturated model is
\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)
and the estimated probability of a child smoking given the explanatory variable:
\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)
For example, the predicted probability of a student smoking, given that only one parent smokes, is
\(P(Y=1\mid x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)
Residuals
In R, deviance residuals are given directly in the output of the summary() function.
> smoke.logistic <- glm(response ~ parentsmoke, family=binomial(link=logit))
> summary(smoke.logistic)
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1]  0  0  0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
We can also obtain the deviance and Pearson residuals by using the residuals.glm() function. To obtain deviance residuals:
> residuals(smoke.logistic)
[1] 0 0 0
To obtain Pearson residuals:
> residuals(smoke.logistic, type="pearson")
           1            2            3
2.440787e-13 4.355932e-13 1.477321e-11
To obtain the predicted values, use the function predict.glm(), with which we can specify the type of predicted values we want. The type argument gives the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a binomial model the default predictions are log-odds (probabilities on the logit scale) and type="response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
For example, the code below gives predicted probabilities.
> predict(smoke.logistic, type="response")
1 2 3
0.2247191 0.1857972 0.1386431
All of the predicted numbers of successes and failures are greater than 5.0, so the chi-square approximation is trustworthy.
Overall goodness-of-fit
The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zero because the model is saturated. Suppose that we fit the intercept-only model as before by removing the predictors from the model statement:
proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for intercept-only model';
run;
The goodnessoffit statistics are shown below.
Deviance and Pearson Goodness-of-Fit Statistics
Criterion   Value     DF   Value/DF   Pr > ChiSq
Deviance    38.3658    2   19.1829    <.0001
Pearson     37.5663    2   18.7832    <.0001

Number of events/trials observations: 3
The Pearson statistic \(X^2= 37.5663\) and the deviance \(G^2 = 38.3658\) are precisely equal to the usual \(X^2\) and \(G^2\) for testing independence in the \(2\times 3\) table. In this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
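Both statistics can be reproduced from the table counts with the standard formulas \(X^2=\sum (O-E)^2/E\) and \(G^2=2\sum O\log(O/E)\). A Python sketch of that calculation:

```python
import math

# Observed counts: rows = both / one / neither, columns = smokes / doesn't
obs = [[400, 1380], [416, 1823], [188, 1168]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
N = sum(row_tot)

X2 = 0.0  # Pearson statistic
G2 = 0.0  # deviance (likelihood-ratio) statistic
for i in range(3):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / N   # expected count under independence
        X2 += (obs[i][j] - expected) ** 2 / expected
        G2 += 2 * obs[i][j] * math.log(obs[i][j] / expected)

print(round(X2, 2), round(G2, 2))  # approximately 37.57 and 38.37
```

This confirms that the intercept-only (independence) model's lack of fit is exactly the classical test of independence for the table.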
We can use the Hosmer-Lemeshow test to assess the overall fit of the model. As in the previous example, the Hosmer-Lemeshow statistic is not very meaningful here since the number of groups is small.
Partition for the Hosmer and Lemeshow Test
                        Event                  Nonevent
Group   Total   Observed   Expected   Observed   Expected
1       1356      188       188.00      1168      1168.00
2       2239      416       416.00      1823      1823.00
3       1780      400       400.00      1380      1380.00

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square   DF   Pr > ChiSq
0.0000        1   1.0000
Null deviance: 3.8366e+01 on 2 degrees of freedom
Residual deviance: 3.7215e-13 on 0 degrees of freedom
AIC: 28.165
The residual deviance is almost 0 because the model is saturated. The null deviance gives \(G^2= 38.366\), which is precisely the usual \(G^2\) for testing independence in the \(2\times 3\) table. Since this statistic is large, it leads to a small p-value and provides significant evidence against the intercept-only model, in favor of the current model. We can also find the same result in the output from the anova() function. Clearly, in this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
We can use the Hosmer-Lemeshow test (available in the R file HosmerLemeshow.R) to assess the overall fit of the model. As in the previous example, the Hosmer-Lemeshow statistic is not very meaningful here since the number of groups is small.
> ## hosmerlem() takes the vector of successes, the predicted vector of successes, and g = # of groups as input
> ## produce the vector of predicted successes "yhat"
> yhat = rowSums(response)*predict(smoke.logistic, type="response")
> yhat
  1   2   3
400 416 188
> hosmerlem(response[,1], yhat, g=3) ## here with g = 3 groups
                   X^2                     Df                P(>Chi)
"1.00593633284243e-24"                    "1"                    "."
We have shown that analyzing a \(2\times 3\) table for associations is equivalent to a binary logistic regression with two dummy variables as predictors. For \(2\times J\) tables, we would fit a binary logistic regression with \(J − 1\) indicator variables.
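The same coding generalizes mechanically: for a \(2\times J\) table, build \(J-1\) indicators against a chosen baseline. Below is a hypothetical Python helper illustrating this (the function name and interface are illustrative, not from the source):

```python
def indicators(values, baseline):
    """Build J-1 dummy columns, one per non-baseline level, for a categorical variable."""
    distinct = sorted(set(values))
    return {lev: [1 if v == lev else 0 for v in values]
            for lev in distinct if lev != baseline}

# The three grouped rows of the smoking example, with "neither" as baseline
dummies = indicators(["both", "one", "neither"], baseline="neither")
print(dummies)  # {'both': [1, 0, 0], 'one': [0, 1, 0]}
```

The baseline level never gets a column; its effect is absorbed into the intercept, just as "neither" was above.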
Testing the Joint Significance of All Predictors
Starting with the (full) model
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2\)
the null hypothesis \(H_0\colon\beta_1=\beta_2=0\) specifies the interceptonly (reduced) model:
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)
In general, this test has degrees of freedom equal to the number of slope parameters, which is 2 in this case. Large chi-square statistics lead to small p-values and provide evidence to reject the intercept-only model in favor of the full model.
Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   38.3658       2   <.0001
Score              37.5663       2   <.0001
Wald               37.0861       2   <.0001
Testing for an Arbitrary Group of Coefficients
The likelihood-ratio statistic is
\(G^2 = \left(-2\log L_{\text{reduced}}\right)-\left(-2\log L_{\text{full}}\right)\)
and the degrees of freedom is \(k=\) the number of parameters differentiating the two models. The p-value is \(P(\chi^2_k \geq G^2)\).
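For this example, the statistic follows from the \(-2\log L\) values in the SAS Model Fit Statistics output, and since the chi-square tail probability with 2 degrees of freedom has the closed form \(P(\chi^2_2 \geq x)=e^{-x/2}\), the p-value needs only the standard library. A Python sketch:

```python
import math

neg2ll_reduced = 5176.510   # -2 log L for the intercept-only model (from the SAS output)
neg2ll_full    = 5138.144   # -2 log L for the model with both indicators

G2 = neg2ll_reduced - neg2ll_full   # likelihood-ratio statistic
df = 2                              # two slope parameters differentiate the models

# For df = 2 the chi-square tail probability has a closed form: P(X >= x) = exp(-x/2)
p_value = math.exp(-G2 / 2)

print(round(G2, 3))        # 38.366
print(p_value < 0.0001)    # True
```

The tiny p-value matches the "<.0001" reported for the likelihood-ratio test.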
Model Fit Statistics
                              Intercept and Covariates
Criterion   Intercept Only   Log Likelihood   Full Log Likelihood
AIC             5178.510         5144.144          28.165
SC              5185.100         5163.913          47.933
-2 Log L        5176.510         5138.144          22.165
For our example, \(G^2 = 5176.510 - 5138.144 = 38.366\) with \(3 - 1 = 2\) degrees of freedom. Notice that this matches
Likelihood Ratio 38.3658 2 <.0001
from the "Testing Global Null Hypothesis: BETA=0" section. Here is the model we just looked at, in SAS.
data smoke;
input s1 s2 y n ;
cards;
0 1 400 1780
1 0 416 2239
0 0 188 1356
;
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 s2 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;
Compare the full model statement to the reduced one:
model y/n = s1 / scale=none lackfit;
Testing Individual Parameters
For the test of the significance of a single variable \(x_j\)
\(H_0\colon\beta_j=0\) versus \(H_A\colon\beta_j\ne0\)
we can use the (Wald) test statistic and its p-value. A large value of \(z\) (relative to the standard normal) or \(z^2\) (relative to a chi-square with 1 degree of freedom) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.
From the second row of this part of the output,
Analysis of Maximum Likelihood Estimates
Parameter      DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept       1   -1.8266        0.0786          540.2949         <.0001
s         1     1    0.3491        0.0955           13.3481         0.0003
s         2     1    0.5882        0.0970           36.8105         <.0001
\(z^2=\left(\dfrac{0.3491}{0.0955}\right)^2=13.3481\)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A large value of \(z\) (relative to standard normal) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.
Confidence Intervals: An approximate \((1 − \alpha)100\)% confidence interval for \(\beta_j\) is given by
\(\hat{\beta}_j \pm z_{(1\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% CI for \(\beta_1\) is
\(0.3491 \pm 1.96 (0.0955) = (0.1619, 0.5363)\)
Then, the 95% CI for the odds ratio of a student smoking, comparing one parent smoking with neither smoking, is
\((\exp(0.1619), \exp(0.5363)) = (1.176, 1.710)\)
Since this interval does not include the value 1, we can conclude that student and parents' smoking behaviors are associated. Furthermore,
Odds Ratio Estimates
Effect     Point Estimate   95% Wald Confidence Limits
s 1 vs 0       1.418            1.176   1.710
s 2 vs 0       1.801            1.489   2.178
- The estimated conditional odds ratio of a student smoking, comparing one parent smoking with neither smoking, is \(\exp(\hat{\beta}_1) = \exp(0.3491) = 1.418\).
- The estimated conditional odds ratio of a student smoking, comparing both parents smoking with neither smoking, is \(\exp(\hat{\beta}_2) = \exp(0.5882) = 1.801\).
- The estimated conditional odds ratio of a student smoking, comparing both parents smoking with one smoking, is \(\exp(\hat{\beta}_2)/\exp(\hat{\beta}_1) = 1.801/1.418 = \exp(\hat{\beta}_2-\hat{\beta}_1) = \exp(0.5882 - 0.3491) = 1.27\). That is, compared with a student who has only one parent smoking, a student who has both parents smoking has odds of smoking 1.27 times as high.
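The interval and odds-ratio arithmetic above is easy to verify; here is a Python sketch using the reported estimates and standard errors:

```python
import math

b1, se1 = 0.3491, 0.0955   # one vs neither: estimate and standard error
b2 = 0.5882                # both vs neither
z = 1.96                   # standard normal quantile for a 95% interval

lo, hi = b1 - z * se1, b1 + z * se1    # 95% CI for the log odds ratio
print(round(lo, 4), round(hi, 4))      # 0.1619 0.5363

# Exponentiating the endpoints gives the CI for the odds ratio itself
print(round(math.exp(lo), 3), round(math.exp(hi), 3))  # 1.176 1.71

# Ratio of the two odds ratios: both vs one
print(round(math.exp(b2 - b1), 2))     # 1.27
```

The endpoints agree with the Wald confidence limits (1.176, 1.710) in the SAS Odds Ratio Estimates table.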