6.2.4 - Multi-level Predictor

The concepts discussed for binary predictors extend to predictors with multiple levels. In this lesson we consider \(Y_i\) a binary response and \(x_i\) a discrete explanatory variable (with \(k = 3\) levels) and make connections to the analysis of \(2\times3\) tables. The basic ideas extend to any \(2\times J\) table.

We begin by replicating the analysis of the original \(2\times3\) table with logistic regression.

How many parents smoke?    Student smokes?
                           Yes (Z = 1)    No (Z = 2)
Both (Y = 1)               400            1380
One (Y = 2)                416            1823
Neither (Y = 3)            188            1168

First, we re-express the data in terms of \(y_i=\) the number of smoking students and \(n_i=\) the total number of students for the three groups based on the parents' behavior (we can think of \(i\) as an index for the rows of the table).

How many parents smoke?    Student smokes? \(y_i\)    \(n_i\)
Both                       400                        1780
One                        416                        2239
Neither                    188                        1356

Then we decide on a baseline level for the explanatory variable \(x\) and create \(k-1\) indicators if \(x\) is a categorical variable with \(k\) levels. For our example, we set "Neither" as the baseline for the parent-smoking predictor and define a pair of indicators, each taking one of two values, specifically,

\(x_{1i}=1\) if parent smoking = "One" ,
\(x_{1i}=0\) otherwise

\(x_{2i}=1\) if parent smoking = "Both",
\(x_{2i}=0\) otherwise.

If these two indicator variables were added as columns in the table above, \(x_{1i}\) would have values \((0,1,0)\), and \(x_{2i}\) would have values \((1,0,0)\). Next, we let \(\pi_i\) denote the probability of student smoking for the \(i\)th row group so that finally, the model is

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)

for \(i=1,\ldots,3\). Thus, the log-odds of a student smoking is \(\beta_0\) for students whose parents don't smoke ("neither" group), \(\beta_0+\beta_1\) for students with one smoking parent ("one" group), and \(\beta_0+\beta_2\) for students with both smoking parents. In particular, \(\beta_1\) is the difference in log-odds or, equivalently, the log odds ratio for smoking when comparing students with one smoking parent against students with neither smoking parent. Similarly, \(\beta_2\) is the log odds ratio when comparing students with both smoking parents against students with neither smoking parent.
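To see this indicator coding in R, here is a minimal sketch (the variable name parent is ours, not part of smoke.R) that reproduces the two columns described above:

# Indicator (dummy) coding for a three-level predictor with "Neither" as baseline
parent <- factor(c("Both", "One", "Neither"),
                 levels = c("Neither", "One", "Both"))
model.matrix(~ parent)
#   (Intercept) parentOne parentBoth
# 1           1         0          1     <- Both
# 2           1         1          0     <- One
# 3           1         0          0     <- Neither (baseline)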

Based on the tabulated data, where we saw \(1.42=(416\times 1168)/(188\times 1823)\) and \(1.80=(400\times 1168)/(188\times 1380)\), we're not surprised to see \(\hat{\beta}_1=\log(1.42)=0.351\) and \(\hat{\beta}_2=\log(1.80)=0.588\). The estimated intercept should be \(\hat{\beta}_0=\log(188/1168)=-1.826\).
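These estimates can be reproduced by hand from the counts; here is a minimal sketch in R:

log(416 * 1168 / (188 * 1823))   # beta1-hat: log odds ratio, one vs. neither, ~0.349
log(400 * 1168 / (188 * 1380))   # beta2-hat: log odds ratio, both vs. neither, ~0.588
log(188 / 1168)                  # beta0-hat: log-odds of smoking, neither group, ~-1.827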

Fitting A Multi-level Predictor Model in SAS and R

Several different models are fit in the smoke.sas program. Here is one where '0' = neither parent smokes, '1' = one parent smokes, and '2' = both smoke; we use PROC LOGISTIC, though PROC GENMOD could be used as well.

data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = s / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for 2x3 tables with residuals';
run;

data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;

proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;

The option param=ref tells SAS to create a set of two dummy variables to distinguish among the three categories, where '0' = neither is the baseline because of the option ref=first, which takes the first level in sorted order as the reference (see the previous section for details).

Another way of doing the same thing in smoke.sas is to use character labels (e.g., "both", "one", "neither") rather than numbers for the categories:

data smoke;
input s $ y n ;
cards;
both 400 1780
one 416 2239
neither 188 1356
;
proc logistic data=smoke descending;
class s (ref='neither') / order=data param=ref;
model y/n = s /scale=none lackfit;
output out=predict pred=prob;
title 'Logistic regression for 2x3 table';
run;
proc print data=predict;
run;

In the class statement, the option order=data tells SAS to sort the categories of s by the order in which they appear in the dataset rather than in alphabetical order. The option ref='neither' makes neither the reference group (i.e., the group for which both dummy variables are zero). Let's look at some relevant portions of the output that differ from the analysis of the corresponding \(2\times2\) table in the previous section of the notes.

 
Model Information
Data Set WORK.SMOKE
Response Variable (Events) y
Response Variable (Trials) n
Model binary logit
Optimization Technique Fisher's scoring
 
Number of Observations Read 3

Fisher scoring is a variant of the Newton-Raphson method for ML estimation, and for logistic regression the two are equivalent. Notice that there are 3 observations because we have 3 groupings by the levels of the explanatory variable.

 
Class Level Information
Class Value Design Variables
s 0 0 0
  1 1 0
  2 0 1

From an explanatory variable S with 3 levels (0,1,2), we created two indicator variables:

\(x_1=1\) if parent smoking = One,
\(x_1=0\) otherwise,

\(x_2=1\) if parent smoking=Both,
\(x_2=0\) otherwise.

Since parent smoking = Neither is equal to 0 for both indicator variables, it serves as the baseline.

 
Class Level Information
Class Value Design Variables
s both 1 0
  one 0 1
  neither 0 0

Here, we specified neither to be the reference level. We used the option order=data so that both takes the values (1, 0) and one takes (0, 1). If we didn't use that option, the levels would be ordered alphabetically, but in this case the two orderings coincide.

The parameter estimates are reported in the output below.

 
Analysis of Maximum Likelihood Estimates
Parameter    DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1   -1.8266    0.0786           540.2949          <.0001
s 1           1    0.3491    0.0955            13.3481           0.0003
s 2           1    0.5882    0.0970            36.8105          <.0001

Here, \(s_1\) and \(s_2\) correspond to \(x_1\) and \(x_2\). The saturated model is

\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)

and the estimated probability of a child smoking, given the explanatory variables, is

\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)

For example, the predicted probability of a student smoking, given that only one parent is smoking is

\(P(Y=1|x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)

Residuals

In SAS, if we include the statement

output out=predict pred=prob reschi=pearson resdev=deviance;

then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), and the Pearson and deviance residuals. We can add some code to calculate and print the estimated expected numbers of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) ("shat") and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\) ("fhat").

data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;

proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;

Running this program gives a new output section:

 
Obs s y n prob shat fhat pearson deviance
1 2 400 1780 0.22472 400.000 1380.00 -.000000031 0
2 1 416 2239 0.18580 416.000 1823.00 -3.0886E-15 0
3 0 188 1356 0.13864 188.000 1168.00 -.000001291 -.000001196

Here "s" are the levels of the categorical predictor for parents' smoking behavior, "y" as before the number of students smoking for each level of the predictor, "n" the marginal counts for each level of the predictor", "prob" is the estimated probability of "success" (e.g. a student smoking given the level of the predictor), "s-hat" and "f-hat" expected number of successes and failures respectively, and "pearson" and "deviance" are Pearson and Deviance residuals.

All of the "s-hat" and "f-hat" values, that is predicted number of successes and failures are greater than 5.0, so the chi-square approximation is trustworthy.

Below is the R code (from smoke.R) that replicates the analysis of the original \(2\times3\) table with logistic regression.

#### Fitting logistic regression for a 2x3 table ##### 
#### Here is one way to read the data from the table and use glm()
#### Notice how we need to use family=binomial (link=logit) 
#### while with log-linear models we used family=poisson(link=log)

parentsmoke=as.factor(c(2,1,0))
response<-cbind(c(400,416,188),c(1380,1823,1168))
response
smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))

First, let’s see the table we created for the analysis.

> response
     [,1] [,2]
[1,]  400 1380
[2,]  416 1823
[3,]  188 1168

You should again notice the difference in data input. In SAS, we input y (events) and n (trials) as data columns; in R, we input the two columns as counts of "yes" and "no" responses.

Next, the model information is displayed.

Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals: 
[1]  0  0  0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  3.8366e+01  on 2  degrees of freedom
Residual deviance: -3.7215e-13  on 0  degrees of freedom
AIC: 28.165

Number of Fisher Scoring iterations: 2

If we don't explicitly choose a baseline (there is no "class" option as in SAS), R uses the first level of the factor as the default; here that is "0" (neither). The parentsmoke1 and parentsmoke2 variables correspond to the \(x_1\) and \(x_2\) indicators. The saturated model is

\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)

and the estimated probability of a child smoking, given the explanatory variables, is

\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)

For example, the predicted probability of a student smoking given that only one parent is smoking is

\(P(Y=1|x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)
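As a quick check, here is a minimal sketch (reusing the smoke.logistic fit above) that computes these fitted probabilities directly:

plogis(-1.8266 + 0.3491)                     # one smoking parent: ~0.1858
predict(smoke.logistic, type = "response")   # 0.2247 0.1858 0.1386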

Residuals

In R, deviance residuals are given directly in the output of the summary() function.

> smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))
> summary(smoke.logistic)
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1] 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.34905 0.09554 3.654 0.000259 ***
parentsmoke2 0.58823 0.09695 6.067 1.3e-09 ***
---

We can also obtain the deviance and Pearson residuals by using the residuals() function (the residuals.glm method).

To obtain deviance residuals

> residuals(smoke.logistic)
[1] 0 0 0

To obtain Pearson residuals:

> residuals(smoke.logistic, type="pearson")
1 2 3
-2.440787e-13 -4.355932e-13 -1.477321e-11

To obtain the predicted values, use the function predict.glm(), with which we can specify the type of predicted values we want.

type is the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model, the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
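For instance, calling predict() without specifying a type returns the fitted log-odds; a minimal sketch:

predict(smoke.logistic)   # linear predictor (log-odds) scale: ~-1.238 -1.478 -1.827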

For example, the code below gives predicted probabilities.

> predict(smoke.logistic, type="response")
1 2 3
0.2247191 0.1857972 0.1386431

Multiplying the fitted probabilities by the group totals gives the predicted numbers of successes and failures; all are greater than 5.0, so the chi-square approximation is trustworthy.
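Here is a minimal sketch of that calculation (the names shat and fhat mirror the SAS diagnostics above):

shat <- rowSums(response) * predict(smoke.logistic, type = "response")   # expected successes
fhat <- rowSums(response) - shat                                         # expected failures
cbind(shat, fhat)   # 400/1380, 416/1823, 188/1168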

Overall goodness-of-fit

The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zero because the model is saturated. Suppose that we fit the intercept-only model as before by removing the predictors from the model statement:

proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for intercept-only model';
run;

The goodness-of-fit statistics are shown below.

 
Deviance and Pearson Goodness-of-Fit Statistics
Criterion Value DF Value/DF Pr > ChiSq
Deviance 38.3658 2 19.1829 <.0001
Pearson 37.5663 2 18.7832 <.0001

Number of events/trials observations: 3

The Pearson statistic \(X^2= 37.5663\) and the deviance \(G^2 = 38.3658\) are precisely equal to the usual \(X^2\) and \(G^2\) for testing independence in the \(2\times 3\) table. In this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
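As a cross-check, here is a minimal sketch of the usual Pearson test of independence on the same counts (using the response matrix defined above):

chisq.test(response)   # X-squared ~37.57 on 2 df, matching the Pearson statistic above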

We can use the Hosmer-Lemeshow test to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.

 
Partition for the Hosmer and Lemeshow Test
Group   Total   Event Observed   Event Expected   Nonevent Observed   Nonevent Expected
1       1356    188              188.00           1168                1168.00
2       2239    416              416.00           1823                1823.00
3       1780    400              400.00           1380                1380.00
 
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
0.0000 1 1.0000
The corresponding R output reports the deviances directly:

    Null deviance:  3.8366e+01  on 2  degrees of freedom
Residual deviance: -3.7215e-13  on 0  degrees of freedom
AIC: 28.165

The residual deviance is almost 0 because the model is saturated. The null deviance gives \(G^2= 38.366\), which is precisely equal to the usual \(G^2\) for testing independence in the \(2\times 3\) table. Since this statistic is large, it leads to a small p-value and provides significant evidence against the intercept-only model in favor of the current model. We can also obtain the same result with the anova() function. Clearly, in this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
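A minimal sketch of this comparison in R, fitting the intercept-only model and using anova():

smoke.null <- glm(response ~ 1, family = binomial(link = logit))   # intercept-only model
anova(smoke.null, smoke.logistic, test = "Chisq")                  # deviance drop ~38.366 on 2 df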

We can use the Hosmer-Lemeshow test (available in the R file HosmerLemeshow.R) to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.

> ## hosmerlem() takes the vector of successes, predicted vector of success and g=# of groups as input
> ## produce the vector of predicted success "yhat"
> yhat=rowSums(response)*predict(smoke.logistic, type="response")
> yhat
1 2 3
400 416 188

> hosmerlem(response[,1], yhat, g=3) ## here run 3 groups
X^2 Df P(>Chi)
"-1.00593633284243e-24" "1" "."

We have shown that analyzing a \(2\times 3\) table for associations is equivalent to a binary logistic regression with two dummy variables as predictors. For \(2\times J\) tables, we would fit a binary logistic regression with \(J − 1\) indicator variables.

Testing the Joint Significance of All Predictors

Starting with the (full) model

\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2\)

the null hypothesis \(H_0\colon\beta_1=\beta_2=0\) specifies the intercept-only (reduced) model:

\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)

In general, this test has degrees of freedom equal to the number of slope parameters, which is 2 in this case. Large chi-square statistics lead to small p-values and provide evidence to reject the intercept-only model in favor of the full model.

 
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 38.3658 2 <.0001
Score 37.5663 2 <.0001
Wald 37.0861 2 <.0001
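In R, the likelihood-ratio statistic above and its p-value can be recovered from the fitted glm object (a minimal sketch):

G2 <- smoke.logistic$null.deviance - smoke.logistic$deviance   # ~38.366
pchisq(G2, df = 2, lower.tail = FALSE)                         # p-value < .0001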

 

Testing for an Arbitrary Group of Coefficients

The likelihood-ratio statistic is

\(G^2 = (-2\log\mbox{ likelihood from reduced model}) - (-2\log\mbox{ likelihood from full model})\)

and the degrees of freedom is \(k=\) the number of parameters differentiating the two models. The p-value is \(P(\chi^2_k \geq G^2)\).

 
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates   Full Log Likelihood
AIC         5178.510         5144.144                   28.165
SC          5185.100         5163.913                   47.933
-2 Log L    5176.510         5138.144                   22.165

For our example, \(G^2 = 5176.510 − 5138.144 = 38.3658\) with \(3 − 1 = 2\) degrees of freedom. Notice that this matches

Likelihood Ratio 38.3658 2 <.0001

from the "Testing Global Hypothesis: BETA=0" section. Here is the model we just looked at in SAS.

data smoke;
input s1 s2 $ y n ;
cards;
0 1 400 1780
1 0 416 2239
0 0 188 1356
;
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 s2 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;

proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;

The second PROC LOGISTIC call fits the reduced model with only s1 (model y/n = s1). Comparing its \(-2 \log L\) with that of the full model gives the likelihood-ratio test for the coefficient of s2.
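The same comparison in R (a minimal sketch; the indicator x1 is constructed by hand to match s1):

x1 <- c(0, 1, 0)   # rows of 'response' are both, one, neither; x1 flags "one"
reduced <- glm(response ~ x1, family = binomial(link = logit))
anova(reduced, smoke.logistic, test = "Chisq")   # LR test that the "both vs. neither" coefficient is 0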

Testing Individual Parameters

For the test of the significance of a single variable \(x_j\)

\(H_0\colon\beta_j=0\) versus \(H_A\colon\beta_j\ne0\)

we can use the (Wald) test statistic and p-value. A large value of \(z\) (relative to standard normal) or \(z^2\) (relative to chi-square with 1 degree of freedom) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.

From the second row of this part of the output,

 
Analysis of Maximum Likelihood Estimates
Parameter    DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1   -1.8266    0.0786           540.2949          <.0001
s 1           1    0.3491    0.0955            13.3481           0.0003
s 2           1    0.5882    0.0970            36.8105          <.0001

\(z^2=\left(\dfrac{0.3491}{0.0955}\right)^2=13.3481\)

The corresponding R output reports z values, whose squares equal these Wald chi-square statistics:

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

A large value of \(z\) (relative to standard normal) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.
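A minimal sketch of recovering the SAS Wald chi-squares from the R summary:

zvals <- coef(summary(smoke.logistic))[, "z value"]
zvals^2   # ~540.3, 13.35, 36.81 -- matches the Wald Chi-Square column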

 

Confidence Intervals: An approximate \((1 − \alpha)100\)% confidence interval for \(\beta_j\) is given by

\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)

For example, a 95% CI for \(\beta_1\) is

\(0.3491 \pm 1.96 (0.0955) = (0.1619, 0.5363)\)

Then, the 95% CI for the odds ratio of a student smoking, if one parent is smoking in comparison to neither smoking, is

\((\exp(0.1619), \exp(0.5363)) = (1.176, 1.710)\)

Since this interval does not include the value 1, we can conclude that student and parents' smoking behaviors are associated. Furthermore,

 
Odds Ratio Estimates
Effect      Point Estimate   95% Wald Confidence Limits
s 1 vs 0    1.418            1.176   1.710
s 2 vs 0    1.801            1.489   2.178
  • The estimated conditional odds ratio of a student smoking between one parent smoking and neither smoking is \(\exp(\beta_1) = \exp(0.3491) = 1.418\).
  • The estimated conditional odds ratio of a student smoking between both parents smoking and neither smoking is \(\exp(\beta_2) = \exp(0.5882) = 1.801\).
  • The estimated conditional odds ratio of a student smoking between both parents smoking and one smoking is \(\exp(\beta_2)/\exp(\beta_1) = 1.801/1.418 = 1.27 = \exp(\beta_2-\beta_1) = \exp(0.5882 − 0.3491) = 1.27\). That is, compared with a student who has only one parent smoking, a student who has both parents smoking has an odds of smoking 1.27 times as high.
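These intervals and odds ratios can be reproduced in R with Wald-type confidence limits (a minimal sketch using confint.default(), which uses the normal approximation):

est <- coef(smoke.logistic)
ci  <- confint.default(smoke.logistic)   # Wald CIs on the log-odds scale
exp(cbind(OR = est, ci))[-1, ]           # odds ratios ~1.418 (1.176, 1.710) and ~1.801 (1.489, 2.178)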
