6.2.4 - Multi-level Predictor

The concepts discussed for binary predictors extend to predictors with multiple levels. In this lesson we consider \(Y_i\) a binary response and \(x_i\) a discrete explanatory variable (with \(k = 3\) levels) and make connections to the analysis of \(2\times3\) tables. The basic ideas extend to any \(2\times J\) table.

We begin by replicating the analysis of the original \(2\times3\) table with logistic regression.

How many parents smoke?    Student smokes?
                           Yes (Z = 1)    No (Z = 2)
Both (Y = 1)               400            1380
One (Y = 2)                416            1823
Neither (Y = 3)            188            1168

First, we re-express the data in terms of \(y_i=\) the number of smoking students and \(n_i=\) the total number of students for the three groups based on the parents' behavior (we can think of \(i\) as an index for the rows of the table).

How many parents smoke?    Student smokes? \(y_i\)    \(n_i\)
Both                       400                        1780
One                        416                        2239
Neither                    188                        1356

Then we decide on a baseline level for the explanatory variable \(x\) and create \(k-1\) indicators if \(x\) is a categorical variable with \(k\) levels. For our example, we set "Neither" as the baseline for the parent-smoking predictor and define a pair of indicators, each taking one of two values, specifically,

\(x_{1i}=1\) if parent smoking = "One" ,
\(x_{1i}=0\) otherwise

\(x_{2i}=1\) if parent smoking = "Both",
\(x_{2i}=0\) otherwise.

If these two indicator variables were added as columns in the table above, \(x_{1i}\) would have values \((0,1,0)\), and \(x_{2i}\) would have values \((1,0,0)\). Next, we let \(\pi_i\) denote the probability of student smoking for the \(i\)th row group so that finally, the model is

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)

for \(i=1,\ldots,3\). Thus, the log-odds of a student smoking is \(\beta_0\) for students whose parents don't smoke ("neither" group), \(\beta_0+\beta_1\) for students with one smoking parent ("one" group), and \(\beta_0+\beta_2\) for students with both smoking parents. In particular, \(\beta_1\) is the difference in log-odds or, equivalently, the log odds ratio for smoking when comparing students with one smoking parent against students with neither smoking parent. Similarly, \(\beta_2\) is the log odds ratio when comparing students with both smoking parents against students with neither smoking parent.
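To see this indicator coding in R, here is a minimal sketch (the variable name parent is ours, not part of smoke.R) that reproduces the two columns described above:

# Indicator (dummy) coding for a three-level predictor with "Neither" as baseline
parent <- factor(c("Both", "One", "Neither"),
                 levels = c("Neither", "One", "Both"))
model.matrix(~ parent)
#   (Intercept) parentOne parentBoth
# 1           1         0          1     <- Both
# 2           1         1          0     <- One
# 3           1         0          0     <- Neither (baseline)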

Based on the tabulated data, where we saw \(1.42=(416\times 1168)/(188\times 1823)\) and \(1.80=(400\times 1168)/(188\times 1380)\), we're not surprised to see \(\hat{\beta}_1=\log(1.42)=0.351\) and \(\hat{\beta}_2=\log(1.80)=0.588\). The estimated intercept should be \(\hat{\beta}_0=\log(188/1168)=-1.826\).
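These estimates can be reproduced by hand from the counts; here is a minimal sketch in R:

log(416 * 1168 / (188 * 1823))   # beta1-hat: log odds ratio, one vs. neither, ~0.349
log(400 * 1168 / (188 * 1380))   # beta2-hat: log odds ratio, both vs. neither, ~0.588
log(188 / 1168)                  # beta0-hat: log-odds of smoking, neither group, ~-1.827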

Fitting A Multi-level Predictor Model in SAS and R

Several different models are fit in the smoke.sas program. Here is one where '0' = neither parent smokes, '1' = one parent smokes, and '2' = both smoke; we use PROC LOGISTIC, though PROC GENMOD could be used as well.

data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = s / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for 2x3 tables with residuals';
run;

data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;

proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;

The option param=ref tells SAS to create a set of two dummy variables to distinguish among the three categories, where '0' = neither is the baseline because of the option ref=first, which takes the first level in sorted order as the reference (see the previous section for details).

Another way of doing the same thing in smoke.sas is to use character labels (e.g., "both", "one", "neither") rather than numbers for the categories:

data smoke;
input s $ y n ;
cards;
both 400 1780
one 416 2239
neither 188 1356
;
proc logistic data=smoke descending;
class s (ref='neither') / order=data param=ref;
model y/n = s /scale=none lackfit;
output out=predict pred=prob;
title 'Logistic regression for 2x3 table';
run;
proc print data=predict;
run;

In the class statement, the option order=data tells SAS to sort the categories of s by the order in which they appear in the dataset rather than in alphabetical order. The option ref='neither' makes neither the reference group (i.e., the group for which both dummy variables are zero). Let's look at some relevant portions of the output that differ from the analysis of the corresponding \(2\times2\) table in the previous section of the notes.

 
Model Information
Data Set WORK.SMOKE
Response Variable (Events) y
Response Variable (Trials) n
Model binary logit
Optimization Technique Fisher's scoring
 
Number of Observations Read 3

Fisher scoring is a variant of the Newton-Raphson method for ML estimation, and for logistic regression the two are equivalent. Notice that there are 3 observations because we have 3 groupings by the levels of the explanatory variable.

 
Class Level Information
Class Value Design Variables
s 0 0 0
  1 1 0
  2 0 1

From an explanatory variable S with 3 levels (0,1,2), we created two indicator variables:

\(x_1=1\) if parent smoking = One,
\(x_1=0\) otherwise,

\(x_2=1\) if parent smoking=Both,
\(x_2=0\) otherwise.

Since parent smoking = Neither is equal to 0 for both indicator variables, it serves as the baseline.

 
Class Level Information
Class Value Design Variables
s both 1 0
  one 0 1
  neither 0 0

Here, we specified neither to be the reference level. We used the option order=data so that both takes the values (1, 0) and one takes (0, 1). If we didn't use that option, the levels would be ordered alphabetically, but in this case the two orderings coincide.

The parameter estimates are reported in the output below.

 
Analysis of Maximum Likelihood Estimates
Parameter    DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1   -1.8266    0.0786           540.2949          <.0001
s 1           1    0.3491    0.0955            13.3481           0.0003
s 2           1    0.5882    0.0970            36.8105          <.0001

Here, \(s_1\) and \(s_2\) correspond to \(x_1\) and \(x_2\). The saturated model is

\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)

and the estimated probability of a child smoking, given the explanatory variables, is

\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)

For example, the predicted probability of a student smoking, given that only one parent is smoking is

\(P(Y=1|x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)

Residuals

In SAS, if we include the statement

output out=predict pred=prob reschi=pearson resdev=deviance;

then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), and the Pearson and deviance residuals. We can add some code to calculate and print the estimated expected numbers of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) ("shat") and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\) ("fhat").

data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;

proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;

Running this program gives a new output section:

 
Obs s y n prob shat fhat pearson deviance
1 2 400 1780 0.22472 400.000 1380.00 -.000000031 0
2 1 416 2239 0.18580 416.000 1823.00 -3.0886E-15 0
3 0 188 1356 0.13864 188.000 1168.00 -.000001291 -.000001196

Here "s" are the levels of the categorical predictor for parents' smoking behavior, "y" as before the number of students smoking for each level of the predictor, "n" the marginal counts for each level of the predictor", "prob" is the estimated probability of "success" (e.g. a student smoking given the level of the predictor), "s-hat" and "f-hat" expected number of successes and failures respectively, and "pearson" and "deviance" are Pearson and Deviance residuals.

All of the "s-hat" and "f-hat" values, that is predicted number of successes and failures are greater than 5.0, so the chi-square approximation is trustworthy.

Below is the R code (from smoke.R) that replicates the analysis of the original \(2\times3\) table with logistic regression.

#### Fitting logistic regression for a 2x3 table ##### 
#### Here is one way to read the data from the table and use glm()
#### Notice how we need to use family=binomial (link=logit) 
#### while with log-linear models we used family=poisson(link=log)

parentsmoke=as.factor(c(2,1,0))
response<-cbind(c(400,416,188),c(1380,1823,1168))
response
smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))

First, let’s see the table we created for the analysis.

> response
     [,1] [,2]
[1,]  400 1380
[2,]  416 1823
[3,]  188 1168

You should again notice the difference in data input. In SAS, we input y (events) and n (trials) as data columns; in R, we input the two columns as counts of "yes" and "no" responses.

Next, the model information is displayed.

Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals: 
[1]  0  0  0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  3.8366e+01  on 2  degrees of freedom
Residual deviance: -3.7215e-13  on 0  degrees of freedom
AIC: 28.165

Number of Fisher Scoring iterations: 2

If we don't explicitly choose a baseline (there is no "class" option as in SAS), R uses the first level of the factor as the default; here that is "0" (neither). The parentsmoke1 and parentsmoke2 variables correspond to the \(x_1\) and \(x_2\) indicators. The saturated model is

\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)

and the estimated probability of a child smoking, given the explanatory variables, is

\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)

For example, the predicted probability of a student smoking given that only one parent is smoking is

\(P(Y=1|x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)
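As a quick check, here is a minimal sketch (reusing the smoke.logistic fit above) that computes these fitted probabilities directly:

plogis(-1.8266 + 0.3491)                     # one smoking parent: ~0.1858
predict(smoke.logistic, type = "response")   # 0.2247 0.1858 0.1386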

Residuals

In R, deviance residuals are given directly in the output of the summary() function.

> smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))
> summary(smoke.logistic)
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1] 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.34905 0.09554 3.654 0.000259 ***
parentsmoke2 0.58823 0.09695 6.067 1.3e-09 ***
---

We can also obtain the deviance and Pearson residuals by using the residuals() function (the residuals.glm method).

To obtain deviance residuals

> residuals(smoke.logistic)
[1] 0 0 0

To obtain Pearson residuals:

> residuals(smoke.logistic, type="pearson")
1 2 3
-2.440787e-13 -4.355932e-13 -1.477321e-11

To obtain the predicted values, use the function predict.glm(), with which we can specify the type of predicted values we want.

type is the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model, the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
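For instance, calling predict() without specifying a type returns the fitted log-odds; a minimal sketch:

predict(smoke.logistic)   # linear predictor (log-odds) scale: ~-1.238 -1.478 -1.827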

For example, the code below gives predicted probabilities.

> predict(smoke.logistic, type="response")
1 2 3
0.2247191 0.1857972 0.1386431

Multiplying the fitted probabilities by the group totals gives the predicted numbers of successes and failures; all are greater than 5.0, so the chi-square approximation is trustworthy.
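Here is a minimal sketch of that calculation (the names shat and fhat mirror the SAS diagnostics above):

shat <- rowSums(response) * predict(smoke.logistic, type = "response")   # expected successes
fhat <- rowSums(response) - shat                                         # expected failures
cbind(shat, fhat)   # 400/1380, 416/1823, 188/1168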

Overall goodness-of-fit

The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zero because the model is saturated. Suppose that we fit the intercept-only model as before by removing the predictors from the model statement:

proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for intercept-only model';
run;

The goodness-of-fit statistics are shown below.

 
Deviance and Pearson Goodness-of-Fit Statistics
Criterion Value DF Value/DF Pr > ChiSq
Deviance 38.3658 2 19.1829 <.0001
Pearson 37.5663 2 18.7832 <.0001

Number of events/trials observations: 3

The Pearson statistic \(X^2= 37.5663\) and the deviance \(G^2 = 38.3658\) are precisely equal to the usual \(X^2\) and \(G^2\) for testing independence in the \(2\times 3\) table. In this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
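As a cross-check, here is a minimal sketch of the usual Pearson test of independence on the same counts (using the response matrix defined above):

chisq.test(response)   # X-squared ~37.57 on 2 df, matching the Pearson statistic above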

We can use the Hosmer-Lemeshow test to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.

 
Partition for the Hosmer and Lemeshow Test
Group   Total   Event Observed   Event Expected   Nonevent Observed   Nonevent Expected
1       1356    188              188.00           1168                1168.00
2       2239    416              416.00           1823                1823.00
3       1780    400              400.00           1380                1380.00
 
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
0.0000 1 1.0000
The corresponding R output reports the deviances directly:

    Null deviance:  3.8366e+01  on 2  degrees of freedom
Residual deviance: -3.7215e-13  on 0  degrees of freedom
AIC: 28.165

The residual deviance is almost 0 because the model is saturated. The null deviance gives \(G^2= 38.366\), which is precisely equal to the usual \(G^2\) for testing independence in the \(2\times 3\) table. Since this statistic is large, it leads to a small p-value and provides significant evidence against the intercept-only model in favor of the current model. We can also obtain the same result with the anova() function. Clearly, in this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
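A minimal sketch of this comparison in R, fitting the intercept-only model and using anova():

smoke.null <- glm(response ~ 1, family = binomial(link = logit))   # intercept-only model
anova(smoke.null, smoke.logistic, test = "Chisq")                  # deviance drop ~38.366 on 2 df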

We can use the Hosmer-Lemeshow test (available in the R file HosmerLemeshow.R) to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.

> ## hosmerlem() takes the vector of successes, predicted vector of success and g=# of groups as input
> ## produce the vector of predicted success "yhat"
> yhat=rowSums(response)*predict(smoke.logistic, type="response")
> yhat
1 2 3
400 416 188

> hosmerlem(response[,1], yhat, g=3) ## here run 3 groups
X^2 Df P(>Chi)
"-1.00593633284243e-24" "1" "."

We have shown that analyzing a \(2\times 3\) table for associations is equivalent to a binary logistic regression with two dummy variables as predictors. For \(2\times J\) tables, we would fit a binary logistic regression with \(J − 1\) indicator variables.

Testing the Joint Significance of All Predictors

Starting with the (full) model

\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2\)

the null hypothesis \(H_0\colon\beta_1=\beta_2=0\) specifies the intercept-only (reduced) model:

\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)

In general, this test has degrees of freedom equal to the number of slope parameters, which is 2 in this case. Large chi-square statistics lead to small p-values and provide evidence to reject the intercept-only model in favor of the full model.

 
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 38.3658 2 <.0001
Score 37.5663 2 <.0001
Wald 37.0861 2 <.0001
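In R, the likelihood-ratio statistic above and its p-value can be recovered from the fitted glm object (a minimal sketch):

G2 <- smoke.logistic$null.deviance - smoke.logistic$deviance   # ~38.366
pchisq(G2, df = 2, lower.tail = FALSE)                         # p-value < .0001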

 

Testing for an Arbitrary Group of Coefficients

The likelihood-ratio statistic is

\(G^2 = (-2\log\mbox{ likelihood from reduced model}) - (-2\log\mbox{ likelihood from full model})\)

and the degrees of freedom is \(k=\) the number of parameters differentiating the two models. The p-value is \(P(\chi^2_k \geq G^2)\).

 
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates   Full Log Likelihood
AIC         5178.510         5144.144                   28.165
SC          5185.100         5163.913                   47.933
-2 Log L    5176.510         5138.144                   22.165

For our example, \(G^2 = 5176.510 − 5138.144 = 38.3658\) with \(3 − 1 = 2\) degrees of freedom. Notice that this matches

Likelihood Ratio 38.3658 2 <.0001

from the "Testing Global Hypothesis: BETA=0" section. Here is the model we just looked at in SAS.

data smoke;
input s1 s2 $ y n ;
cards;
0 1 400 1780
1 0 416 2239
0 0 188 1356
;
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 s2 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;

proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;

The second PROC LOGISTIC call fits the reduced model with only s1 (model y/n = s1). Comparing its \(-2 \log L\) with that of the full model gives the likelihood-ratio test for the coefficient of s2.
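The same comparison in R (a minimal sketch; the indicator x1 is constructed by hand to match s1):

x1 <- c(0, 1, 0)   # rows of 'response' are both, one, neither; x1 flags "one"
reduced <- glm(response ~ x1, family = binomial(link = logit))
anova(reduced, smoke.logistic, test = "Chisq")   # LR test that the "both vs. neither" coefficient is 0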

Testing Individual Parameters

For the test of the significance of a single variable \(x_j\)

\(H_0\colon\beta_j=0\) versus \(H_A\colon\beta_j\ne0\)

we can use the (Wald) test statistic and p-value. A large value of \(z\) (relative to standard normal) or \(z^2\) (relative to chi-square with 1 degree of freedom) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.

From the second row of this part of the output,

 
Analysis of Maximum Likelihood Estimates
Parameter    DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1   -1.8266    0.0786           540.2949          <.0001
s 1           1    0.3491    0.0955            13.3481           0.0003
s 2           1    0.5882    0.0970            36.8105          <.0001

\(z^2=\left(\dfrac{0.3491}{0.0955}\right)^2=13.3481\)

The corresponding R output reports z values, whose squares equal these Wald chi-square statistics:

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

A large value of \(z\) (relative to standard normal) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.
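A minimal sketch of recovering the SAS Wald chi-squares from the R summary:

zvals <- coef(summary(smoke.logistic))[, "z value"]
zvals^2   # ~540.3, 13.35, 36.81 -- matches the Wald Chi-Square column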

 

Confidence Intervals: An approximate \((1 − \alpha)100\)% confidence interval for \(\beta_j\) is given by

\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)

For example, a 95% CI for \(\beta_1\) is

\(0.3491 \pm 1.96 (0.0955) = (0.1619, 0.5363)\)

Then, the 95% CI for the odds ratio of a student smoking, if one parent is smoking in comparison to neither smoking, is

\((\exp(0.1619), \exp(0.5363)) = (1.176, 1.710)\)

Since this interval does not include the value 1, we can conclude that student and parents' smoking behaviors are associated. Furthermore,

 
Odds Ratio Estimates
Effect      Point Estimate   95% Wald Confidence Limits
s 1 vs 0    1.418            1.176   1.710
s 2 vs 0    1.801            1.489   2.178
  • The estimated conditional odds ratio of a student smoking between one parent smoking and neither smoking is \(\exp(\beta_1) = \exp(0.3491) = 1.418\).
  • The estimated conditional odds ratio of a student smoking between both parents smoking and neither smoking is \(\exp(\beta_2) = \exp(0.5882) = 1.801\).
  • The estimated conditional odds ratio of a student smoking between both parents smoking and one smoking is \(\exp(\beta_2)/\exp(\beta_1) = 1.801/1.418 = 1.27 = \exp(\beta_2-\beta_1) = \exp(0.5882 − 0.3491) = 1.27\). That is, compared with a student who has only one parent smoking, a student who has both parents smoking has an odds of smoking 1.27 times as high.
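These intervals and odds ratios can be reproduced in R with Wald-type confidence limits (a minimal sketch using confint.default(), which uses the normal approximation):

est <- coef(smoke.logistic)
ci  <- confint.default(smoke.logistic)   # Wald CIs on the log-odds scale
exp(cbind(OR = est, ci))[-1, ]           # odds ratios ~1.418 (1.176, 1.710) and ~1.801 (1.489, 2.178)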
