Printer-friendly versionPrinter-friendly version

There are different ways to run logistic regression depending on the format of the data. As before, since there are many different options, for details you need to refer to R help. Here are some general guidelines to keep in mind with a simple example outlined in dataformats.R where we created two binary random variables with n number of trials, e.g., n=100,

##create two binary vectors of length 100
x=sample(c(0,1),100, replace=T)
y=sample(c(0,1),100, replace=T)

If data come in a tabular form, i.e., response pattern is with counts (as seen in the previous example), that is data are grouped.

> ##create a 2x2 table with counts
> xytab=table(x,y)
> xytab
x    0  1
  0 24 29
  1 26 21

glm(): Let y be the response variable capturing the number of events with the number of success (y = 1) and failures (y = 0). We need to create a response table (e.g., count) that has count for both the "success" and "failure" out of n number of trials in its columns. Notice that count table below, could be also the number of success y = 1, and then a column computed as n-y

> count=cbind(xytab[,2],xytab[,1])
> count
  [,1] [,2]
0   29   24
1   21   26

We also need a categorical predictor variable.

    > xfactor=factor(c("0","1"))
    > xfactor
    [1] 0 1
    Levels: 0 1

We need to specify the the response distribution and a link, which in this case is done by specifying family=binomial("logit") 

    > tmp3=glm(count~xfactor, family=binomial("logit"))
    > tmp3

    Call:  glm(formula = count ~ xfactor, family = binomial("logit"))

    (Intercept)     xfactor1 
         0.1892      -0.4028 

    Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
    Null Deviance:        1.005
    Residual Deviance: -4.441e-15     AIC: 12.72

If data come in a matrix form, i.e., subject × variables matrix with one line for each subject, like a database, where data are ungrouped

> xydata=cbind(x,y)
> xydata ## 100 rows, we are showing first 7
       x y
  [1,] 0 1
  [2,] 0 1
  [3,] 0 0
  [4,] 0 1
  [5,] 1 0
  [6,] 0 1
  [7,] 0 0

glm(): We need a binary response variable y and a predictor variable x, which in this case was also binary. We need to specify the the response distribution and a link, which in this case is done by specifying family=binomial("logit")

    > tmp1=glm(y~x, family=binomial("logit"))
    > tmp1

    Call:  glm(formula = y ~ x, family = binomial("logit"))

    (Intercept)            x 
         0.1892      -0.4028 

    Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
    Null Deviance:        138.6
    Residual Deviance: 137.6     AIC: 141.6

For more options see the course examples and homework. We will follow the R output through to explain the different parts of model fitting.  The outputs from SAS (or from many other software) will be essentially the same.

Example - Student Smoking

Let's begin with the collapsed 2 × 2 table:

Student does
not smoke
1–2 parents smoke
Neither parent smokes

In R, we can use the glm()function and specify the family = binomial(link = logit). See the files smoke.R and the output generated in smoke.out.  That R code corresponds to SAS code discussed in the previous section:

smoke.R program code

Here is the R output for the 2 × 2 table that we will use in R for logistics regression:

smoke.R output

Please Note: the table above is different from the one given from the SAS program. We input y and n as data columns in SAS; here we just input data columns as yes and no.

summary(smoke.logistic) gives the following model information:

smoke.R output

Model Information & model convergence status

glm(formula = response ~ parentsmoke, family = binomial(link = logit))

(Dispersion parameter for binomial family taken to be 1)

 Number of Fisher Scoring iterations: 2

 These sections tells us which dataset we are manipulating, the labels of the response and explanatory variables and what type of model we are fitting (e.g. binary logit), and type of scoring algorithm for parameter estimation. Fisher scoring is a variant of Newton-Raphson method for ML estimation. In logistic regression they are equivalent. Since we are using an iterative procedure to fit the model, that is to find the ML estimates, we need some indication if the algorithm converged.

Overdispersion is an important concept with discrete data. In the context of logistic regression, overdispersion occurs when there are discrepancies between the observed responses yi and their predicted values \(\hat{\mu}_i = n_{i}\hat{\pi}_i\) and these values are larger than what the binomial model would predict.

If yi ~ Bin(ni, πi), the mean is μi = ni πi and the variance is μi(ni − μi)/ni. Overdispersion means that the data show evidence that the variance of the response yi is greater than μi(ni − μi)/ni.

If overdispersion is present in a dataset, the estimated standard errors and test statistics for individual parameters and the overall goodness-of-fit will be distorted and adjustments should be made. We will look at this briefly later when we look into continuous predictors.

Table of Coefficients:

             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.45918    0.08782   5.228 1.71e-07 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 This information gives us the fitted model:

\(\text{logit}(\hat{\pi_i})=\text{log}\dfrac{\hat{\pi_i}}{1-\hat{\pi_i}}=\hat{\beta_0}+\hat{\beta_1} X_i=-1.837+0.459parentsmoke1\) 

where parentsmoke1 is a dummy variable ((e.g. design variable) that takes value 1 if at least one parents is smoking and 0 if neither is smoking.  

X1 = 1 ("parentsmoke1") if parent smoking = at least one,
X1 = 0 ("parensmoke0") if parent smoking = neither

parent smoking =  0 is the baseline. We are modeling here probability of "children smoking". 

Estimated $\beta_0=-1.827$ with a standard error of 0.078 is significant and it says that log-odds of a child smoking versus not smoking if neither parents is smoking (the baseline level) is -1.827 and it's statistically significant.

Estimated $\beta_1=0.459$ with a standard error of 0.088 is significant and it says that log-odds-ratio of a child smoking versus not smoking if at least one parent is smoking versus neither parents is smoking (the baseline level) is 0.459 and it's statistically significant. $exp(0.459)=1.58 are the estimated odds-ratios; compare with our analysis in Section 6.2.

Testing Individual Parameters

Testing the hypothesis that the probability of the characteristic depends on the value of the jth variable.

Testing H0 : βj = 0     versus     H1 : βj ≠ 0.

The Wald chi-squared statistics \(z^2=(\hat{\beta}_j/\text{SE}(\hat{\beta}_k))^2\) for these tests are displayed along with the estimated coefficients in the "Coefficients" section.

The standard error for \(\hat{\beta}_1\), 0.0878, agrees exactly with the standard error that you can calculate from the corresponding 2 × 2 table. The values indicate the significant relationship between the logit of the odds of student smoking in parents' smoking behavior. This information indicates that parents' smoking behavior is a significant factor in the model. We could compare z2 to a chisquare with one degree of freedom; the p-value would then be the area to the right of z2 under the \(\chi^2_1\) density curve.

A value of z2 (Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis βj = 0 at the .05-level.


Confidence Intervals of Individual Parameters:

An approximate (1 − α) × 100% confidence interval for βj is given by

\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)

For example, a 95% confidence interval for β1

0.4592 ± 1.96 × 0.0878 = (0.287112, 0.63128)

where 0.0878 is the "Standard Error"' from the "Coefficients" section. Then, the 95% confidence interval for the odds-ratio of a child smoking if one parent is smoking in comparison to neither smoking is

(exp(0.287112), exp(0.63128)) = (1.3325, 1.880)


Thus there is a strong association between parent's and children smoking behavior. But does this model fit?

Overall goodness-of-fit testing

Test: H0 : current model vs. HA : saturated model

Residual deviance: -3.7170e-13  on 0  degrees of freedom
AIC: 19.242

The goodness-of-fit statistics, deviance, G2 from this model is zero, because the model is saturated. If we want to know the fit of the intercept only model that is provided by

Test: H0 : intercept only model vs. HA : saturated model

Null deviance:  2.9121e+01  on 1  degrees of freedom

Suppose that we fit the intercept-only model. This is accomplished by removing the predictor from the model statement, like this:

glm(response~1, family=binomial(link=logit)

The goodness-of-fit statistics are shown below.

Null deviance: 29.121  on 1  degrees of freedom
Residual deviance: 29.121  on 1  degrees of freedom
AIC: 46.362

The deviance G2 = 29.1207 is precisely equal to the G2 for testing independence in this 2 × 2 table. Thus by the assumption, the intercept-only model or the null logistic regression model states that student's smoking is unrelated to parents' smoking (e.g., assumes independence, or odds-ratio=1). But clearly, based on the values of the calculated statistics, this model (i.e., independence) does NOT fit well.

Analysis of deviance table: In R, we can test factors’ effects with the anova function to give an analysis of deviance table. We only include one factor in this model. So R dropped this factor (parentsmoke) and fit the intercept-only model to get the same statistics as above, i.e., the deviance G2 = 29.121.

smoke.R output

This example shows that analyzing a 2 × 2 table for association is equivalent to logistic regression with a single dummy variable. We can further compare these tests to the loglinear model of independence, e.g. see smokelog.R in Lesson 10.

The goodness-of-fit statistics, X2 and G2, are defined as before in the tests of independence and loglinear models (e.g. compare observed and fitted values). For the χ2 approximation to work well, we need the ni's sufficiently large so that the expected values, \(\hat{\mu}_i\) ≥ 5, and ni − \(\hat{\mu}_i\) ≥ 5 for most of the rows i. We can afford to have about 20% of these values less than 5, but none of them should fall below 1.

With real data, we often find that the ni's are not big enough for an accurate test, and there is no single best solution to handle this but a possible solution may rely strongly on the data and context of the problem.