6.2.2  Fitting the Model in R
There are different ways to run logistic regression depending on the format of the data. As before, since there are many different options, for details you need to refer to R help. Here are some general guidelines to keep in mind with a simple example outlined in dataformats.R where we created two binary random variables with n number of trials, e.g., n=100,
##create two binary vectors of length 100
x=sample(c(0,1),100, replace=T)
y=sample(c(0,1),100, replace=T)
If data come in a tabular form, i.e., response pattern is with counts (as seen in the previous example), that is data are grouped.
> ##create a 2x2 table with counts
> xytab=table(x,y)
> xytab
y
x 0 1
0 24 29
1 26 21
glm(): Let y be the response variable capturing the number of events with the number of success (y = 1) and failures (y = 0). We need to create a response table (e.g., count) that has count for both the "success" and "failure" out of n number of trials in its columns. Notice that count table below, could be also the number of success y = 1, and then a column computed as ny
> count=cbind(xytab[,2],xytab[,1])
> count
[,1] [,2]
0 29 24
1 21 26
We also need a categorical predictor variable.
> xfactor=factor(c("0","1"))
> xfactor
[1] 0 1
Levels: 0 1
We need to specify the the response distribution and a link, which in this case is done by specifying family=binomial("logit")
> tmp3=glm(count~xfactor, family=binomial("logit"))
> tmp3
Call: glm(formula = count ~ xfactor, family = binomial("logit"))
Coefficients:
(Intercept) xfactor1
0.1892 0.4028
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 1.005
Residual Deviance: 4.441e15 AIC: 12.72
If data come in a matrix form, i.e., subject × variables matrix with one line for each subject, like a database, where data are ungrouped
> xydata=cbind(x,y)
> xydata ## 100 rows, we are showing first 7
x y
[1,] 0 1
[2,] 0 1
[3,] 0 0
[4,] 0 1
[5,] 1 0
[6,] 0 1
[7,] 0 0.....
glm(): We need a binary response variable y and a predictor variable x, which in this case was also binary. We need to specify the the response distribution and a link, which in this case is done by specifying family=binomial("logit")
> tmp1=glm(y~x, family=binomial("logit"))
> tmp1
Call: glm(formula = y ~ x, family = binomial("logit"))
Coefficients:
(Intercept) x
0.1892 0.4028
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 137.6 AIC: 141.6
For more options see the course examples and homework. We will follow the R output through to explain the different parts of model fitting. The outputs from SAS (or from many other software) will be essentially the same.
Example  Student Smoking
Let's begin with the collapsed 2 × 2 table:
Student
smokes

Student does
not smoke


1–2 parents smoke

816

3203

Neither parent smokes

188

1168

In R, we can use the glm()function and specify the family = binomial(link = logit). See the files smoke.R and the output generated in smoke.out. That R code corresponds to SAS code discussed in the previous section:
Here is the R output for the 2 × 2 table that we will use in R for logistics regression:
Please Note: the table above is different from the one given from the SAS program. We input y and n as data columns in SAS; here we just input data columns as yes and no.
summary(smoke.logistic) gives the following model information:
Model Information & model convergence status
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 2
These sections tells us which dataset we are manipulating, the labels of the response and explanatory variables and what type of model we are fitting (e.g. binary logit), and type of scoring algorithm for parameter estimation. Fisher scoring is a variant of NewtonRaphson method for ML estimation. In logistic regression they are equivalent. Since we are using an iterative procedure to fit the model, that is to find the ML estimates, we need some indication if the algorithm converged.
Overdispersion is an important concept with discrete data. In the context of logistic regression, overdispersion occurs when there are discrepancies between the observed responses y_{i} and their predicted values \(\hat{\mu}_i = n_{i}\hat{\pi}_i\) and these values are larger than what the binomial model would predict.
If y_{i} ~ Bin(n_{i}, π_{i}), the mean is μ_{i} = n_{i} π_{i} and the variance is μ_{i}(n_{i} − μ_{i})/n_{i}. Overdispersion means that the data show evidence that the variance of the response y_{i} is greater than μ_{i}(n_{i} − μ_{i})/n_{i}.
If overdispersion is present in a dataset, the estimated standard errors and test statistics for individual parameters and the overall goodnessoffit will be distorted and adjustments should be made. We will look at this briefly later when we look into continuous predictors.
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 1.82661 0.07858 23.244 < 2e16 ***
parentsmoke1 0.45918 0.08782 5.228 1.71e07 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This information gives us the fitted model:
\(\text{logit}(\hat{\pi_i})=\text{log}\dfrac{\hat{\pi_i}}{1\hat{\pi_i}}=\hat{\beta_0}+\hat{\beta_1} X_i=1.837+0.459parentsmoke1\)
where parentsmoke1 is a dummy variable ((e.g. design variable) that takes value 1 if at least one parents is smoking and 0 if neither is smoking.
X_{1} = 1 ("parentsmoke1") if parent smoking = at least one,
X_{1} = 0 ("parensmoke0") if parent smoking = neither
parent smoking = 0 is the baseline. We are modeling here probability of "children smoking".
Estimated $\beta_0=1.827$ with a standard error of 0.078 is significant and it says that logodds of a child smoking versus not smoking if neither parents is smoking (the baseline level) is 1.827 and it's statistically significant.
Estimated $\beta_1=0.459$ with a standard error of 0.088 is significant and it says that logoddsratio of a child smoking versus not smoking if at least one parent is smoking versus neither parents is smoking (the baseline level) is 0.459 and it's statistically significant. $exp(0.459)=1.58 are the estimated oddsratios; compare with our analysis in Section 6.2.
Testing Individual Parameters
Testing the hypothesis that the probability of the characteristic depends on the value of the jth variable.
Testing H_{0} : β_{j} = 0 versus H_{1} : β_{j} ≠ 0.
The Wald chisquared statistics \(z^2=(\hat{\beta}_j/\text{SE}(\hat{\beta}_k))^2\) for these tests are displayed along with the estimated coefficients in the "Coefficients" section.
The standard error for \(\hat{\beta}_1\), 0.0878, agrees exactly with the standard error that you can calculate from the corresponding 2 × 2 table. The values indicate the significant relationship between the logit of the odds of student smoking in parents' smoking behavior. This information indicates that parents' smoking behavior is a significant factor in the model. We could compare z^{2} to a chisquare with one degree of freedom; the pvalue would then be the area to the right of z^{2} under the \(\chi^2_1\) density curve.
A value of z^{2} (Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis β_{j} = 0 at the .05level.
\(\beta_1:\left(\dfrac{0.4592}{0.0878}\right)^2=27.354\)
Confidence Intervals of Individual Parameters:
An approximate (1 − α) × 100% confidence interval for β_{j} is given by
\(\hat{\beta}_j \pm z_{(1\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% confidence interval for β_{1}
0.4592 ± 1.96 × 0.0878 = (0.287112, 0.63128)
where 0.0878 is the "Standard Error"' from the "Coefficients" section. Then, the 95% confidence interval for the oddsratio of a child smoking if one parent is smoking in comparison to neither smoking is
(exp(0.287112), exp(0.63128)) = (1.3325, 1.880)
Thus there is a strong association between parent's and children smoking behavior. But does this model fit?
Overall goodnessoffit testing
Test: H_{0} : current model vs. H_{A} : saturated model
Residual deviance: 3.7170e13 on 0 degrees of freedom
AIC: 19.242
The goodnessoffit statistics, deviance, G^{2} from this model is zero, because the model is saturated. If we want to know the fit of the intercept only model that is provided by
Test: H_{0} : intercept only model vs. H_{A} : saturated model
Null deviance: 2.9121e+01 on 1 degrees of freedom
Suppose that we fit the interceptonly model. This is accomplished by removing the predictor from the model statement, like this:
glm(response~1, family=binomial(link=logit)
The goodnessoffit statistics are shown below.
Null deviance: 29.121 on 1 degrees of freedom
Residual deviance: 29.121 on 1 degrees of freedom
AIC: 46.362
The deviance G^{2} = 29.1207 is precisely equal to the G^{2} for testing independence in this 2 × 2 table. Thus by the assumption, the interceptonly model or the null logistic regression model states that student's smoking is unrelated to parents' smoking (e.g., assumes independence, or oddsratio=1). But clearly, based on the values of the calculated statistics, this model (i.e., independence) does NOT fit well.
Analysis of deviance table: In R, we can test factors’ effects with the anova function to give an analysis of deviance table. We only include one factor in this model. So R dropped this factor (parentsmoke) and fit the interceptonly model to get the same statistics as above, i.e., the deviance G^{2} = 29.121.
This example shows that analyzing a 2 × 2 table for association is equivalent to logistic regression with a single dummy variable. We can further compare these tests to the loglinear model of independence, e.g. see smokelog.R in Lesson 10.
The goodnessoffit statistics, X^{2} and G^{2}, are defined as before in the tests of independence and loglinear models (e.g. compare observed and fitted values). For the χ^{2} approximation to work well, we need the n_{i}'s sufficiently large so that the expected values, \(\hat{\mu}_i\) ≥ 5, and n_{i} − \(\hat{\mu}_i\) ≥ 5 for most of the rows i. We can afford to have about 20% of these values less than 5, but none of them should fall below 1.
With real data, we often find that the n_{i}'s are not big enough for an accurate test, and there is no single best solution to handle this but a possible solution may rely strongly on the data and context of the problem.