Leukemia remission (logistic regression)
- Load the leukemia data.
- Fit a logistic regression model of REMISS vs CELL + SMEAR + INFIL + LI + BLAST + TEMP.
- Calculate 95% confidence intervals for the regression parameters based on asymptotic normality and based on profiling the least-squares estimation surface.
- Fit a logistic regression model of REMISS vs LI.
- Create a sctterplot of REMISS vs LI and add a fitted line based on the logistic regression model.
- Calculate the odds ratio for LI and a 95% confidence interval.
- Conduct a likelihood ratio (or deviance) test for LI.
- Calculate the sum of squared deviance residuals and the sum of squared Pearson residuals.
- Use the hoslem.test function in the ResourceSelection package to conduct the Hosmer-Lemeshow goodness-of-fit test.
- Calculate a version of R2 for logistic regression.
- Create residual plots using Pearson and deviance residuals.
- Calculate hat values (leverages), studentized residuals, and Cook's distances.
leukemia <- read.table("~/path-to-data/leukemia_remission.txt", header=T)
attach(leukemia)
model.1 <- glm(REMISS ~ CELL + SMEAR + INFIL + LI + BLAST + TEMP, family="binomial")
summary(model.1)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 64.25808 74.96480 0.857 0.391
# CELL 30.83006 52.13520 0.591 0.554
# SMEAR 24.68632 61.52601 0.401 0.688
# INFIL -24.97447 65.28088 -0.383 0.702
# LI 4.36045 2.65798 1.641 0.101
# BLAST -0.01153 2.26634 -0.005 0.996
# TEMP -100.17340 77.75289 -1.288 0.198
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 34.372 on 26 degrees of freedom
# Residual deviance: 21.594 on 20 degrees of freedom
# AIC: 35.594
confint.default(model.1) # based on asymptotic normality
confint(model.1) # based on profiling the least-squares estimation surface
model.2 <- glm(REMISS ~ LI, family="binomial")
summary(model.2)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -3.777 1.379 -2.740 0.00615 **
# LI 2.897 1.187 2.441 0.01464 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 34.372 on 26 degrees of freedom
# Residual deviance: 26.073 on 25 degrees of freedom
# AIC: 30.073
plot(x=LI, y=REMISS,
panel.last = lines(sort(LI), fitted(model.2)[order(LI)]))
exp(coef(model.2)[2]) # odds ratio = 18.12449
exp(confint.default(model.2)[2,]) # 95% CI = (1.770284, 185.561725)
anova(model.2, test="Chisq")
# Df Deviance Resid. Df Resid. Dev Pr(>Chi)
# NULL 26 34.372
# LI 1 8.2988 25 26.073 0.003967 **
sum(residuals(model.2, type="deviance")^2) # 26.07296
model.2$deviance # 26.07296
sum(residuals(model.2, type="pearson")^2) # 23.93298
library(ResourceSelection)
hoslem.test(model.2$y, fitted(model.2), g=9)
# Hosmer and Lemeshow goodness of fit (GOF) test
# data: REMISS, fitted(model.2)
# X-squared = 7.3293, df = 7, p-value = 0.3954
1-model.2$deviance/model.2$null.deviance # "R-squared" = 0.2414424
plot(1:27, residuals(model.2, type="pearson"), type="b")
plot(1:27, residuals(model.2, type="deviance"), type="b")
summary(influence.measures(model.2))
# dfb.1_ dfb.LI dffit cov.r cook.d hat
# 8 0.63 -0.83 -0.93_* 0.88 0.58 0.15
hatvalues(model.2)[8] # 0.1498395
residuals(model.2)[8] # -1.944852
rstudent(model.2)[8] # -2.185013
cooks.distance(model.2)[8] # 0.5833219
detach(leukemia)
Disease outbreak (logistic regression)
- Load the disease outbreak data.
- Create interaction variables.
- Fit "full" logistic regression model of Disease vs four predictors and five interactions.
- Fit "reduced" logistic regression model of Disease vs four predictors.
- Conduct a likelihood ratio (or deviance) test for the five interactions.
- Display the analysis of deviance table with sequential deviances.
disease <- read.table("~/path-to-data/DiseaseOutbreak.txt", header=T)
attach(disease)
Age.Middle <- Age*Middle
Age.Lower <- Age*Lower
Age.Sector <- Age*Sector
Middle.Sector <- Middle*Sector
Lower.Sector <- Lower*Sector
model.1 <- glm(Disease ~ Age + Middle + Lower + Sector + Age.Middle + Age.Lower +
Age.Sector + Middle.Sector + Lower.Sector, family="binomial")
model.2 <- glm(Disease ~ Age + Middle + Lower + Sector, family="binomial")
anova(model.2, model.1, test="Chisq")
# Resid. Df Resid. Dev Df Deviance Pr(>Chi)
# 1 93 101.054
# 2 88 93.996 5 7.0583 0.2163
anova(model.1, test="Chisq")
# Df Deviance Resid. Df Resid. Dev Pr(>Chi)
# NULL 97 122.318
# Age 1 7.4050 96 114.913 0.006504 **
# Middle 1 1.8040 95 113.109 0.179230
# Lower 1 1.6064 94 111.502 0.205003
# Sector 1 10.4481 93 101.054 0.001228 **
# Age.Middle 1 4.5697 92 96.484 0.032542 *
# Age.Lower 1 1.0152 91 95.469 0.313666
# Age.Sector 1 1.1202 90 94.349 0.289878
# Middle.Sector 1 0.0001 89 94.349 0.993427
# Lower.Sector 1 0.3531 88 93.996 0.552339
detach(disease)
Toxicity and insects (logistic regression using event/trial data format)
- Load the toxicity data.
- Create a Survivals variable and a matrix with Deaths in one column and Survivals in the other column.
- Fit a logistic regression model of Deaths vs Dose.
- Calculate 95% confidence intervals for the regression parameters based on asymptotic normality and based on profiling the least-squares estimation surface.
- Calculate the odds ratio for Dose and a 95% confidence interval.
- Display the observed and fitted probabilities.
- Create a sctterplot of observed probabilties vs Dose and add a fitted line based on the logistic regression model.
toxicity <- read.table("~/path-to-data/toxicity.txt", header=T)
attach(toxicity)
Survivals <- SampSize - Deaths
y <- cbind(Deaths, Survivals)
model.1 <- glm(y ~ Dose, family="binomial")
summary(model.1)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.64367 0.15610 -16.93 <2e-16 ***
# Dose 0.67399 0.03911 17.23 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 383.0695 on 5 degrees of freedom
# Residual deviance: 1.4491 on 4 degrees of freedom
# AIC: 39.358
confint.default(model.1) # based on asymptotic normality
confint(model.1) # based on profiling the least-squares estimation surface
exp(coef(model.1)[2]) # odds ratio = 1.962056
exp(confint.default(model.1)[2,]) # 95% CI = (1.817279, 2.118366)
cbind(Dose, SampSize, Deaths, Deaths/SampSize, fitted(model.1))
# Dose SampSize Deaths
# 1 1 250 28 0.112 0.1224230
# 2 2 250 53 0.212 0.2148914
# 3 3 250 93 0.372 0.3493957
# 4 4 250 126 0.504 0.5130710
# 5 5 250 172 0.688 0.6739903
# 6 6 250 197 0.788 0.8022286
plot(x=Dose, y=Deaths/SampSize,
panel.last = lines(sort(Dose), fitted(model.1)[order(Dose)]))
detach(toxicity)
Poisson example (Poisson regression)
- Load the poisson data.
- Create a scatterplot of the data.
- Fit a Poisson regression model of y vs x.
- Calculate 95% confidence intervals for the regression parameters based on asymptotic normality and based on profiling the least-squares estimation surface.
- Create a sctterplot of y vs x and add a fitted line based on the Poisson regression model.
- Conduct a likelihood ratio (or deviance) test for x.
- Calculate the sum of squared deviance residuals and the sum of squared Pearson residuals and calculate p-values based on chi-squared goodness-of-fit tests.
- Calculate pseudo R2 for Poisson regression.
- Create residual plots using Pearson and deviance residuals.
- Calculate hat values (leverages) and studentized residuals.
poisson <- read.table("~/path-to-data/poisson_simulated.txt", header=T)
attach(poisson)
plot(x=x, y=y)
model.1 <- glm(y ~ x, family="poisson")
summary(model.1)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.30787 0.28943 1.064 0.287
# x 0.07636 0.01730 4.413 1.02e-05 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for poisson family taken to be 1)
#
# Null deviance: 48.310 on 29 degrees of freedom
# Residual deviance: 27.842 on 28 degrees of freedom
# AIC: 124.5
confint.default(model.1) # based on asymptotic normality
confint(model.1) # based on profiling the least-squares estimation surface
plot(x=x, y=y,
panel.last = lines(sort(x), fitted(model.1)[order(x)]))
anova(model.1, test="Chisq")
# Df Deviance Resid. Df Resid. Dev Pr(>Chi)
# NULL 29 48.310
# x 1 20.468 28 27.842 6.065e-06 ***
sum(residuals(model.1, type="deviance")^2) # 27.84209
model.1$deviance # 27.84209
pchisq(model.1$deviance, 28, lower.tail=F) # p-value = 0.4728389
sum(residuals(model.1, type="pearson")^2) # 26.09324
pchisq(sum(residuals(model.1, type="pearson")^2), 28, lower.tail=F) # p-value = 0.5679192
1-model.1$deviance/model.1$null.deviance # Pseudo R-squared = 0.423676
plot(fitted(model.1), residuals(model.1, type="pearson"))
plot(fitted(model.1), residuals(model.1, type="deviance"))
summary(influence.measures(model.1))
# dfb.1_ dfb.x dffit cov.r cook.d hat
# 10 -0.22 0.30 0.37 1.25_* 0.08 0.18
# 21 0.37 -0.48 -0.57 1.30_* 0.15 0.23_*
residuals(model.1)[8] # 1.974329
rstudent(model.1)[8] # 2.028255
detach(poisson)
Hospital recovery (exponential regression)
- Load the recovery data.
- Create log(prog) variable.
- Obtain starting values for nonlinear model parameters from fitting a simple linear regression model of log(prog) vs days.
- Fit nonlinear regression model to data using these starting values.
- Create a scatterplot of prog vs days and add a fitted line based on the nonlinear regression model.
recovery <- read.table("~/path-to-data/recovery.txt", header=T)
attach(recovery)
logprog <- log(prog)
summary(lm(logprog ~ days))
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.037159 0.084103 48.00 5.08e-16 ***
# days -0.037974 0.002284 -16.62 3.86e-10 ***
exp(4.037159) # 56.66513
model.1 <- nls(prog ~ theta1 * exp(theta2 * days),
start=list(theta1=56.7, theta2=-0.038))
summary(model.1)
# Estimate Std. Error t value Pr(>|t|)
# theta1 58.606532 1.472159 39.81 5.70e-15 ***
# theta2 -0.039586 0.001711 -23.13 6.01e-12 ***
# ---
# Residual standard error: 1.951 on 13 degrees of freedom
plot(x=days, y=prog,
panel.last = lines(sort(days), fitted(model.1)[order(days)]))
detach(recovery)
U.S. census population (population growth nonlinear regression)
- Load the census data.
- Obtain starting values for nonlinear model parameters from observing features of a scatterplot of population vs year.
- Fit nonlinear regression model to data using these starting values.
- Create a scatterplot of population vs year and add a fitted line based on the nonlinear regression model.
- Create a residual plot.
census <- read.table("~/path-to-data/us_census.txt", header=T)
attach(census)
plot(x=year, y=population)
log(350/3.929-1) # 4.478259
log(350/5.308-1) - log(350/3.929-1) # -0.3048229
model.1 <- nls(population ~ beta1 / (1 + exp(beta2 + beta3 * (year - 1790) / 10)),
start=list(beta1=350, beta2=4.5, beta3=-0.3))
summary(model.1)
# Estimate Std. Error t value Pr(>|t|)
# beta1 389.16551 30.81196 12.63 2.2e-10 ***
# beta2 3.99035 0.07032 56.74 < 2e-16 ***
# beta3 -0.22662 0.01086 -20.87 4.6e-14 ***
# ---
# Residual standard error: 4.45 on 18 degrees of freedom
plot(x=year, y=population,
panel.last = lines(sort(year), fitted(model.1)[order(year)]))
plot(x=year, y=residuals(model.1),
panel.last = abline(h=0, lty=2))
detach(census)