9.3 - Modeling Rate Data
9.3 - Modeling Rate DataExample: Lung Cancer Incidence
The table below summarizes the lung cancer incident counts (cases) per age group for four Danish cities from 1968 to 1971. Since it's reasonable to assume that the expected count of lung cancer incidents is proportional to the population size, we would prefer to model the rate of incidents per capita.
city age pop cases
Fredericia 40-54 3059 11
Horsens 40-54 2879 13
Kolding 40-54 3142 4
Vejle 40-54 2520 5
Fredericia 55-59 800 11
Horsens 55-59 1083 6
Kolding 55-59 1050 8
Vejle 55-59 878 7
Fredericia 60-64 710 11
Horsens 60-64 923 15
Kolding 60-64 895 7
Vejle 60-64 839 10
Fredericia 65-69 581 10
Horsens 65-69 834 10
Kolding 65-69 702 11
Vejle 65-69 631 14
Fredericia 70-74 509 11
Horsens 70-74 634 12
Kolding 70-74 535 9
Vejle 70-74 539 8
Fredericia 75+ 605 10
Horsens 75+ 782 2
Kolding 75+ 659 12
Vejle 75+ 619 7
With \(Y_i\) the count of lung cancer incidents and \(t_i\) the population size for the \(i^{th}\) row in the data, the Poisson rate regression model would be
\(\log \dfrac{\mu_i}{t_i}=\log \mu_i-\log t_i=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\cdots\)
where \(Y_i\) has a Poisson distribution with mean \(E(Y_i)=\mu_i\), and \(x_1\), \(x_2\), etc. represent the (systematic) predictor set. Notice that by modeling the rate with population as the measurement size, population is not treated as another predictor, even though it is recorded in the data along with the other predictors. The main distinction the model is that no \(\beta\) coefficient is estimated for population size (it is assumed to be 1 by definition). Note also that population size is on the log scale to match the incident count.
By using an OFFSET option in the MODEL statement in GENMOD in SAS we specify an offset variable. The offset variable serves to normalize the fitted cell means per some space, grouping, or time interval to model the rates. In this case, population is the offset variable.
data cancer;
input city $ age $ pop cases;
lpop = log(pop);
cards;
Fredericia 40-54 3059 11
Horsens 40-54 2879 13
Kolding 40-54 3142 4
Vejle 40-54 2520 5
Fredericia 55-59 800 11
Horsens 55-59 1083 6
Kolding 55-59 1050 8
Vejle 55-59 878 7
Fredericia 60-64 710 11
Horsens 60-64 923 15
Kolding 60-64 895 7
Vejle 60-64 839 10
Fredericia 65-69 581 10
Horsens 65-69 834 10
Kolding 65-69 702 11
Vejle 65-69 631 14
Fredericia 70-74 509 11
Horsens 70-74 634 12
Kolding 70-74 535 9
Vejle 70-74 539 8
Fredericia 75+ 605 10
Horsens 75+ 782 2
Kolding 75+ 659 12
Vejle 75+ 619 7
; run;
proc genmod data=cancer;
class city(ref='Frederic') age(ref='40-54');
model cases = city age / dist=poi link=log offset=lpop type3;
run;
And, here is the output from the GENMOD procedure:
The GENMOD Procedure
Model Information | |
---|---|
Data Set | WORK.CANCER |
Distribution | Poisson |
Link Function | Log |
Dependent Variable | cases |
Offset Variable | lpop |
Number of Observations Read | 24 |
---|---|
Number of Observations Used | 24 |
Class Level Information | ||
---|---|---|
Class | Levels | Values |
city | 4 | Horsens Kolding Vejle Frederic |
age | 6 | 55-59 60-64 65-69 70-74 75+ 40-54 |
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 15 | 23.4475 | 1.5632 |
Scaled Deviance | 15 | 23.4475 | 1.5632 |
Pearson Chi-Square | 15 | 22.5616 | 1.5041 |
Scaled Pearson X2 | 15 | 22.5616 | 1.5041 |
Log Likelihood | 278.4530 | ||
Full Log Likelihood | -59.9178 | ||
AIC (smaller is better) | 137.8355 | ||
AICC (smaller is better) | 150.6927 | ||
BIC (smaller is better) | 148.4380 |
Algorithm converged. |
By adding offset
in the MODEL statement in GLM in R, we can specify an offset variable. The offset variable serves to normalize the fitted cell means per some space, grouping, or time interval to model the rates. In this case, population is the offset variable.
library(ISwR)
library(dplyr)
data(eba1977)
eba1977 = eba1977 %>% mutate(lpop = log(pop))
model = glm(cases ~ city + age, offset=lpop, family=poisson, data=eba1977)
summary(model)
anova(model, test='Chisq')
Here is the output we get:
> summary(model)
Call:
glm(formula = cases ~ city + age, family = poisson, data = eba1977,
offset = lpop)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.63573 -0.67296 -0.03436 0.37258 1.85267
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.6321 0.2003 -28.125 < 2e-16 ***
cityHorsens -0.3301 0.1815 -1.818 0.0690 .
cityKolding -0.3715 0.1878 -1.978 0.0479 *
cityVejle -0.2723 0.1879 -1.450 0.1472
age55-59 1.1010 0.2483 4.434 9.23e-06 ***
age60-64 1.5186 0.2316 6.556 5.53e-11 ***
age65-69 1.7677 0.2294 7.704 1.31e-14 ***
age70-74 1.8569 0.2353 7.891 3.00e-15 ***
age75+ 1.4197 0.2503 5.672 1.41e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 23.447 on 15 degrees of freedom
AIC: 137.84
The fitted model is
\(\log\dfrac{\hat{\mu}}{t} = -5.6321-0.3301C_1-0.3715C_2-0.2723C_3 + 1.1010A_1+\cdots+1.4197A_5\)
where \(C_1\), \(C_2\), and \(C_3\) are the indicators for cities Horsens, Kolding, and Vejle (Fredericia as baseline), and \(A_1,\ldots,A_5\) are the indicators for the last five age groups (40-54 as baseline).
Thus, for people in (baseline) age group 40-54 and in the city of Fredericia, the estimated average rate of lung cancer is
\(\dfrac{\hat{\mu}}{t}=e^{-5.6321}=0.003581\)
per person. For a group of 100 people in this category, the estimated average count of incidents would be \(100(0.003581)=0.3581\).
Does the overall model fit? From the deviance statistic 23.447 relative to a chi-square distribution with 15 degrees of freedom (the saturated model with city by age interactions would have 24 parameters), the p-value would be 0.0715, which is borderline. But the model with all interactions would require 24 parameters, which isn't desirable either. Is there perhaps something else we can try?
Considering Age as Quantitative
Since age was originally recorded in six groups, we needed five separate indicator variables to model it as a categorical predictor. We may also consider treating it as quantitative variable if we assign a numeric value, say the midpoint, to each group.
The following code creates a quantitative variable for age from the midpoint of each age group. It also creates an empirical rate variable for use in plotting. Note that this empirical rate is the sample ratio of observed counts to population size \(Y/t\), not to be confused with the population rate \(\mu/t\), which is estimated from the model.
data cancer;
set cancer;
if age='40-54' then age_mid=47;
if age='55-59' then age_mid=57;
if age='60-64' then age_mid=62;
if age='65-69' then age_mid=67;
if age='70-74' then age_mid=72;
if age='75+' then age_mid=75;
e_rate = cases/pop;
run;
proc sgplot data=cancer;
reg y=e_rate x=age_mid / group=city;
run;
With age treated as quantitative, we no longer include it in the CLASS statement.
proc genmod data=cancer;
class city;
model cases = age_mid / dist=poi link=log offset=lpop type3;
run;
The interpretation of the slope for age is now the increase in the rate of lung cancer (per capita) for each 1-year increase in age, provided city is held fixed.
The GENMOD Procedure
Model Information | |
---|---|
Data Set | WORK.CANCER |
Distribution | Poisson |
Link Function | Log |
Dependent Variable | cases |
Offset Variable | lpop |
Number of Observations Read | 24 |
---|---|
Number of Observations Used | 24 |
Class Level Information | ||
---|---|---|
Class | Levels | Values |
city | 4 | Horsens Kolding Vejle Frederic |
Criteria For Assessing Goodness Of Fit | |||
---|---|---|---|
Criterion | DF | Value | Value/DF |
Deviance | 19 | 46.4483 | 2.4446 |
Scaled Deviance | 19 | 46.4483 | 2.4446 |
Pearson Chi-Square | 19 | 41.4714 | 2.1827 |
Scaled Pearson X2 | 19 | 41.4714 | 2.1827 |
Log Likelihood | 266.9526 | ||
Full Log Likelihood | -71.4182 | ||
AIC (smaller is better) | 152.8363 | ||
AICC (smaller is better) | 156.1696 | ||
BIC (smaller is better) | 158.7266 |
Algorithm converged. |
Analysis Of Maximum Likelihood Parameter Estimates | ||||||||
---|---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard | Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq | ||
Intercept | 1 | -7.9822 | 0.4311 | -8.8271 | -7.1373 | 342.89 | <.0001 | |
city | Horsens | 1 | -0.3053 | 0.1814 | -0.6609 | 0.0503 | 2.83 | 0.0924 |
city | Kolding | 1 | -0.3503 | 0.1877 | -0.7182 | 0.0176 | 3.48 | 0.0620 |
city | Vejle | 1 | -0.2460 | 0.1878 | -0.6140 | 0.1220 | 1.72 | 0.1901 |
city | Frederic | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
age_mid | 1 | 0.0568 | 0.0065 | 0.0440 | 0.0696 | 75.62 | <.0001 | |
Scale | 0 | 1.0000 | 0.0000 | 1.0000 | 1.0000 |
The scale parameter was held fixed.
LR Statistics For Type 3 Analysis | |||
---|---|---|---|
Source | DF | Chi-Square | Pr > ChiSq |
city | 3 | 4.25 | 0.2357 |
age_mid | 1 | 80.07 | <.0001 |
library(dplyr)
eba1977 = eba1977 %>% mutate(
age.mid = rep(c(47,57,62,67,72,75), each=4),
e.rate = cases/pop)
ggplot(eba1977, aes(x=age.mid, y=e.rate, color=city)) +
geom_point() + geom_smooth(method='lm', se=FALSE)
The plot generated shows increasing trends between age and lung cancer rates for each city. There is also some evidence for a city effect as well as for city by age interaction, but the significance of these is doubtful, given the relatively small data set.
As we have seen before when comparing model fits with a predictor as categorical or quantitative, the benefit of treating age as quantitative is that only a single slope parameter is needed to model a linear relationship between age and the cancer rate. The tradeoff is that if this linear relationship is not accurate, the lack of fit overall may still increase.
model = glm(cases ~ city + age.mid, offset=lpop, family=poisson, data=eba1977)
summary(model)
anova(model, test='Chisq')
The interpretation of the slope for age is now the increase in the log rate of lung cancer (per capita) for each 1-year increase in age, provided city is held fixed.
> summary(model)
Call:
glm(formula = cases ~ city + age.mid, family = poisson, data = eba1977,
offset = lpop)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.0037 -0.5442 0.3013 0.7957 2.2684
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.982221 0.431070 -18.517 <2e-16 ***
cityHorsens -0.305279 0.181423 -1.683 0.0924 .
cityKolding -0.350278 0.187705 -1.866 0.0620 .
cityVejle -0.246016 0.187769 -1.310 0.1901
age.mid 0.056759 0.006527 8.696 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 46.448 on 19 degrees of freedom
AIC: 152.84