Here we work through this example in SAS.
This problem refers to data from a study of nesting horseshoe crabs (J. Brockmann, Ethology 1996); see also Agresti (2007) Sec. 3 and Agresti (2013) Sec. 4.3. Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing near her. Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weight (Wt), and carapace width (W). The response outcome for each female crab is her number of satellites (Sa). There are 173 females in this study. Datafile: crab.txt [1].
Let’s first see if the width of female's back can explain the number of satellites attached. We will start by fitting a Poisson regression model with only one predictor, width (W) via PROC GENMOD as shown in the first part of the crab.sas [2] SAS Program as shown below:
Model Sa=w specifies the response (Sa) and predictor width (W). Also, note that specification of Poisson distribution are dist=pois and link=log. The obstats option as before will give us a table of observed and predicted values and residuals. You can use any additional options in GENMOD, e.g., TYPE3, etc.
Here is the output that you should get from running just this first part of the crab.sas [2] program (found in crab.lst [3]):
Model Information
What do you learn from the "Model Information" section? How is this different from when we fitted logistic regression models? This section gives information on the GLM that's fitted. More specifically, we see that the response is distributed via Poisson, the link function is log and the dependent variable is Sa. The number of observations in the data set used is 173 and that's all of them were used in the analysis, that is there were no missing values neither for the response variable nor the predictor width.
Does the model fit well? What does the Value/DF tell you? Given the value of deviance statistic of 567.879 with 171 df, the pvalue is zero and the Value/DF is much bigger than 1, so the model does not fit well. The lack of fit maybe due to missing data, covariates or overdispersion.
Is width the significant predictor? From the "Analysis of Parameter Estimates" table, with ChiSquare stats of 67.51 (1df), the pvalue is 0.0001 and this rejects the null hypothesis that \(\beta_W=0\), we can conclude that the width of a female's crab back is a significant predictor.
The estimated model is: log (μ_{i}) = 3.3048 + 0.164W_{i}
The ASE of estimated β = 0.164 is 0.020 which is small, and the slope is statistically significant.
Interpretation: Since estimate of β > 0, the wider the female crab the greater expected number of male satellites on the multiplicative order as exp(0.1640) = 1.18. More specifically, for one unit increase in the width, the number of Sa will increase and it will be multiplied by 1.18.
If we look at the scatter plot of W vs. Sa (see further below) we may suspect some outliers, e.g., observations #48, #101 and #165. For example, #165 has W = 33.5, and Sa = 7. But by studying the residuals, we see that this is not an influential observation, e.g., standardized deviance residual StResdev = 0.739. You can consider other types of residuals, influence measures (like we saw in linear regression), as well as residual plots. Notice that there are some other points that have large outliers, e.g., #101. Here is the first two lines of the output and then it skips down to highlight the output for observations #48, #101 and #165. For the entire listing of the output run the SAS code:
From the above output we can also see that the predicted counts ("Pred") and the values of the linear predictor ("Xbeta") that is the log of the expected counts. For example, for the first observation, Pred = 3.810, Xbeta = 1.3377, log(Pred) = Xbeta, that is log(3.810) = 1.3377, or exp(Xbeta) = Pred, that is exp(1.3377) = 3.810.
From the above analysis, we see that the predictor is significant but the model does not fit well. We can further assess the lack of fit by plotting residuals or influential points, but let us assume for now that we do not have any other covariates and try to adjust for overdispersion to see if we can improve the model fit.
In the above model we detect a potential problem with overdispersion since the scale factor, e.g., Value/DF, is greater than 1.
What do you think overdispersion means for Poisson Regression? What does it tell you about the relationship between the mean and the variance of the Poisson distribution for the number of satellites? Recall that one of the reasons for overdispersion is heterogeneity where subjects within each covariate combination still differ greatly (i.e., even crabs with similar width will have different number of satellites). If that's the case, which assumption of the Poisson model [4]that is Poisson regression model is violated?
As we saw in logistic regression, if we want to test and adjust for overdispersion we need to add the scale parameter by changing scale=none to scale=pearson; see the third part of the SAS program crab.sas [2] labeled 'Adjust for overdispersion by "scale=pearson" '. (As stated earlier we can also fit a negative binomial regression instead (also see the crab.sas [2] code)
Below is the output when you run the second part of the program labeled 'Adjust for overdispersion by "scale=pearson" '. How does this compare to the output above from the earlier stage of the code? Do we have a better fit now? Consider the "Scaled Deviance" and "Scaled Pearson chisquare" statistics? The overall model seems to fit better when we account for possible overdispersion with pvalue about 0.33 for $G^2=178.454, df=171$.
With this model the random component does not have a Poisson distribution any more where the response has the same mean and variance. From the estimate given (e.g., Pearson X^{2} / 171 = 3.1822), the variance of random component (response, the number of satellites for each Width) is roughly three times the size of the mean.
The new standard errors (in comparison to the model where scale = 1), are larger, e.g., 0.0356 = 1.7839 × 0.02 which comes from the $scaled SE=\sqrt{Value/df}*original SE$. Thus the Wald X^{2} statistics will be smaller, e.g., 21.22 = 67.21 / 3.1822. Note that sqrt(3.1822) = 1.7839.
What could be another reason for poor fit besides overdispersion? How about missing other explanatory variables? Can we improve the fit by adding other variables?
Since we did not use the \$ sign in the input statement to specify that the variable "c" was categorical, we can now do it by using class c as seen below. The following change is reflected in next section of the crab.sas [2] program labeled 'Add one more variable as a predictor, "color" '.
Again compare the parts of this output with the initial crab.lst [3] output. Note the "Class level information" on 'color' indicating that this variable has 4 levels and thus are we are introducing "dummy variables" into the model. From the "Analysis of Parameter Estimates" output below we see that the reference level is level #4 (see the shaded output below). We are also adjusting for overdispersion but by using deviance instead of X^{2}, although scale=pearson is preferred; we are doing this to demonstrate options in SAS and since the values are close, it doesn't matter which option we are using!
The estimated model is: $\log{\hat{\mu_i}}$= 3.0974 + 0.1493W + 0.4474(C="1") + 0.2477(C="2") + 0.0110(C="3").
There does not seem to be a difference in the number of satellites between any color class and the reference level 4 according to the chisquared statistics for each row in the table above. Furthermore, by Type 3 Analysis output below we see that the color is not statistically significant predictor after we take the width into consideration.
LR Statistics For Type 3 Analysis Chi Source Num DF Den DF F Value Pr > F Square Pr > ChiSq W 1 168 14.96 0.0002 14.96 0.0001 C 3 168 0.85 0.4661 2.56 0.4640
Does this model fit the data better, with and without the adjusting for overdispersion?
We will run another part of the crab.sas [2] program labeled "Include color as a Numerical Predictor" by running the following change,
Compare these partial parts of the output with the output above where we used color as a categorical predictor. We are doing this just to keep in mind that different coding of the same variable will give you different fits and estimates.
What is the estimated model now? $\log{\hat{\mu_i}}$ = 2.520 + 0.1496W  0.1694C.
Since adding a covariate doesn't help, the overdispersion seems to be due to heterogeneity. Is there something else we can do with this data? We can either (1) consider different methods, e.g., small area estimation, etc.. , (2) collapse over levels of explanatory variables, or (3) transform the variables.
Let's consider grouping the data by the widths and then fitting a Poisson regression model for the rates.
Here is the sorted data by W. The columns are in the following order:
Widths, # Satellites, and Cumulative # of Satellites: Sorted
W Sa Cum Sa W Sa Cum Sa W Sa Cum Sa
1 21.0 0 0 25.3 2 103 27.3 1 270
2 22.0 0 0 25.4 6 109 27.4 5 275
3 22.5 0 0 25.4 4 113 27.4 6 281
4 22.5 1 1 25.4 0 113 27.4 3 284
5 22.5 4 5 25.5 0 113 27.5 6 290
6 22.9 4 9 25.5 0 113 27.5 9 299
7 22.9 0 9 25.5 0 113 27.5 1 300
8 22.9 0 9 25.6 0 113 27.5 6 306
9 23.0 1 10 25.6 7 120 27.5 0 306
10 23.0 0 10 25.7 8 128 27.5 3 309
11 23.1 0 10 25.7 5 133 27.6 4 313
12 23.1 0 10 25.7 0 133 27.7 6 319
13 23.1 0 10 25.7 0 133 27.7 5 324
14 23.2 4 14 25.7 0 133 27.8 0 324
15 23.4 0 14 25.7 0 133 27.8 3 327
16 23.5 0 14 25.8 10 143 27.9 7 334
17 23.7 0 14 25.8 0 143 27.9 6 340
18 23.7 0 14 25.8 0 143 28.0 0 340
19 23.7 0 14 25.8 0 143 28.0 1 341
20 23.8 0 14 25.8 3 146 28.0 4 345
21 23.8 0 14 25.8 0 146 28.2 6 351
22 23.8 6 20 25.8 0 146 28.2 8 359
23 23.9 2 22 25.9 4 150 28.2 11 370
24 24.0 0 22 26.0 4 154 28.2 1 371
25 24.0 10 32 26.0 3 157 28.3 8 379
26 24.1 0 32 26.0 14 171 28.3 15 394
27 24.2 0 32 26.0 9 180 28.3 0 394
28 24.2 2 34 26.0 5 185 28.4 3 397
29 24.3 0 34 26.0 3 188 28.4 5 402
30 24.3 0 34 26.1 5 193 28.5 0 402
31 24.5 5 39 26.1 3 196 28.5 1 403
32 24.5 1 40 26.2 0 196 28.5 9 412
33 24.5 1 41 26.2 3 199 28.5 3 415
34 24.5 6 47 26.2 3 202 28.7 0 415
35 24.5 0 47 26.2 0 202 28.7 3 418
36 24.5 0 47 26.2 0 202 28.9 4 422
37 24.5 0 47 26.2 0 202 29.0 1 423
38 24.7 0 47 26.2 2 204 29.0 4 427
39 24.7 5 52 26.2 2 206 29.0 10 437
40 24.7 0 52 26.3 1 207 29.0 3 440
41 24.7 4 56 26.5 1 208 29.0 1 441
42 24.7 4 60 26.5 4 212 29.0 1 442
43 24.8 0 60 26.5 0 212 29.3 4 446
44 24.9 0 60 26.5 0 212 29.3 12 458
45 24.9 6 66 26.5 4 216 29.5 4 462
46 24.9 0 66 26.5 7 223 29.7 5 467
47 25.0 3 69 26.7 5 228 29.8 4 471
48 25.0 2 71 26.7 2 230 30.0 8 479
49 25.0 8 79 26.7 0 230 30.0 9 488
50 25.0 5 84 26.8 5 235 30.0 5 493
51 25.0 6 90 26.8 0 235 30.2 2 495
52 25.0 4 94 26.8 0 235 30.3 3 498
53 25.1 5 99 27.0 3 238 30.5 3 501
54 25.1 0 99 27.0 3 241 31.7 4 505
55 25.2 1 100 27.0 6 247 31.9 2 507
56 25.2 1 101 27.0 6 253 33.5 7 514
57 27.0 0 253
58 27.1 8 261
59 27.1 0 261
60 27.2 5 266
61 27.2 3 269
Plot of Number of Satellites by Width of Crab—All Data


Plot of Average Number of Satellites by Width of Crab—Distinct Widths


Plot of Average Number of Satellites by Width— Widths Grouped

The data have been grouped into 8 intervals, as shown in the (grouped) data below:
Interval NumCases AverWt AverSa SDSa VarSa
=23.25 14 22.693 1.0000 1.6641 2.7692
23.2524.25 14 23.843 1.4286 2.9798 8.8792
24.2525.25 28 24.775 2.3929 2.5581 6.5439
25.2526.25 39 25.838 2.6923 3.3729 11.3765
26.2627.25 22 26.791 2.8636 2.6240 6.8854
27.2528.25 24 27.738 3.8750 2.9681 8.8096
28.2529.25 18 28.667 3.9444 4.1084 16.8790
>29.25 14 30.407 5.1429 2.8785 8.2858
Let’s run crabrate.sas [5] (this is a new program not found in crab.sas). In this program we entered the grouped data above. In this case, each observation within a category is treated as if it has the same width.
We also create a variable LCASES=log(CASES) which takes the log of the number of cases within each grouping. This is our adjustment value 't' in the model that represents the fixed space, in this case the group (crabs with similar width). We thus form a rate of satellites for each group by dividing by each group size, and are fitting a loglinear model to rate of satellites incidence given the crab's width.
Here is the output. Note "Offset variable" under the "Model Information".
Does the model now fit better or worse than before? It clearly fits better. For example the Value/DF for the deviance statistic now is 1.0861.
The estimated model is: $log (\hat{\mu}_i/t)$ = 3.535 + 0.1727width_{i}
As the width increases, the rate of satellites cases changes by exp(0.1727).
The residuals analysis indicate the good fit as well. From the table above we also see that the predicted number of counts ('Pred') cells correspond a bit better to the observed counts in the "SaTotal" cells. Let's compare the observed and fitted values in the plot below:
This last part is the output from crabrate.sas [5] where just for demonstration we fit the Poisson regression model with the identity link for the rate data. Notice that this model does NOT fit well for the grouped data as the Value/DF for deviance statistic is about 11.649.
Links:
[1] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab.txt
[2] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab.sas
[3] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab.lst
[4] https://onlinecourses.science.psu.edu/stat504/node/57
[5] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab5.sas