9.2 - Modeling Count Data

Example: Horseshoe Crabs and Satellites Section

image of horseshoe crab This problem refers to data from a study of nesting horseshoe crabs (J. Brockmann, Ethology 1996). Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing near her. Explanatory variables that are thought to affect this included the female crab's color, spine condition, and carapace width, and weight. The response outcome for each female crab is the number of satellites. There are 173 females in this study.

Let's first see if the carapace width can explain the number of satellites attached. We will start by fitting a Poisson regression model with carapace width as the only predictor.

proc genmod data=crab;
model Sa=w / dist=poi link=log obstats;
run;

Model Sa=w specifies the response (Sa) and predictor width (W). Also, note that specifications of Poisson distribution are dist=pois and link=log. The obstats option as before will give us a table of observed and predicted values and residuals. We can use any additional options in GENMOD, e.g., TYPE3, etc.

Here is the output that we should get from running just this part:

Model Information

What do we learn from the "Model Information" section? How is this different from when we fitted logistic regression models? This section gives information on the GLM that's fitted. More specifically, we see that the response is distributed via Poisson, the link function is log, and the dependent variable is Sa. The number of observations in the data set used is 173.

The SAS System

The GENMOD Procedure


Model Information
Data Set	WORK.CRAB
Distribution	Poisson
Link Function	Log
Dependent Variable	Sa


Number of Observations Read	173
Number of Observations Used	173

Does the model fit well? What does the Value/DF tell us? Given the value of deviance statistic of 567.879 with 171 df, the p-value is zero and the Value/DF is much bigger than 1, so the model does not fit well. The lack of fit may be due to missing data, predictors, or overdispersion. It should also be noted that the deviance and Pearson tests for lack of fit rely on reasonably large expected Poisson counts, which are mostly below five, in this case, so the test results are not entirely reliable. Still, we'd like to see a better-fitting model if possible.


Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	171	567.8786	3.3209
Scaled Deviance	171	567.8786	3.3209
Pearson Chi-Square	171	544.1570	3.1822
Scaled Pearson X2	171	544.1570	3.1822
Log Likelihood		68.4463
Full Log Likelihood		-461.5881
AIC (smaller is better)		927.1762
AICC (smaller is better)		927.2468
BIC (smaller is better)		933.4828


Algorithm converged.

Is width a significant predictor? From the "Analysis of Parameter Estimates" table, with Chi-Square stats of 67.51 (1df), the p-value is 0.0001 and this is significant evidence to reject the null hypothesis that $\beta_W=0$. We can conclude that the carapace width is a significant predictor of the number of satellites.


Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	-3.3048	0.5422	-4.3675	-2.2420	37.14	<.0001
W	1	0.1640	0.0200	0.1249	0.2032	67.51	<.0001
Scale	0	1.0000	0.0000	1.0000	1.0000

Note:The scale parameter was held fixed.

The estimated model is: $\log (\mu_i) = -3.3048 + 0.164W_i$

The standard error of the estimated slope is 0.020, which is small, and the slope is statistically significant.

model = glm(Sa ~ W, family=poisson(link=log), data=hcrabs)
summary(model)
model$fitted.values
rstandard(model)

Just as with logistic regression, the glm function specifies the response (Sa) and predictor width (W) separated by the "~" character. Also, note the specification of the Poisson distribution and link function. The fitted (predicted) values are the estimated Poisson counts, and rstandard reports the standardized deviance residuals.

Here is the output that we should get from the summary command:

> summary(model)

Call:
glm(formula = Sa ~ W, family = poisson(link = log), data = hcrabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8526  -1.9884  -0.4933   1.0970   4.9221  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
W            0.16405    0.01997   8.216  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 567.88  on 171  degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 6

Does the model fit well? Given the value of deviance statistic of 567.879 with 171 df, the p-value is zero and the Value/DF is much bigger than 1, so the model does not fit well. The lack of fit may be due to missing data, predictors, or overdispersion. It should also be noted that the deviance and Pearson tests for lack of fit rely on reasonably large expected Poisson counts, which are mostly below five, in this case, so the test results are not entirely reliable. Still, we'd like to see a better-fitting model if possible.

Is width a significant predictor? From the "Coefficients" table, with Chi-Square stat of $8.216^2=67.50$ (1df), the p-value is 0.0001, and this is significant evidence to reject the null hypothesis that $\beta_W=0$. We can conclude that the carapace width is a significant predictor of the number of satellites.

The estimated model is: $\log (\mu_i) = -3.3048 + 0.164W_i$

The standard error of the estimated slope is 0.020, which is small, and the slope is statistically significant.

Interpretation

Since the estimate of $\beta > 0$, the wider the carapace is, the greater the number of male satellites (on average). Specifically, for each 1-cm increase in carapace width, the expected number of satellites is multiplied by $\exp(0.1640) = 1.18$.

Looking at the standardized residuals, we may suspect some outliers (e.g., the 15th observation has a standardized deviance residual of almost 5!), but these seem less obvious in the scatterplot, given the overall variability. Also, with a sample size of 173, such extreme values are more likely to occur just by chance. Still, this is something we can address by adding additional predictors or with an adjustment for overdispersion. From the observations statistics, we can also see the predicted values (estimated mean counts) and the values of the linear predictor, which are the log of the expected counts. For example, for the first observation, the predicted value is $\hat{\mu}_1=3.810$, and the linear predictor is $\log(3.810)=1.3377$.

From the above output, we see that width is a significant predictor, but the model does not fit well. We can further assess the lack of fit by plotting residuals or influential points, but let us assume for now that we do not have any other covariates and try to adjust for overdispersion to see if we can improve the model fit.

Strategies to Improve the Fit

Adjusting for Overdispersion Section

In the above model, we detect a potential problem with overdispersion since the scale factor, e.g., Value/DF, is greater than 1.

What does overdispersion mean for Poisson Regression? What does it tell us about the relationship between the mean and the variance of the Poisson distribution for the number of satellites? Recall that one of the reasons for overdispersion is heterogeneity, where subjects within each predictor combination differ greatly (i.e., even crabs with similar width have a different number of satellites). If that's the case, which assumption of the Poisson model is violated?

As we saw in logistic regression, if we want to test and adjust for overdispersion we can add a scale parameter by changing scale=none to scale=pearson; see the third part of the SAS program crab.sas labeled 'Adjust for overdispersion by "scale=pearson" '. (As stated earlier we can also fit a negative binomial regression instead)

proc genmod data=crab;
model Sa=w / dist=poi link=log scale=pearson;
output out=data2 p=pred;
run;
proc print;
run;

Below is the output when using "scale=pearson". How does this compare to the output above from the earlier stage of the code? Do we have a better fit now? Consider the "Scaled Deviance" and "Scaled Pearson chi-square" statistics. The overall model seems to fit better when we account for possible overdispersion.

The SAS System

The GENMOD Procedure


Model Information
Data Set	WORK.CRAB
Distribution	Poisson
Link Function	Log
Dependent Variable	Sa


Number of Observations Read	173
Number of Observations Used	173


Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	171	567.8786	3.3209
Scaled Deviance	171	178.4544	1.0436
Pearson Chi-Square	171	544.1570	3.1822
Scaled Pearson X2	171	171.0000	1.0000
Log Likelihood		21.5091
Full Log Likelihood		-461.5881
AIC (smaller is better)		927.1762
AICC (smaller is better)		927.2468
BIC (smaller is better)		933.4828


Algorithm converged.


Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	-3.3048	0.9673	-5.2006	-1.4089	11.67	0.0006
W	1	0.1640	0.0356	0.0942	0.2339	21.22	<.0001
Scale	0	1.7839	0.0000	1.7839	1.7839

Note:The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

The SAS System


Obs	Obs	C	S	W	Sa	Wt	pred
1	1	3	3	28.3	8	3.050	3.81034
2	2	4	3	22.5	0	1.550	1.47146
3	3	2	1	26.0	9	2.300	2.61278
4	4	4	3	24.8	0	2.100	2.14590
5	5	4	3	26.0	4	2.600	2.61278
6	6	3	3	23.8	0	2.100	1.82124
7	7	2	1	26.5	0	2.350	2.83612
8	8	4	2	24.7	0	1.900	2.11099
...	...	...	...	...	...	...	...
170	170	4	3	29.0	4	3.275	4.27400
171	171	2	1	28.0	0	2.625	3.62736
172	172	5	3	27.0	0	2.625	3.07855
173	173	3	2	24.5	0	2.000	2.04285

As we saw in logistic regression, if we want to test and adjust for overdispersion we can add a scale parameter with the family=quasipoisson option. The estimated scale parameter will be labeled as "Overdispersion parameter" in the output. (As stated earlier we can also fit a negative binomial regression instead)

model.disp = glm(Sa ~ W, family=quasipoisson(link=log), data=hcrabs)
summary.glm(model.disp)
summary.glm(model.disp)$dispersion

Below is the output when using the quasi-Poisson model. How does this compare to the output above from the earlier stage of the code? Do we have a better fit now?

> summary.glm(model.disp)

Call:
glm(formula = Sa ~ W, family = quasipoisson(link = log), data = hcrabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8526  -1.9884  -0.4933   1.0970   4.9221  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.30476    0.96729  -3.417 0.000793 ***
W            0.16405    0.03562   4.606 7.99e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 3.182205)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 567.88  on 171  degrees of freedom

With this model, the random component does not technically have a Poisson distribution any more (hence the term "quasi" Poisson) because that would require that the response has the same mean and variance. From the estimate given (Pearson $X^2/171 = 3.1822$), the variance of the number of satellites is roughly three times the size of the mean.

The new standard errors (in comparison to the model without the overdispersion parameter), are larger, (e.g., $0.0356 = 1.7839( 0.02)$ which comes from the scaled SE ($\sqrt{3.1822}=1.7839$); the adjusted standard errors are multiplied by the square root of the estimated scale parameter. Thus, the Wald statistics will be smaller and less significant.

What could be another reason for poor fit besides overdispersion? Can we improve the fit by adding other variables?

Include color as a Qualitative Predictor Section

Since we did not use the \$ sign in the input statement to specify that the variable "C" was categorical, we can now do it by using class c as seen below. The following change is reflected in the next section of the crab.sas program labeled 'Add one more variable as a predictor, "color" '.

proc genmod data=crab;
class c;
model Sa=w c / dist=poi link=log  scale=pearson type3;
run;

Note the "Class level information" on color indicates that this variable has four levels, and thus are we are introducing three indicator variables into the model. From the "Analysis of Parameter Estimates" output below we see that the reference level is level 5. We continue to adjust for overdispersion with scale=pearson , although we could relax this if adding additional predictor(s) produced an insignificant lack of fit.

The SAS System

The GENMOD Procedure


Model Information
Data Set	WORK.CRAB
Distribution	Poisson
Link Function	Log
Dependent Variable	Sa


Number of Observations Read	173
Number of Observations Used	173


Class Level Information
Class	Levels	Values
C	4	2 3 4 5


Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	168	559.3448	3.3294
Scaled Deviance	168	172.9776	1.0296
Pearson Chi-Square	168	543.2490	3.2336
Scaled Pearson X2	168	168.0000	1.0000
Log Likelihood		22.4866
Full Log Likelihood		-457.3212
AIC (smaller is better)		924.6425
AICC (smaller is better)		925.0017
BIC (smaller is better)		940.4089


Algorithm converged.


Analysis Of Maximum Likelihood Parameter Estimates
Parameter		DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept		1	-3.0974	1.0026	-5.0625	-1.1323	9.54	0.0020
W		1	0.1493	0.0375	0.0759	0.2228	15.88	<.0001
C	2	1	0.4474	0.3760	-0.2897	1.1844	1.42	0.2342
C	3	1	0.2477	0.2934	-0.3274	0.8227	0.71	0.3986
C	4	1	0.0110	0.3244	-0.6249	0.6468	0.00	0.9730
C	5	0	0.0000	0.0000	0.0000	0.0000	.	.
Scale		0	1.7982	0.0000	1.7982	1.7982

Note:The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

The estimated model is: $\log{\hat{\mu_i}}= -3.0974 + 0.1493W_i + 0.4474C_{2i} + 0.2477C_{3i} + 0.0110C_{4i}$, using indicator variables for the first three colors.

There does not seem to be a difference in the number of satellites between any color class and the reference level 5 according to the chi-squared statistics for each row in the table above. Furthermore, by the Type 3 Analysis output below we see that color overall is not statistically significant after we consider the width.


LR Statistics For Type 3 Analysis
Source	Num DF	Den DF	F Value	Pr > F	Chi-Square	Pr > ChiSq
W	1	168	15.40	0.0001	15.40	<.0001
C	3	168	0.88	0.4529	2.64	0.4507

To add the horseshoe crab color as a categorical predictor (in addition to width), we can use the following code. The change of baseline to the 5th color is arbitrary.

hcrabs$C = factor(hcrabs$C, levels=c('5','2','3','4'))
model = glm(Sa ~ W+C, family=quasipoisson(link=log), data=hcrabs)
summary(model)
anova(model, test='Chisq')

Note in the output that there are three separate parameters estimated for color, corresponding to the three indicators included for colors 2, 3, and 4 (5 as the baseline). We continue to adjust for overdispersion withfamily=quasipoisson, although we could relax this if adding additional predictor(s) produced an insignificant lack of fit.

> summary(model)

Call:
glm(formula = Sa ~ W + C, family = quasipoisson(link = log), 
    data = hcrabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0415  -1.9581  -0.5575   0.9830   4.7523  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.09740    1.00260  -3.089  0.00235 ** 
W            0.14934    0.03748   3.985  0.00010 ***
C2           0.44736    0.37604   1.190  0.23586    
C3           0.24767    0.29339   0.844  0.39978    
C4           0.01100    0.32442   0.034  0.97300    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 3.233628)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 559.34  on 168  degrees of freedom

The estimated model is: $\log{\hat{\mu_i}}= -3.0974 + 0.1493W_i + 0.4474C_{2i}+ 0.2477C_{3i}+ 0.0110C_{4i}$, using indicator variables for the first three colors.

There does not seem to be a difference in the number of satellites between any color class and the reference level 5 according to the chi-squared statistics for each row in the table above. Furthermore, by the ANOVA output below we see that color overall is not statistically significant after we consider the width.

> anova(model, test='Chisq')
Analysis of Deviance Table

Model: quasipoisson, link: log

Response: Sa

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                   172     632.79              
W     1   64.913       171     567.88 7.449e-06 ***
C     3    8.534       168     559.34    0.4507    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Is this model preferred to the one without color?

Including Color as a Numerical Predictor Section

As it turns out, the color variable was actually recorded as ordinal with values 2 through 5 representing increasing darkness and may be quantified as such. So, we next consider treating color as a quantitative variable, which has the advantage of allowing a single slope parameter (instead of multiple indicator slopes) to represent the relationship with the number of satellites.

We will run another part of the crab.sas program that does not include color as a categorical by removing the class statement for C:

proc genmod data=crab;
model Sa=w c / dist=poi link=log  scale=pearson;
run;

Compare these partial parts of the output with the output above where we used color as a categorical predictor. We are doing this to keep in mind that different coding of the same variable will give us different fits and estimates.

The SAS System

The GENMOD Procedure


Model Information
Data Set	WORK.CRAB
Distribution	Poisson
Link Function	Log
Dependent Variable	Sa


Number of Observations Read	173
Number of Observations Used	173


Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	170	560.2013	3.2953
Scaled Deviance	170	173.5034	1.0206
Pearson Chi-Square	170	548.8898	3.2288
Scaled Pearson X2	170	170.0000	1.0000
Log Likelihood		22.3878
Full Log Likelihood		-457.7495
AIC (smaller is better)		921.4990
AICC (smaller is better)		921.6410
BIC (smaller is better)		930.9589


Algorithm converged.


Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	-2.3506	1.1516	-4.6076	-0.0936	4.17	0.0412
W	1	0.1496	0.0372	0.0767	0.2224	16.20	<.0001
C	1	-0.1694	0.1111	-0.3872	0.0484	2.32	0.1274
Scale	0	1.7969	0.0000	1.7969	1.7969

Note:The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

To add color as a quantitative predictor, we first define it as a numeric variable. Although the original values were 2, 3, 4, and 5, R will by default use 1 through 4 when converting from factor levels to numeric values. So, we add 1 after the conversion.

data(hcrabs)
colnames(hcrabs) = c("C","S","W","Sa","Wt")
hcrabs$C = as.numeric(hcrabs$C)+1
model = glm(Sa ~ W+C, family=quasipoisson(link=log), data=hcrabs)
summary(model)
anova(model, test='Chisq')

Compare these partial parts of the output with the output above where we used color as a categorical predictor. We are doing this to keep in mind that different coding of the same variable will give us different fits and estimates.

> summary(model)

Call:
glm(formula = Sa ~ W + C, family = quasipoisson(link = log), 
    data = hcrabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9919  -1.9539  -0.5526   0.9836   4.7596  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.35058    1.15156  -2.041   0.0428 *  
W            0.14957    0.03716   4.025 8.55e-05 ***
C           -0.16940    0.11112  -1.524   0.1292    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 3.228764)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 560.20  on 170  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

> anova(model, test='Chisq')
Analysis of Deviance Table

Model: quasipoisson, link: log

Response: Sa

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                   172     632.79              
W     1   64.913       171     567.88 7.332e-06 ***
C     1    7.677       170     560.20    0.1231    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The estimated model is now

$\log{\hat{\mu_i}}= -2.3506 + 0.1496W_i - 0.1694C_i$

And the interpretation of the single slope parameter for color is as follows: for each 1-unit increase in the color (darkness level), the expected number of satellites is multiplied by $\exp(-.1694)=.8442$.

In terms of the fit, adding the numerical color predictor doesn't seem to help; the overdispersion seems to be due to heterogeneity. Is there something else we can do with this data? We can either (1) consider additional variables (if available), (2) collapse over levels of explanatory variables, or (3) transform the variables.

Let's consider grouping the data by the widths and then fitting a Poisson regression model that models the rate of satellites per crab.

Collapsing over Explanatory Variable Width Section

In this approach, we create 8 width groups and use the average width for the crabs in that group as the single representative value. While width is still treated as quantitative, this approach simplifies the model and allows all crabs with widths in a given group to be combined. The disadvantage is that differences in widths within a group are ignored, which provides less information overall. However, another advantage of using the grouped widths is that the saturated model would have 8 parameters, and the goodness of fit tests, based on $8-2$ degrees of freedom, are more reliable.

To account for the fact that width groups will include different numbers of crabs, we will model the mean rate $\mu/t$ of satellites per crab, where $t$ is the number of crabs for a particular width group. This variable is treated much like another predictor in the data set. The difference is that this value is part of the response being modeled and not assigned a slope parameter of its own. We will see more details on the Poisson rate regression model in the next section.

The data, after being grouped into 8 intervals, is shown in the table below. Two columns to note in particular are "Cases", the number of crabs with carapace widths in that interval, and "Width", which now represents the average width for the crabs in that interval.

    Interval     Cases   Width   AvgSa    SDSa    VarSa
      =23.25        14  22.693  1.0000  1.6641   2.7692
 23.25-24.25        14  23.843  1.4286  2.9798   8.8792
 24.25-25.25        28  24.775  2.3929  2.5581   6.5439
 25.25-26.25        39  25.838  2.6923  3.3729  11.3765
 26.26-27.25        22  26.791  2.8636  2.6240   6.8854
 27.25-28.25        24  27.738  3.8750  2.9681   8.8096
 28.25-29.25        18  28.667  3.9444  4.1084  16.8790
      >29.25        14  30.407  5.1429  2.8785   8.2858

In this approach, each observation within a group is treated as if it has the same width.

Plot of Number of Satellites by Width of Crab—All Data

Plot of Average Number of Satellites by Width— Widths Grouped

In SAS, the Cases variable is input with the OFFSET option in the Model statement. We also create a variable LCASES=log(CASES) which takes the log of the number of cases within each grouping. This is our adjustment value $t$ in the model that represents (abstractly) the measurement window, which in this case is the group of crabs with a similar width. So, $t$ is effectively the number of crabs in the group, and we are fitting a model for the rate of satellites per crab, given carapace width.

data crabgrp;
input width cases SaTotal AvgSa;
lcases=log(cases);
cards;
22.69 14 14 1.00
23.84 14 20 1.428571
24.77 28 67 2.392857
25.84 39 105 2.692308
26.79 22 63 2.863636
27.74 24 93 3.875
28.67 18 71 3.944444
30.41 14 72 5.142857
;
proc genmod data=crabgrp order=data;
model SaTotal = width / dist=poisson link=log 
 offset=lcases obstats residuals;
 output out=data3 p=pred;
run;

Here is the output. Note "Offset variable" under the "Model Information".

The SAS System

The GENMOD Procedure


Model Information
Data Set	WORK.CRABGRP
Distribution	Poisson
Link Function	Log
Dependent Variable	SaTotal
Offset Variable	lcases


Number of Observations Read	8
Number of Observations Used	8


Parameter Information
Parameter	Effect
Prm1	Intercept
Prm2	width


Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	6	6.5168	1.0861
Scaled Deviance	6	6.5168	1.0861
Pearson Chi-Square	6	6.2470	1.0412
Scaled Pearson X2	6	6.2470	1.0412
Log Likelihood		1652.1028
Full Log Likelihood		-26.4809
AIC (smaller is better)		56.9617
AICC (smaller is better)		59.3617
BIC (smaller is better)		57.1206


Algorithm converged.


Analysis Of Maximum Likelihood Parameter Estimates
Parameter	DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept	1	-3.5355	0.5760	-4.6645	-2.4065	37.67	<.0001
width	1	0.1727	0.0212	0.1311	0.2143	66.18	<.0001
Scale	0	1.0000	0.0000	1.0000	1.0000

Note:The scale parameter was held fixed.

Based on the Pearson and deviance goodness of fit statistics, this model clearly fits better than the earlier ones before grouping width. For example, the Value/DF for the deviance statistic now is 1.0861.

The estimated model is: $\log (\hat{\mu}_i/t) = -3.535 + 0.1727\mbox{width}_i$

For each 1-cm increase in carapace width, the mean number of satellites per crab is multiplied by $\exp(0.1727)=1.1885$.


Observation Statistics
Observation	SaTotal	lcases	width	Predicted Value	Linear Predictor	Standard Error of the Linear Predictor	HessWgt	Lower	Upper	Raw Residual	Pearson Residual	Deviance Residual	Std Deviance Residual	Std Pearson Residual	Likelihood Residual	Leverage	CookD	DFBETA_Intercept	DFBETA_width	DFBETAS_Intercept	DFBETAS_width
1	14	2.6390573	22.69	20.544353	3.0225861	0.1027001	20.544353	16.798637	25.12528	-6.544353	-1.443845	-1.532938	-1.732037	-1.631372	-1.710727	0.2166875	0.3681075	-0.460663	0.0164188	-0.799711	0.7733011
2	20	2.6390573	23.84	25.058503	3.2212132	0.0813848	25.058503	21.363886	29.392059	-5.058503	-1.010519	-1.047745	-1.14727	-1.106509	-1.140606	0.1659747	0.1218267	-0.24937	0.008775	-0.432905	0.4132908
3	67	3.3322045	24.77	58.849849	4.0749893	0.0657446	58.849849	51.734881	66.943322	8.1501507	1.062412	1.039205	1.2034817	1.2303572	1.2103746	0.2543698	0.2582108	0.3254536	-0.011232	0.5649873	-0.528993
4	105	3.6635616	25.84	98.608352	4.591156	0.0513763	98.608352	89.162468	109.05493	6.3916476	0.6436592	0.636887	0.740506	0.74838	0.7425635	0.2602795	0.0985341	0.144533	-0.004711	0.2509092	-0.221869
5	63	3.0910425	26.79	65.543892	4.18272	0.0448389	65.543892	60.029581	71.564747	-2.543892	-0.314219	-0.316285	-0.33944	-0.337223	-0.339149	0.1317776	0.0086301	-0.015069	0.0003426	-0.026159	0.0161352
6	93	3.1780538	27.74	84.252198	4.4338147	0.0468532	84.252198	76.859888	92.355494	8.7478019	0.9530338	0.9372168	1.0381222	1.0556422	1.0413848	0.1849521	0.1264386	-0.069134	0.0033416	-0.120017	0.1573828
7	71	2.8903718	28.67	74.1998	4.3067615	0.0562513	74.1998	66.454059	82.848368	-3.1998	-0.371468	-0.374187	-0.427757	-0.424648	-0.427029	0.2347839	0.0276639	0.0743554	-0.003055	0.129081	-0.143886
8	72	2.6390573	30.41	77.943052	4.3559785	0.0840923	77.943052	66.099465	91.908752	-5.943052	-0.673164	-0.682002	-1.017999	-1.004806	-1.010749	0.5511749	0.6199362	0.5164022	-0.02006	0.8964738	-0.944816

The residuals analysis indicates a good fit as well. From the table above we also see that the predicted values correspond a bit better to the observed counts in the "SaTotal" cells. Let's compare the observed and fitted values in the plot below:

Scatter Plot of Sa Total, Observed vs Predicted

Caption needed here

In R, the lcases variable is specified with the OFFSET option, which takes the log of the number of cases within each grouping. This is our adjustment value $t$ in the model that represents (abstractly) the measurement window, which in this case is the group of crabs with similar width. So, $t$ is effectively the number of crabs in the group, and we are fitting a model for the rate of satellites per crab, given carapace width.

# creating the 8 width groups
width.long = cut(hcrabs$W, breaks=c(0,seq(23.25, 29.25),Inf))

# summarizing per group
Cases = aggregate(hcrabs$W, by=list(W=width.long),length)$x
lcases = log(Cases)
Width = aggregate(hcrabs$W, by=list(W=width.long),mean)$x
AvgSa = aggregate(hcrabs$Sa, by=list(W=width.long),mean)$x
SdSa = aggregate(hcrabs$Sa, by=list(W=width.long),sd)$x
VarSa = aggregate(hcrabs$Sa, by=list(W=width.long),var)$x
SaTotal = aggregate(hcrabs$Sa, by=list(W=width.long),sum)$x
cbind(table(width.long), Width, AvgSa, SdSa, VarSa)

hcrabs.grp = data.frame(Width=Width, lcases=lcases, SaTotal=SaTotal)
model.grp = glm(SaTotal ~ Width, offset=lcases, family=poisson, data=hcrabs.grp)
summary(model.grp)

Here is the output. Note the "offset = lcases" under the model expression.

 summary(model.grp)

Call:
glm(formula = SaTotal ~ Width, family = poisson, data = hcrabs.grp, 
    offset = lcases)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5322  -0.7750  -0.3454   0.7151   1.0347  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.54018    0.57658  -6.140 8.26e-10 ***
Width        0.17290    0.02125   8.135 4.11e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 72.3772  on 7  degrees of freedom
Residual deviance:  6.5164  on 6  degrees of freedom

Based on the Pearson and deviance goodness of fit statistics, this model clearly fits better than the earlier ones before grouping width. For example, the Value/DF for the deviance statistic now is 1.0861. The estimated model is:

$\log (\hat{\mu}_i/t)= -3.54 + 0.1729\mbox{width}_i$

For each 1-cm increase in carapace width, the mean number of satellites per crab is multiplied by $\exp(0.1729)=1.1887$.

> rstandard(model.grp)
         1          2          3          4          5          6          7          8 
-1.7312682 -1.1474695  1.1980218  0.7447990 -0.3414285  1.0400246 -0.4260710 -1.0213629 
> predict.glm(model.grp, type="response", newdata=hcrabs.grp)
       1        2        3        4        5        6        7        8 
20.54097 25.05952 58.88382 98.57261 65.55897 84.23620 74.18733 77.96057

The residuals analysis indicates a good fit as well, and the predicted values correspond a bit better to the observed counts in the "SaTotal" cells. Let's compare the observed and fitted values in the plot below:

Caption needed here