Review of the Poisson Distribution

Poisson regression is a type of a GLM model where the random component is specified by the Poisson distribution of the response variable which is a count. Before we look at the Poisson regression model, let’s quickly review the Poisson distribution.

We saw Poisson distribution and Poisson sampling at the beginning of the semester. In the analysis of the World Cup Soccer data, where we estimated the mean number of goals per team, and expected probabilities of teams scoring a certain number of goals (see Supplemental Review, and Lesson 1 on One-way Frequency Tables). If you feel comfortable with those already covered materials, you may skip the notes below and proceed to next section.

Situations in which there are many opportunities for some phenomena to occur but the chance that the phenomenon will occur in any given time interval, region of space or whatever is very small, lead to the distribution of the number of occurrences of the phenomena having a Poisson distribution. Here are some examples:

Number of earthquakes in a region (for example, California, Indonesia, Iran, Turkey, Mexico) in a specified period (five years?)of magnitudes greater than 5.0
Number of times lightning strikes in a 30 minute period in a region (like the state of Colorado)
Number of times an elderly person falls in a month.
Number of accidents on a highway in a certain area in a specified time
Number of misprints per page of a published manuscript.
Number of defects on a rug of size 9’ by 12’.
Number of male ‘satellites’ in the nesting area of a female crab
Number of telephone calls received at small business in a one-hour period.
Number of customers that enter a bank in a one hour period.
Number of errors (missing pulses?)detected on a computer disk
Number of cracks in a section of a highway.
Number of PC’s having a disk failure in a one day period at a moderately large company.

Example - Vacancies in the U.S. Supreme Court

Let’s look at one example in detail-- Vacancies in the U.S. Supreme Court. There are 9 members (Justices) of the U.S. Supreme Court. They are appointed by the President of the United States for life (or until they choose to resign). Let Y be the number of vacancies that occur in a given year in the Court. There are many opportunities for a Justice to die. Break up a year into hours (8760 hours in a non-leap year). A Justice could die (or resign) in any one of these hours, but the chance is very small. This is the kind of situation in which we would expect Y to have a Poisson distribution. Over the history of the court, the average number of vacancies per year has been about 0.5. The variable Y has a Poisson distribution with parameter λ = rate at which vacancies occur: λ = 0.5. A study of vacancies in the Court was once conducted over the period 1837-1932, spanning 96 years.

The number of vacancies by year would look like this:

Year	# of Vacancies
1837	1
1838	2
1839	0
1840	2
....	..
1930	1
1931	0
1932	0

Here is the distribution of the number of vacancies Y that occurred:

y	freq	proportion	(y)(proportion)
0	59	.6146	.0000
1	27	.2812	.2812
2	9	.0938	.1876
3	1	.0104	.0312
			= .5000

The average number of vacancies per year is [(56)(0) + (27)(1) + (9)(2) + (1)(3)]/96 = 48/96 = .5 = (0)(.6146) + (1)(.2812) + (2)(.0938) + (3)(.0104) = .5000.

Here are the theoretical probabilities of Y = 0, 1, 2, …, 7 vacancies assuming a Poisson distribution with parameter (rate) 0.5:

x	p(x)	CumProbx
0	0.606531	0.60653
1	0.303265	0.90980
2	0.075816	0.98561
3	0.012636	0.99825
4	0.001580	0.99983
5	0.000158	0.99999
6	0.000013	1.00000
7	0.000001	1.00000

We find the following from this: Prob(exactly 2 vacancies) = Prob(Y = 2) = .075816 and Prob(Y ≤ 2) = 0.98561 = Prob(at most 2 vacancies) = Prob (2 or fewer vacancies).

The Poisson distribution has mean (expected value) λ = 0.5 = μ and variance σ² = λ = 0.5, that is, the mean and variance are the same. We will later look at Poisson regression: we assume the response variable has a Poisson distribution (as an alternative to the normal

The corresponding probabilities for a rate = 2.0 (number of vacancies in four years) is as follows:

4 yr number	Prob	CumProb
0	0.135335	0.135335
1	0.270671	0.406006
2	0.270671	0.676676
3	0.180447	0.857123
4	0.090224	0.947347
5	0.036089	0.983436
6	0.012030	0.995466
7	0.003437	0.998903
8	0.000859	0.999763
9	0.000191	0.999954

For a rate of 2 per term (4 years), the mean and variance are both given by λ = 2.0

Inference for a Poisson Parameter λ

Just as we can carry out inferences for a proportion π, we can do similarly for a parameter λ of a Poisson distribution. Let Y₁, Y₂, Y₃, …, Y_N be a random sample from a Poisson population. For example, suppose that the number Y of vacancies in the U.S. Supreme Court has a Poisson distribution with parameter λ. We already looked at data on vacancies in the period 1837-1932, in which the average rate turned out to be exactly 0.50. Here are the data on vacancies in the period 1933-1990, spanning 58 years:

0 0 0 0 1 1 2 1 3 0 1 0 1 1 0 0 2 0 0 0 1 0 1 1 1 1 0 0 0
2 0 0 1 0 1 0 1 1 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 1 0 1

Here is a summary of the number of vacancies:

y	count	proportion
0	34	.586
1	18	.310
2	5	.086
3	1	.017

The average rate of vacancies in these 58 years is [(18)(1) + (5)(2) + (1)(3)] / 58 = 0.5345.

Was the rate over the past 58 years 0.5 or was it greater than 0.5 (note: this seems like a reasonable null hypothesis but it was suggested by the earlier data—nevertheless, let’s test that the rate is 0.5)? How can we test this?

Fact: The sum of Poisson random variables has a Poisson distribution with parameter the sum of the parameters of the individual variables: Assume Y_i has a Poisson distribution with parameter λ_i. Then

Y = Σ Y_i has a Poisson distribution with parameter λ = Σ λ_i .

Does this make sense? Suppose the rate at which events occur in one 'unit' is λ₁ and the rate at which they occur in a second unit is λ₂. For example, the rate at which vacancies in occur in one particular year is 1 and in another year it is 2. Then in the two years the rate is 1 + 2 = 3. Suppose the rate is λ per year. Then in 58 years the rate is 58 λ. We apply this fact to perform a test on the rate.

The null hypothesis says the rate is 0.50/year which means the rate for 58 years is (58)(0.50) = 29--we expect 29 vacancies in 58 years. We actually observed 31. What is the chance we'd observe 31 or more vacancies in 58 years? We find this probability (p-value) using Minitab, SAS, R or whatever your favorite stat software package is.

We show here how to do it in Minitab. Here is a part of the cumulative distribution of a Poisson distribution with λ = 29; Click Calc>Probability Distributions>Poisson. Put the numbers 0, 1, 2, …, 58 in C1. Click Calc>Make Patterned Data>simple set of Numbers. Store Patterned data in C1 (which is labeled below as 'y'), first number 0, last number 58, in steps of 1 (the default). Choose 'Cumulative Probability' option, type in 29 as the rate (mean), and choose C1 with the numbers 0-58 in it and store in C2 (which is labeled below as 'prob(y)'). Here is what you’d get:

** We get Prob(Y ≥ 31)

= 1 - Prob(Y ≤ 30) = 1 - .62058

= 0.37952 ~ 0.38 = p-value.

We would not reject the null hypothesis.

A probability histogram of the Poisson distribution with λ = 29 is given below. What does the distribution look like? Yeah, normal!

plot

Fact: if λ is large, one can approximate Poisson probabilities using the normal distribution with mean λ and standard deviation √λ.

Example: Find Prob(Y ≥ 31) using the normal approximation. Prob(Y ≥ 31) ~ Prob[Z > (30.5 - 29) / √29.] = Prob[ Z > .2785] = 1 - Prob(Z < .2785) = 1 - 0.6097 = .3903 ~ .39.

The approximation is excellent! (compare .38 with .39). Because of the above, we can carry out tests and calculate confidence intervals even for samples of size N = 1!

Example - Martial Arguments

Suppose a married couple, when asked how many 'arguments' they have per year, say 25. This is a rate, λ, and it is reasonable to assume the number Y of arguments/year has a Poisson distribution. Suppose they keep a careful count of their arguments and it turns out that they had 38. Should we believe their claim that the rate is 25 per year, vs. an alternative that it is greater?

If the null hypothesis is true, Y has a Poisson distribution with mean 25 and variance 25, so the standard deviation is 5. We approximate the probability of getting 38 or more arguments in a year using the normal distribution:

Cumulative Distribution Function

Normal with mean = 25.0000 and standard deviation = 5.00000

x	P( X <= x)
37.5000	0.9938

The p-value of the test is 1 - .9938 =.0062

How large does the rate parameter need to be to use the normal approximation? Probability plots for λ = 5 and 10 are given below.

The approximation looks pretty good for a rate of 10 but is rather skewed for a rate of 5.

Recommendation: Use the normal approximation if λ ≥ 10.

plot $plot\$

Tests and Confidence Intervals for the Difference of Two Poisson Parameters

Suppose we wish to compare two Poisson rates λ₁ and λ₂. We consider two cases, although the first is a special case of the second.

Case 1: Equal sample sizes. We observe Y₁₁, Y₁₂, …, Y_1n, and Y₂₁, Y₂₂, …, Y_2n, The test of the hypothesis H₀ : λ₁ = λ₂. vs. H_A : λ₁ ≠ λ₂. is based on the total number of occurrences of the phenomenon in each sample:

Y₁ = Σ Y_1j, and Y₂ = Σ Y_2j, .

If the null hypothesis is true, then the expected number of occurrences of the phenomenon are the same, or the proportion of occurrences in each sample is ½ the total N = Y₁ + Y₂ . Thus, we can test the hypothesis H₀ : λ₁ = λ₂ by testing the hypothesis that the proportion π of occurrences in each sample is .50. For N large, we use the z-test about one proportion

formula

We reject H₀ : λ₁ = λ₂ vs. H_A : λ₁ ≠ λ₂. if Z < -z_α/2 or if Z > z_α/2 .

Case 2. Unequal sample sizes. We observe Y₁₁, Y₁₂, …, Y_1n1, and Y₂₁, Y₂₂, …, Y_2n2, The test is the same as before but π = ½ is replaced by n₁/n , with n = n₁+ n₂ = total sample in the two samples. π = n₁/n is the proportion of the sample size from the first sample. The test statistic is:

formula

Example 1. The Robinson’s have Y₁ = 21 arguments during a given year and the Hu’s Y₂ = 15. Are they arguing at the same rate? Here, N = 21 + 15 = 36 and π = .50. Thus

formula

For α = .05, z_α/2 = 1.96, so we don’t reject.

Example 2. Vacancies in the U.S. Supreme Court. There were Y₁ = 48 vacancies in the U.S. supreme Court in the 96 years from 1837 to 1932 and Y₂ = 31 in the 58 years from 1933-1990. The total number of years is 96 + 58 =154, so the proportion of observations in the first sample (period from 1837-1932) = 96/154 =.6234. The total number of vacancies in the two samples is N = 48 + 31 = 79. If the two rates are equal, then we’d expect 62.34% of the vacancies to have occurred in the first 96 years. The test statistic Z is given by

formula

We reject the null hypothesis if |Z| > 1.96 (the critical value for α = .05). In this case, we would not reject the hypothesis that the rates are the same in the two periods. We would then estimate rate (based on 154 years) by 79/154 = .513 (almost exactly equal to the proportion of live male births in the U.S. (just an oddity!).