1.3 - Discrete Distributions

Statistical inference requires assumptions about the probability distribution (i.e., random mechanism, sampling model) that generated the data. For example, for a t-test, we assume that the sample mean follows a normal distribution. Some common distributions used for discrete data are introduced in this section.

Recall, a random variable is the outcome of an experiment (i.e., a random process) expressed as a number. We tend to use capital letters near the end of the alphabet (X, Y, Z, etc.) to denote random variables. Random variables are of two types: discrete and continuous. Here we are interested in distributions of discrete random variables.

A discrete random variable X is described by its probability mass function (PMF), which we will also call its distribution, \(f(x)=P(X =x)\). The set of x-values for which \(f (x) > 0\) is called the support. The support can be finite, e.g., X can take values in \(\{0,1,2,\dots,n\}\), or countably infinite, e.g., X can take values in \(\{0,1,\dots\}\). Note that if the distribution depends on an unknown parameter \(\theta\), we can write it as \(f (x; \theta)\) or \(f(x| \theta)\).

Here are some distributions that you may encounter when analyzing discrete data.

Bernoulli distribution

The most basic of all discrete random variables is the Bernoulli.

X is said to have a Bernoulli distribution if \(X = 1\) occurs with probability \(\pi\) and \(X = 0\) occurs with probability \(1 - \pi\),

\(f(x)=\left\{\begin{array} {cl} \pi & x=1 \\ 1-\pi & x=0 \\ 0 & \text{otherwise} \end{array} \right. \)

Another common way to write it is...

\(f(x)=\pi^x (1-\pi)^{1-x}\text{ for }x=0,1\)

Suppose an experiment has only two possible outcomes, "success" and "failure," and let \(\pi\) be the probability of a success. If we let X denote the number of successes (either zero or one), then X will be Bernoulli. The mean (or expected value) of a Bernoulli random variable is

\(E(X)=1(\pi)+0(1-\pi)=\pi\),

and the variance is...

\(V(X)=E(X^2)-[E(X)]^2=1^2\pi+0^2(1-\pi)-\pi^2=\pi(1-\pi)\).
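The mean and variance above can be checked numerically by summing over the two points of the support. The following is a minimal sketch; the function name `bernoulli_pmf` and the value \(\pi = 0.3\) are our own illustrative choices, not part of the notes.

```python
def bernoulli_pmf(x, pi):
    """P(X = x) for a Bernoulli random variable with success probability pi."""
    if x not in (0, 1):
        return 0.0
    return pi**x * (1 - pi)**(1 - x)

pi = 0.3  # illustrative value (an assumption, not from the text)

# E(X) and E(X^2) as sums over the support {0, 1}
mean = sum(x * bernoulli_pmf(x, pi) for x in (0, 1))
second_moment = sum(x**2 * bernoulli_pmf(x, pi) for x in (0, 1))
variance = second_moment - mean**2

print(mean, pi)                 # mean should match pi
print(variance, pi * (1 - pi))  # variance should match pi(1 - pi)
```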

Binomial distribution

Suppose that \(X_1,X_2,\ldots,X_n\) are independent and identically distributed (iid) Bernoulli random variables, each having the distribution

\(f(x_i)=\pi^{x_i}(1-\pi)^{1-x_i}\text{ for }x_i=0,1 \text{ and } 0\le \pi \le 1\)

Let \(X=X_1+X_2+\ldots+X_n\). Then X is said to have a binomial distribution with parameters \(n\) and \(\pi\),

\(X\sim Bin(n,\pi)\).

For example, if a fair coin is tossed 100 times, the number of times heads is observed will have a binomial distribution (with \(n=100\) and \(\pi=.5\)). The binomial distribution has PMF

\(f(x)=\dfrac{n!}{x!(n-x)!} \pi^x (1-\pi)^{n-x} \text{ for }x=0,1,2,\ldots,n, \text{ and } 0\le \pi \le 1.\)

By linearity of expectation,

\(E(X)=E(X_1)+E(X_2)+\cdots+E(X_n)=n\pi\)

and, by the independence assumption,

\(V(X)=V(X_1)+V(X_2)+\cdots+V(X_n)=n\pi(1-\pi)\).
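As a numerical check of the binomial PMF and its moments, here is a short sketch for the coin-tossing example (\(n=100\), \(\pi=0.5\)). The helper name `binomial_pmf` is our own; only the Python standard library is assumed.

```python
from math import comb

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 100, 0.5  # fair coin tossed 100 times
support = range(n + 1)

total = sum(binomial_pmf(x, n, pi) for x in support)  # should be 1
mean = sum(x * binomial_pmf(x, n, pi) for x in support)
variance = sum((x - mean)**2 * binomial_pmf(x, n, pi) for x in support)

print(total)
print(mean, n * pi)                 # compare with n*pi
print(variance, n * pi * (1 - pi))  # compare with n*pi*(1 - pi)
```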

Note that X will not have an exact binomial distribution if the probability of success \(\pi\) is not constant from trial to trial or if the trials are not independent (i.e., the outcome on one trial alters the probability of an outcome on another trial). However, the binomial distribution can still serve as an effective approximation if these violations are negligible.

Example: Smartphone users

For example, consider sampling 20 smartphone users in the U.S. and recording X = the number that use Android. If the nationwide percentage of Android users is \(\pi\), then X is approximately binomial with 20 trials and success probability \(\pi\), even though technically \(\pi\) would change slightly each time a user is pulled out of the population for sampling. As long as the population (all U.S. smartphone users) is large relative to the sample, this issue is negligible. If this is not the case, however, then we should account for this, which is what the hypergeometric distribution does.

Hypergeometric distribution

Suppose there's a population of \(n\) objects with \(n_1\) of type 1 (success) and \(n_2 = n - n_1\) of type 2 (failure), and \(m\) (less than \(n\)) objects are sampled without replacement from this population. Then, the number of successes X among the sample is a hypergeometric random variable with PMF

\(\displaystyle f(x) = \dfrac{\binom{n_1}{x}\binom{n_2}{m - x}}{\binom{n}{m}},\quad x = \max(0, m-n_2), \ldots, \min(n_1, m)\)

The restrictions are needed in the support because we cannot draw more successes or failures in the sample than exist in the population. The expectation and variance of X are given by

\(E(X) =\dfrac{n_1m}{n}\) and \(V(X)=\dfrac{n_1n_2m(n-m)}{n^2(n-1)}\).
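The sketch below checks the hypergeometric mean and variance formulas numerically for a small population, and then illustrates the earlier remark that the binomial is a good approximation when the population is large relative to the sample. All numeric values (a population of 50 with 20 successes, and a population of 100,000 with 40,000 successes) are made up for illustration, and the function names are our own.

```python
from math import comb

def hypergeom_pmf(x, n, n1, m):
    """P(X = x) successes when m objects are drawn without replacement
    from n objects, n1 of which are successes."""
    n2 = n - n1
    if x < max(0, m - n2) or x > min(n1, m):
        return 0.0
    return comb(n1, x) * comb(n2, m - x) / comb(n, m)

def binomial_pmf(x, n_trials, pi):
    """P(X = x) for X ~ Bin(n_trials, pi)."""
    return comb(n_trials, x) * pi**x * (1 - pi)**(n_trials - x)

# Small population: compare numerical moments with the formulas.
n, n1, m = 50, 20, 10
n2 = n - n1
support = range(max(0, m - n2), min(n1, m) + 1)
mean = sum(x * hypergeom_pmf(x, n, n1, m) for x in support)
variance = sum((x - mean)**2 * hypergeom_pmf(x, n, n1, m) for x in support)
print(mean, n1 * m / n)
print(variance, n1 * n2 * m * (n - m) / (n**2 * (n - 1)))

# Large population, small sample: hypergeometric vs. binomial PMF.
big_n, big_n1, sample = 100_000, 40_000, 20
max_gap = max(
    abs(hypergeom_pmf(x, big_n, big_n1, sample)
        - binomial_pmf(x, sample, big_n1 / big_n))
    for x in range(sample + 1)
)
print(max_gap)  # largest pointwise difference between the two PMFs
```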

Poisson distribution

The Poisson distribution is another important one for modeling discrete events occurring in time or in space.

The PMF of a Poisson distribution is...

\(f(x)= P(X=x)= \dfrac{\lambda^x e^{-\lambda}}{x!},\quad x=0,1,2,\ldots, \mbox{ and } \lambda>0.\)

For example, let \(X\) be the number of emails arriving at a server in one hour. Suppose that in the long run, the average number of emails arriving per hour is \(\lambda\). Then it may be reasonable to assume \(X \sim Poisson(\lambda)\). For the Poisson model to hold, however, the average arrival rate \(\lambda\) must be fairly constant over time; i.e., there should be no systematic or predictable changes in the arrival rate. Moreover, the arrivals should be independent of one another; i.e., the arrival of one email should not make the arrival of another email more or less likely.

The Poisson is also the limiting case of the binomial. Suppose that \(X\sim Bin(n,\pi)\) and let \(n\rightarrow\infty\) and \(\pi\rightarrow 0\) in such a way that \(n\pi\rightarrow\lambda\) where \(\lambda\) is a constant. Then, in the limit, \(X\sim Poisson(\lambda)\). Because of this, it is useful as an approximation to the binomial when \(n\) is large and \(\pi\) is small. That is, if \(n\) is large and \(\pi\) is small, then

\(\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x} \approx \dfrac{\lambda^x e^{-\lambda}}{x!}\)

where \(\lambda = n\pi\). The right-hand side above is typically easier to calculate than the left-hand side.
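To see how close the approximation can be, the following sketch compares the two sides for \(n=1000\) and \(\pi=0.003\), so that \(\lambda=3\). These values and the function names are our own illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

n, pi = 1000, 0.003  # large n, small pi (illustrative values)
lam = n * pi

# Print the two PMFs side by side for the first few values of x.
for x in range(8):
    print(x, binomial_pmf(x, n, pi), poisson_pmf(x, lam))

# Largest pointwise difference over the values of x that carry
# almost all of the probability.
max_gap = max(abs(binomial_pmf(x, n, pi) - poisson_pmf(x, lam))
              for x in range(30))
print(max_gap)
```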

Another interesting property of the Poisson distribution is that \(E(X) = V(X) = \lambda\). This equality may be too restrictive for data in which the variance exceeds the mean, a situation known as overdispersion, which may require an adjustment to the Poisson assumption or a different distribution altogether. One such option is the negative binomial.

Negative-Binomial distribution

Whereas the binomial distribution describes the random number of successes in a fixed number of trials, the negative binomial distribution describes the random number of failures before observing a fixed number r of successes.

The PMF of a Negative-Binomial distribution is...

\(\displaystyle f(x)={r+x-1\choose x}\pi^r(1-\pi)^{x},\quad\mbox{for }x=0,1,\ldots\)

Like the Poisson, the negative binomial distribution can also be used to model counts of phenomena. Unlike the Poisson, it has an additional parameter that allows the variance to differ from the mean, which often gives a better fit to overdispersed data. Specifically, we have for the negative binomial distribution

\(E(X)=\dfrac{r(1-\pi)}{\pi}=\mu\mbox{ and } V(X)=\mu+\dfrac{1}{r}\mu^2\)
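The sketch below checks these moment formulas numerically and shows that the variance exceeds the mean. It assumes an integer \(r\) (so the binomial coefficient can be computed with `math.comb`), uses the made-up values \(r=3\) and \(\pi=0.4\), and truncates the infinite support at 500, beyond which the remaining probability is negligible for these values.

```python
from math import comb

def negbin_pmf(x, r, pi):
    """P(X = x): probability of x failures before the r-th success,
    for integer r and success probability pi."""
    if x < 0:
        return 0.0
    return comb(r + x - 1, x) * pi**r * (1 - pi)**x

r, pi = 3, 0.4        # illustrative values (assumptions)
support = range(500)  # truncation of the infinite support

mean = sum(x * negbin_pmf(x, r, pi) for x in support)
variance = sum((x - mean)**2 * negbin_pmf(x, r, pi) for x in support)

mu = r * (1 - pi) / pi
print(mean, mu)                   # compare with r(1 - pi)/pi
print(variance, mu + mu**2 / r)   # compare with mu + mu^2 / r
```

Here the variance is larger than the mean by the term \(\mu^2/r\), which is the extra flexibility that a Poisson model lacks.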

Multinomial distribution

The multinomial distribution generalizes the binomial to cases involving \(k\) outcomes with probabilities \(\pi_1,\ldots,\pi_k\), where \(\pi_1+\cdots+\pi_k=1\). We still need a fixed number of independent trials n, but instead of counting only the number of one particular "success" outcome, we let \(X_j\) count the number of times the \(j\)th outcome occurs, resulting in the multivariate random vector \((X_1,\ldots,X_k)\).

The PMF of a Multinomial distribution is...

\(f(x_1,\ldots,x_k)=\dfrac{n!}{x_1!x_2!\cdots x_k!} \pi_1^{x_1}\pi_2^{x_2}\cdots \pi_k^{x_k}\), for nonnegative integers \(x_1,\ldots,x_k\) with \(x_1+\cdots+x_k=n\).

In addition to the mean and variance of \(X_j\), given by

\(E(X_j)=n\pi_j\) and \(V(X_j)=n\pi_j(1-\pi_j)\),

there is also a covariance between different outcome counts \(X_i\) and \(X_j\):

\(cov(X_i,X_j)=-n\pi_i\pi_j\)

Intuitively, this negative relationship makes sense, given the fixed total n. In other words, the more often one outcome occurs, the less often other outcomes must occur if \(X_1+\cdots+X_k=n\) is to be preserved. Finally, note that if other outcomes are lumped together as "failure", each marginal count \(X_j\) has a binomial distribution with n trials and success probability \(\pi_j\).
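For a small case, the negative covariance can be verified by enumerating every possible outcome vector. The sketch below uses \(n=4\) trials and \(k=3\) outcomes with probabilities \((0.5, 0.3, 0.2)\); these values and the function name `multinomial_pmf` are our own illustrative assumptions.

```python
from itertools import product
from math import factorial

def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_k = x_k) with n = sum(counts) trials."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an integer at every step
    prob = 1.0
    for c, p in zip(counts, probs):
        prob *= p**c
    return coef * prob

n = 4
probs = (0.5, 0.3, 0.2)  # illustrative values (assumptions)

# All count vectors (x_1, x_2, x_3) of nonnegative integers summing to n.
outcomes = [c for c in product(range(n + 1), repeat=len(probs))
            if sum(c) == n]
pmf = {c: multinomial_pmf(c, probs) for c in outcomes}

total = sum(pmf.values())                    # should be 1
e1 = sum(c[0] * p for c, p in pmf.items())   # E(X_1)
e2 = sum(c[1] * p for c, p in pmf.items())   # E(X_2)
cov12 = sum(c[0] * c[1] * p for c, p in pmf.items()) - e1 * e2

print(total)
print(e1, n * probs[0])
print(cov12, -n * probs[0] * probs[1])  # compare with -n*pi_1*pi_2
```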

Note on Technology

There are built-in R and SAS functions to compute various quantities for these distributions or to generate random samples.

In R, at the prompt type help(Binomial), help(NegBinomial), help(Poisson), etc. to learn more.

See the SAS User's Guide for examples.
