2.3 - The Multinomial Distribution

Following up on our brief introduction to this extremely useful distribution, we go into more detail here in preparation for the goodness-of-fit test coming up. Recall that the multinomial distribution generalizes the binomial to accommodate more than two categories. For example, what if the respondents in a survey had three choices:

I feel optimistic.
I don't feel optimistic.
I'm not sure.

If we separately count the number of respondents answering each of these and collect them in a vector, we can use the multinomial distribution to model the behavior of this vector.

Properties of the Multinomial Distribution

The multinomial distribution arises from an experiment with the following properties:

a fixed number \(n\) of trials
each trial is independent of the others
each trial has \(k\) mutually exclusive and exhaustive possible outcomes, denoted by \(E_1, \dots, E_k\)
on each trial, \(E_j\) occurs with probability \(\pi_j , j = 1, \dots , k\).

If we let \(X_j\) count the number of trials for which outcome \(E_j\) occurs, then the random vector \(X = \left(X_1, \dots, X_k\right)\) is said to have a multinomial distribution with index \(n\) and parameter vector \(\pi = \left(\pi_1, \dots, \pi_k\right)\), which we denote as

\(X ∼ Mult\left(n, \pi\right)\)

In most problems, \(n\) is known (e.g., it will represent the sample size). Note that we must have \(\pi_1 + \cdots + \pi_k = 1\) and \(X_1+\cdots+X_k=n\).

Marginal Counts: The individual or marginal components of a multinomial random vector are binomial and have a binomial distribution. That is, if we focus on the \(j\)th category as "success" and all other categories collectively as "failure", then \(Xj \sim Bin\left(n, \pi_j\right)\), for \(j=1,\ldots,k\).

2.3.1 - Distribution function

The function that relates a given value of a random variable to its probability is known as the distribution function. As we saw with maximum likelihood estimation, this can also be viewed as the likelihood function with respect to the parameters \(\pi_k\).

The distribution function

The probability that \(X = \left(X_1, \dots, X_k \right)\) takes a particular value \(x = \left(x_1, \dots , x_k \right)\) is

\(f(x)=\dfrac{n!}{x_1!x_2!\cdots x_k!}\pi_1^{x_1} \pi_2^{x_2} \cdots \pi_k^{x_k}\)

The possible values of X are the set of x-vectors such that each \(x_j ∈ {0, 1, . . . , n}\) and \(x_1 + \dots + x_k = n\).

Example: Jury Selection

Suppose that the racial/ethnic distribution in a large city is given by the table that follows. Consider these three options as the parameters of a multinomial distribution.

Black	Hispanic	Other
20%	15%	65%

Suppose that a jury of twelve members is chosen from this city in such a way that each resident has an equal probability of being selected independently of every other resident. There are a number of questions that we can ask of this type of distribution.

Let's find the probability that the jury contains:

three Black, two Hispanic, and seven Other members;
four Black and eight Other members;
at most one Black member.

To solve this problem, let \(X = \left(X_1, X_2, X_3\right)\) where \(X_1 =\) number of Black members, \(X_2 =\) number of Hispanic members, and \(X_3 =\) number of Other members. Then \(X\) has a multinomial distribution with parameters \(n = 12\) and \(\pi = \left(.20, .15, .65\right)\). The answer to the first part is

\begin{align} P(X_1=3,X_2=2,X_3=7) &= \dfrac{n!}{x_1!x_2!x_3!} \pi_1^{x_1}\pi_2^{x_2}\pi_3^{x_3}\\ &= \dfrac{12!}{3!2!7!}(0.20)^3(0.15)^2(0.65)^7\\ &= 0.0699\\ \end{align}

The answer to the second part is

\begin{align} P(X_1=4,X_2=0,X_3=8) &= \dfrac{12!}{4!0!8!}(0.20)^4(0.15)^0(0.65)^8\\ &= 0.0252\\ \end{align}

For the last part, note that "at most one Black member" means \(X_1 = 0\) or \(X_1 = 1\). \(X_1\) is a binomial random variable with \(n = 12\) and \(\pi_1 = .2\). Using the binomial probability distribution,

\(P(X_1=0) = \dfrac{12!}{0!12!}(0.20)^0(0.8)^{12}= 0.0687\)

and

\(P(X_1=1) = \dfrac{12!}{1!11!}(0.20)^1(0.8)^{11}= 0.2061\)

Therefore, the answer is:

\(P\left(X_1 = 0\right) + P\left(X_1 = 1\right) = 0.0687 + 0.2061 =\) \(0.2748\)

2.3.2 - Moments

Many of the elementary properties of the multinomial can be derived by decomposing \(X\) as the sum of iid random vectors,

\(X=Y_1+\cdots+Y_n\)

where each \(Y_i \sim Mult\left(1, \pi\right)\). In this decomposition, \(Y_i\) represents the outcome of the \(i\)th trial; it's a vector with a 1 in position \(j\) if \(E_j)\) occurred on that trial and 0s in all other positions. The elements of \(Y_i\) are correlated Bernoulli random variables. For example, with \(k=2\) possible outcomes on each trial, then \(Y_i=(\# E_1,\# E_2)\) on the \(i\)th trial, and the possible values of \(Y_i\) are

(1, 0) with probability \(\pi_1\),

(0, 1) with probability \(\pi_2 = 1− \pi_1\).

Because the individual elements of \(Y_i\) are Bernoulli, the mean of \(Y_i\) is \(\pi = \left(\pi_1, \pi_2\right)\), and its covariance matrix is

\begin{bmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) \end{bmatrix}

Establishing the covariance term (off-diagonal element) requires a bit more work, but note that intuitively it should be negative because exactly one of either \(E_1\) or \(E_2\) must occur.

More generally, with \(k\) possible outcomes, the mean of \(Y_i\) is \(\pi = \left(\pi_1, \dots , \pi_k\right)\), and the covariance matrix is

\begin{bmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & \cdots & -\pi_1\pi_k \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & \cdots & -\pi_2\pi_k \\ \vdots & \vdots & \ddots & \vdots \\ -\pi_1\pi_k & -\pi_2\pi_k & \cdots & \pi_k(1-\pi_k) \end{bmatrix}

And finally returning to \(X=Y_1+\cdots+Y_n\) in full generality, we have that

\(E(X)=n\pi=(n\pi_1,\ldots,n\pi_k)\)

with covariance matrix

\begin{bmatrix} n\pi_1(1-\pi_1) & -n\pi_1\pi_2 & \cdots & -n\pi_1\pi_k \\ -n\pi_1\pi_2 & n\pi_2(1-\pi_2) & \cdots & -n\pi_2\pi_k \\ \vdots & \vdots & \ddots & \vdots \\ -n\pi_1\pi_k & -n\pi_2\pi_k & \cdots & n\pi_k(1-\pi_k) \end{bmatrix}

Because the elements of \(X\) are constrained to sum to \(n\), this covariance matrix is singular. If all the \(\pi_j\)s are positive, then the covariance matrix has rank \(k-1\). Intuitively, this makes sense since the last element \(X_k\) can be replaced by \(n − X_1− \dots − X_{k−1}\); there are really only \(k-1\) "free" elements in \(X\). If some elements of \(\pi\) are zero, the rank drops by one for every zero element.

2.3.3 - Parameter space

If we don't impose any restrictions on the parameter

\(\pi=(\pi_1,\pi_2,\ldots,\pi_k)\)

other than the logically necessary constraints

\(\pi_j \in [0,1],j=1,\ldots,k\) (1)

and

\(\pi_1+\pi_2+\ldots+\pi_k=1\) (2)

then the parameter space is the set of all \(\pi\)-vectors that satisfy (1) and (2). This set is called a simplex. In the special case of k = 3, we can visualize \(\pi = \left(\pi_1, \pi_2, \pi_3\right)\) as a point in three-dimensional space. The simplex S is the triangular portion of a plane with vertices at (1, 0, 0), (0, 1, 0) and (0, 0, 1):

More generally, the simplex is a portion of a (k − 1)-dimensional hyperplane in k-dimensional space. Alternatively, we can replace

\(\pi_k\text{ by }1-\pi_1-\pi_2-\cdots-\pi_{k-1}\)

because it's not really a free parameter and view the simplex in (k − 1)-dimensional space. For example, with k = 3, we can replace \(\pi_3\) by \(1 − \pi_1− \pi_2\) and view the parameter space as a triangle:

2.3.4 - Maximum Likelihood Estimation

If \(X \sim Mult\left(n, \pi\right)\) and we observe \(X = x\), then the loglikelihood function for \(\pi\) is

\(L(\pi)=\log\dfrac{n!}{n_1!\cdots n_k!}+x_1 \log\pi_1+\cdots+x_k \log\pi_k\)

We usually ignore the leading factorial coefficient because it doesn't involve \(\pi\) and will not influence the point where \(L\) is maximized. Using multivariate calculus with the constraint that

\(\pi_1+\ldots+\pi_k=1\)

the maximum is achieved at the vector of sample proportions:

\begin{align} \hat{\pi} = \dfrac{1}{n}x= (x_1/n,x_2/n,\ldots,x_k/n)\\ \end{align}

2.3.5 - Fusing and Partitioning Cells

We can collapse a multinomial vector by fusing cells (i.e. by adding some of the cell counts \(X_j\) together). If

\(X=(X_1,\ldots,X_k)\sim Mult(n,\pi)\)

where \(\pi = \left(\pi_1, \dots , \pi_k\right)\), then

\(X^\ast=(X_1+X_2,X_3,X_4,\ldots,X_k)\)

is also multinomial with the same index \(n\) and modified parameter \(\pi* = \left(\pi_1 + \pi_2, \pi_3, \dots , \pi_k\right)\). In the multinomial experiment, we are simply fusing the events \(E_1\) and \(E_2\) into the single event "\(E_1\) or \(E_2\)". Because these events are mutually exclusive,

\(P(E_1\text{ or }E_2)=P(E_1)+P(E_2)=\pi_1+\pi_2\)

We can also partition the multinomial by conditioning on (treating as fixed) the totals of subsets of cells. For example, consider the conditional distribution of \(X\) given that...

\(X_1+X_2=z\)

\(X_3+X_4+\cdots+X_k=n-z\)

The subvectors \(\left(X_1, X_2\right)\) and \(\left(X_3, X_4, \dots, X_k \right)\) are conditionally independent and multinomial,

\((X_1,X_2)\sim Mult\left[z,\left(\dfrac{\pi_1}{\pi_1+\pi_2},\dfrac{\pi_2}{\pi_1+\pi_2}\right)\right]\)

\((X_3,\ldots,X_k)\sim Mult\left[n-z,\left(\dfrac{\pi_3}{\pi_3+\cdots+\pi_k},\cdots,\dfrac{\pi_k}{\pi_3+\cdots+\pi_k}\right)\right]\)

The joint distribution of two or more independent multinomials is called the "product-multinomial." If we condition on the sums of non-overlapping groups of cells of a multinomial vector, its distribution splits into the product-multinomial. The parameter for each part of the product-multinomial is a portion of the original \(\pi\) vector, normalized to sum to one.

2.3.6 - Relationship between the Multinomial and the Poisson

Suppose that \(X_{1}, \dots, X_{k}\) are independent Poisson random variables,

\(\begin{aligned}&X_{1} \sim P\left(\lambda_{1}\right)\\&X_{2} \sim P\left(\lambda_{2}\right)\\&...\\&X_{k} \sim P\left(\lambda_{k}\right)\end{aligned}\)

where the \(\lambda_{j}\)'s are not necessarily equal. Then the conditional distribution of the vector

\(X=(X_1,\ldots,X_k)\)

given the total \(n=X_1+\ldots+X_k\) is \(Mult\left(n, \pi\right)\), where \(\pi=(\pi_1,\ldots,\pi_k)\), and

\(\pi_j=\dfrac{\lambda_j}{\lambda_1+\cdots+\lambda_k}\)

That is, \(\pi\) is simply the vector of \(\lambda_{j}\)s normalized to sum to one.

This fact is important because it implies that the unconditional distribution of \(\left(X_{1}, \dots, X_{k}\right)\) can be factored into the product of two distributions: a Poisson distribution for the overall total,

\(n\sim P(\lambda_1+\lambda_2+\cdots+\lambda_k)\),

and a multinomial distribution for \(X = \left(X_{1}, \dots, X_{k}\right)\) given \(n\),

\(X\sim Mult(n,\pi)\)

The likelihood factors into two independent functions, one for \(\sum\limits_{j=1}^k \lambda_j\) and the other for \(\pi\). The total \(n\) carries no information about \(\pi\) and vice-versa. Therefore, likelihood-based inferences about \(\pi\) are the same whether we regard \(X_{1}, \dots, X_{k}\) as sampled from \(k\) independent Poissons or from a single multinomial, and any estimates, tests, etc. for \(\pi\) or functions of \(\pi\) will be the same, whether we regard \(n\) as random or fixed.

Example: Vehicle Color

Suppose, while waiting at a busy intersection for one hour, we record the color of each vehicle as it drives by. Let

\(X_{1} =\) number of white vehicles

\(X_{2} =\) number of black vehicles

\(X_{3} =\) number of silver vehicles

\(X_{4} =\) number of red vehicles

\(X_{5} =\) number of blue vehicles

\(X_{6} =\) number of green vehicles

\(X_{7} =\) number of any other color

In this experiment, the total number of vehicles observed, \(n=X_1+\cdots+X_7\) is random. (It would have been fixed if, for example, we had decided to classify the first \(n=500\) vehicles we see. But because we decided to wait for one hour, \(n\) is random.)

In this case, it's reasonable to regard the \(X_{j}\)s as independent Poisson random variables with means \(\lambda_{1}, \ldots, \lambda_{7}\). But if our interest lies not in the \(\lambda_{j}\)s but in the proportions of various colors in the vehicle population, inferences about these proportions will be the same whether we regard the sample sizes \(n_{j}\) as random or fixed. That is, we can proceed as if

\(X=(X_1,\ldots,X_7)\sim Mult(n,\pi)\)

where \(\pi = \left(\pi_{1}, \dots, \pi_{7}\right)\), even though \(n\) is actually random.

2.3.7 - Chi-Square Approximation

Recall for large \(n\) that the chi-square distribution (\(\nu=1\)) may be used as an approximation to \(X\sim Bin(n,\pi)\):

\( \left(\dfrac{X-n\pi}{\sqrt{n\pi(1-\pi)}}\right)^2 \)

With a little algebraic manipulation, we can expand this into parts due to successes and failures:

\( \left(\dfrac{X-n\pi}{\sqrt{n\pi}}\right)^2 + \left(\dfrac{(n-X)-n(1-\pi)}{\sqrt{n(1-\pi)}}\right)^2\)

The benefit of writing it this way is to see how it can be generalized to the multinomial setting. That is, if \(X=(X_1,\ldots,X_k)\sim Mult(n,\pi)\), then

\(Q=\left(\dfrac{X_1-n\pi_1}{\sqrt{n\pi_1}}\right)^2 +\cdots+ \left(\dfrac{X_k-n\pi_k}{\sqrt{n\pi_k}}\right)^2\)

And \(Q\) has an approximate chi-square distribution with \(\nu=k-1\) degrees of freedom, provided the sample size is large. The usual condition to check for the sample size requirement is that all sample counts \(n\hat{\pi}_j\) are at least 5, although this is not a strict rule.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility