1.5 - Maximum Likelihood Estimation

One of the most fundamental concepts of modern statistics is that of likelihood. In each of the discrete random variables we have considered thus far, the distribution depends on one or more parameters that are, in most statistical applications, unknown. In the Poisson distribution, the parameter is \(\lambda\). In the binomial, the parameter of interest is \(\pi\) (since \(n\) is typically fixed and known).

The likelihood function is essentially the distribution of a random variable (or joint distribution of all values if a sample of the random variable is obtained) viewed as a function of the parameter(s). The reason for viewing it this way is that the data values will be observed and can be substituted in, and the value of the unknown parameter that maximizes this likelihood function can then be found. The intuition is that this maximizing value is the one that makes our observed data most likely.

Bernoulli and Binomial Likelihoods

Consider a random sample of \(n\) Bernoulli random variables, \(X_1,\ldots,X_n\), each with PMF

\(f(x)=\pi^x(1-\pi)^{1-x}\qquad x=0,1\)

The likelihood function is the joint distribution of these sample values, which, by independence, we can write as the product of the individual PMFs:

\(\ell(\pi)=f(x_1,\ldots,x_n;\pi)=\pi^{\sum_i x_i}(1-\pi)^{n-\sum_i x_i}\)

We interpret \(\ell(\pi)\) as the probability of observing \(X_1,\ldots,X_n\) as a function of \(\pi\), and the maximum likelihood estimate (MLE) of \(\pi\) is the value of \(\pi\) that maximizes this probability function. Equivalently, \(L(\pi)=\log\ell(\pi)\) is maximized at the same value and can be used interchangeably; more often than not, the loglikelihood function is easier to work with.
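As a minimal sketch of these definitions, the R code below evaluates the Bernoulli likelihood and loglikelihood on a grid and locates the maximizer of each. The sample `x` and the function names `bern_lik` and `bern_loglik` are made up for illustration and are not part of the course data.

```r
# Hypothetical sample of n = 10 Bernoulli outcomes (illustrative only)
x <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0)
n <- length(x)

# Likelihood: pi^(sum x) * (1 - pi)^(n - sum x)
bern_lik <- function(p) p^sum(x) * (1 - p)^(n - sum(x))

# Loglikelihood: the log of the same expression
bern_loglik <- function(p) sum(x) * log(p) + (n - sum(x)) * log(1 - p)

# Evaluate both on a grid strictly inside (0, 1)
p_grid <- seq(0.01, 0.99, by = 0.01)
p_max_lik <- p_grid[which.max(bern_lik(p_grid))]
p_max_loglik <- p_grid[which.max(bern_loglik(p_grid))]

c(likelihood = p_max_lik, loglikelihood = p_max_loglik)
```

Because the logarithm is strictly increasing, the two grid maximizers coincide, which is why either function can be used to find the MLE.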

You may have noticed that the likelihood function for the sample of Bernoulli random variables depends only on their sum, which we can write as \(Y=\sum_i X_i\). Since \(Y\) has a binomial distribution with \(n\) trials and success probability \(\pi\), we can write its log likelihood function as

\(\displaystyle L(\pi) = \log\left[ {n\choose y} \pi^y(1 - \pi)^{n-y}\right]\)

The only difference between this log likelihood function and that for the Bernoulli sample is the presence of the binomial coefficient \({n\choose y}\). But since that doesn't depend on \(\pi\), it has no influence on the MLE and may be neglected.
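A quick numerical check of this point, offered as a sketch with hypothetical counts `y` and `n`: the binomial loglikelihood (computed with `dbinom(..., log = TRUE)`) and the Bernoulli-sample loglikelihood differ by the same constant at every value of \(\pi\), so they share a maximizer.

```r
# Hypothetical counts (illustrative only)
y <- 4
n <- 10

# Binomial loglikelihood, which includes log choose(n, y)
loglik_binom <- function(p) dbinom(y, n, p, log = TRUE)

# Bernoulli-sample loglikelihood, which omits the binomial coefficient
loglik_bern <- function(p) y * log(p) + (n - y) * log(1 - p)

# The gap between the two is the same at every grid point
p_grid <- seq(0.05, 0.95, by = 0.05)
gap <- loglik_binom(p_grid) - loglik_bern(p_grid)

range(gap)      # smallest and largest gap over the grid
lchoose(n, y)   # log of the binomial coefficient, for comparison
```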

With a little calculus (taking the derivative with respect to \(\pi\)), we can show that the value of \(\pi\) that maximizes the likelihood (and log likelihood) function is \(Y/n\), which we denote as the MLE \(\hat{\pi}\). Not surprisingly, this is the familiar sample proportion of successes that intuitively makes sense as a good estimate for the population proportion.
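The calculus step can be written out as follows, assuming \(0<y<n\) so that the maximum lies in the interior of \((0,1)\). Differentiating the loglikelihood with respect to \(\pi\) gives

\(\displaystyle \dfrac{dL}{d\pi}=\dfrac{y}{\pi}-\dfrac{n-y}{1-\pi}\)

Setting this equal to 0 and solving,

\(\displaystyle y(1-\pi)=(n-y)\pi \quad\Longrightarrow\quad y=n\pi \quad\Longrightarrow\quad \hat{\pi}=\dfrac{y}{n}\)

The second derivative, \(-y/\pi^2-(n-y)/(1-\pi)^2\), is negative for every \(\pi\) in \((0,1)\), so this critical point is a maximum.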

Example: Binomial Example 1

If, in our earlier binomial sample of 20 smartphone users, we observe 8 that use Android, the MLE for \(\pi\) is then \(8/20=0.4\). The plot below illustrates this maximizing value for both the likelihood and loglikelihood functions. The "dbinom" function is the PMF for the binomial distribution.

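# Plot the binomial likelihood and loglikelihood for y successes in n trials,
# mark the MLE on both panels, and return its numerical value
likeli.plot = function(y,n)
{
    L = function(p) dbinom(y,n,p)    # likelihood as a function of p
    # optimize() stores the maximizing argument in the $maximum component
    mle = optimize(L, interval=c(0,1), maximum=TRUE)$maximum
    p = (1:99)/100                   # grid inside (0,1); avoids log(0) at p = 1
    par(mfrow=c(2,1))                # two stacked panels
    plot(p, L(p), type='l')          # likelihood
    abline(v=mle)
    plot(p, log(L(p)), type='l')     # loglikelihood
    abline(v=mle)
    mle
}
likeli.plot(8,20)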

Figure 1.7: Likelihood and loglikelihood plots for \(y=8\) and \(n=20\)

Example: Binomial Example 2

We know that the likelihood function achieves its maximum value at the MLE, but how is the sample size related to the shape? Suppose that we observe \(X = 1\) from a binomial distribution with \(n = 4\) and unknown success probability \(\pi\). The MLE is then \(\hat{\pi}=1/4=0.25\), and the graph of the loglikelihood function looks like this.

Figure 1.8: Loglikelihood plot for \(n=4\) and \(\hat{\pi}=0.25\)

Here is the program for creating this plot in SAS.


data for_plot;
do x=0.01 to 0.8 by 0.01;   *x plays the role of pi;
y=log(x)+3*log(1-x);   *the log-likelihood function for 1 success in 4 trials (constant term omitted);
output;
end;
run;

/*plot options*/
goptions reset=all colors=(black);
symbol1 i=spline line=1;
axis1 order=(0 to 1.0 by 0.2);

proc gplot data=for_plot;
plot y*x / haxis=axis1  ;
run;

quit;

Now suppose that we observe \(X = 10\) from a binomial distribution with \(n = 40\). The MLE is again \(\hat{\pi}=10/40=0.25\), but the loglikelihood is now narrower:

Figure 1.9: Loglikelihood plot for \(n=40\) and \(\hat{\pi}=0.25\)

Finally, suppose that we observe \(X = 100\) from a binomial with \(n = 400\). The MLE is still \(\hat{\pi}=100/400=0.25\), but the loglikelihood is now narrower still:

Figure 1.10: Loglikelihood plot for \(n=400\) and \(\hat{\pi}=0.25\)

As \(n\) gets larger, we observe that \(L(\pi)\) becomes more sharply peaked around the MLE \(\hat{\pi}\), which indicates that the data carry more information about \(\pi\) and that the true parameter is likely to lie close to \(\hat{\pi}\). If the loglikelihood is highly peaked (that is, if it drops sharply as we move away from the MLE), then the evidence is strong that \(\pi\) is near the MLE. A flatter loglikelihood, on the other hand, means that more values are plausible.
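One way to put a number on this peakedness is to measure how far the loglikelihood falls when we move a fixed distance away from the MLE. The R sketch below does this for the three samples above; the comparison point 0.35 and the function name `loglik` are arbitrary choices made for illustration.

```r
# Loglikelihood without the constant binomial coefficient
loglik <- function(p, y, n) y * log(p) + (n - y) * log(1 - p)

# The three (y, n) pairs from the example; each has MLE y/n = 0.25
y <- c(1, 10, 100)
n <- c(4, 40, 400)

# Drop in loglikelihood when moving from the MLE (0.25) to 0.35
drop <- loglik(0.25, y, n) - loglik(0.35, y, n)

data.frame(n = n, mle = y / n, drop = drop)
```

Because all three samples share the same proportion \(y/n\), the drop grows in direct proportion to \(n\): a value such as 0.35 that is quite plausible when \(n=4\) becomes much less plausible when \(n=400\).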

Poisson Likelihood

Suppose that \(X = (X_1, X_2, \dots, X_n)\) are iid observations from a Poisson distribution with unknown parameter \(\lambda\). The likelihood function is

\begin{aligned} \ell(\lambda) =\prod\limits_{i=1}^{n} f\left(x_{i} ; \lambda\right) =\prod\limits_{i=1}^{n} \dfrac{\lambda^{x_{i}} e^{-\lambda}}{x_{i} !} =\dfrac{\lambda^{\sum_i x_{i}} e^{-n \lambda}}{x_{1} ! x_{2} ! \cdots x_{n} !} \end{aligned}

The corresponding loglikelihood function is

\(L(\lambda)=\log\ell(\lambda)=\sum\limits_{i=1}^{n} x_i\log\lambda-n\lambda-\sum\limits_{i=1}^{n} \log (x_i!)\)

The MLE for \(\lambda\) can then be found by maximizing either of these with respect to \(\lambda\). Setting the first derivative of the loglikelihood equal to 0 gives the solution:

\(\hat{\lambda}=\sum\limits_{i=1}^{n} \dfrac{x_i}{n}\).
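For completeness, the derivative step can be written out as follows, assuming \(\sum_i x_i>0\). Differentiating the loglikelihood with respect to \(\lambda\) and setting the result to 0,

\(\displaystyle \dfrac{d}{d\lambda}\left[\sum\limits_{i=1}^{n} x_i\log\lambda-n\lambda-\sum\limits_{i=1}^{n}\log (x_i!)\right]=\dfrac{\sum_i x_i}{\lambda}-n=0 \quad\Longrightarrow\quad \lambda=\dfrac{1}{n}\sum\limits_{i=1}^{n} x_i\)

The second derivative, \(-\sum_i x_i/\lambda^2\), is negative for all \(\lambda>0\), so this critical point is a maximum.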

Thus, for a Poisson sample, the MLE for \(\lambda\) is just the sample mean.
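This result can be checked numerically. The R sketch below maximizes the Poisson loglikelihood with `optimize` and compares the answer with the sample mean. The counts in `x`, the function name `pois_loglik`, and the search interval are assumptions chosen for illustration.

```r
# Hypothetical Poisson counts (illustrative only)
x <- c(2, 0, 3, 1, 4, 2, 1, 3)

# Loglikelihood: sum of log Poisson PMF values at a given lambda
pois_loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

# Numerical maximization over a bounded interval; the upper bound is an
# arbitrary choice that comfortably exceeds the sample mean
fit <- optimize(pois_loglik, interval = c(0.01, 20), maximum = TRUE)

c(numerical = fit$maximum, sample_mean = mean(x))
```

The two values should agree up to the tolerance of the numerical optimizer.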
