Lesson 1: Point Estimation
Overview
Suppose we have an unknown population parameter, such as a population mean \(\mu\) or a population proportion \(p\), which we'd like to estimate. For example, suppose we are interested in estimating:
 \(p\) = the (unknown) proportion of American college students, ages 18 to 24, who have a smartphone
 \(\mu\) = the (unknown) mean number of days it takes Alzheimer's patients to achieve certain milestones
In either case, we can't possibly survey the entire population. That is, we can't survey all American college students between the ages of 18 and 24. Nor can we survey all patients with Alzheimer's disease. So, of course, we do what comes naturally and take a random sample from the population, and use the resulting data to estimate the value of the population parameter. Of course, we want the estimate to be "good" in some way.
In this lesson, we'll learn two methods, namely the method of maximum likelihood and the method of moments, for deriving formulas for "good" point estimates for population parameters. We'll also learn one way of assessing whether a point estimate is "good." We'll do that by defining what it means for an estimate to be unbiased.
Objectives
 To learn how to find a maximum likelihood estimator of a population parameter.
 To learn how to find a method of moments estimator of a population parameter.
 To learn how to check to see if an estimator is unbiased for a particular parameter.
 To understand the steps involved in each of the proofs in the lesson.
 To be able to apply the methods learned in the lesson to new problems.
1.1  Definitions
We'll start the lesson with some formal definitions. In doing so, recall that we denote the \(n\) random variables arising from a random sample as subscripted uppercase letters:
\(X_1, X_2, \cdots, X_n\)
The corresponding observed values of a specific random sample are then denoted as subscripted lowercase letters:
\(x_1, x_2, \cdots, x_n\)
 Parameter Space
 The range of possible values of the parameter \(\theta\) is called the parameter space \(\Omega\) (the Greek letter "omega").
For example, if \(\mu\) denotes the mean grade point average of all college students, then the parameter space (assuming a 4-point grading scale) is:
\(\Omega=\{\mu: 0\le \mu\le 4\}\)
And, if \(p\) denotes the proportion of students who smoke cigarettes, then the parameter space is:
\(\Omega=\{p:0\le p\le 1\}\)
 Point Estimator
 The function of \(X_1, X_2, \cdots, X_n\), that is, the statistic \(u(X_1, X_2, \cdots, X_n)\), used to estimate \(\theta\) is called a point estimator of \(\theta\).
For example, the function:
\(\bar{X}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
is a point estimator of the population mean \(\mu\). The function:
\(\hat{p}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
(where \(X_i=0\text{ or }1)\) is a point estimator of the population proportion \(p\). And, the function:
\(S^2=\dfrac{1}{n-1}\sum\limits_{i=1}^n (X_i-\bar{X})^2\)
is a point estimator of the population variance \(\sigma^2\).
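The three estimators above are just arithmetic once we have data. As a quick illustrative sketch (the sample values below are made up, purely for illustration), each can be computed directly:

```python
# Hypothetical grade point averages (illustrative values, not real data)
x = [3.1, 2.8, 3.6, 3.3, 2.9]
n = len(x)

xbar = sum(x) / n                                 # point estimate of mu
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # sample variance S^2, note the n - 1

# Hypothetical 0/1 indicators (1 = has the attribute), for estimating p
y = [0, 1, 0, 0, 1]
phat = sum(y) / len(y)                            # point estimate of p
```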
 Point Estimate
 The function \(u(x_1, x_2, \cdots, x_n)\) computed from a set of data is an observed point estimate of \(\theta\).
For example, if \(x_i\) are the observed grade point averages of a sample of 88 students, then:
\(\bar{x}=\dfrac{1}{88}\sum\limits_{i=1}^{88} x_i=3.12\)
is a point estimate of \(\mu\), the mean grade point average of all the students in the population.
And, if \(x_i=0\) if a student has no tattoo, and \(x_i=1\) if a student has a tattoo, then:
\(\hat{p}=0.11\)
is a point estimate of \(p\), the proportion of all students in the population who have a tattoo.
Now, with the above definitions aside, let's go learn about the method of maximum likelihood.
1.2  Maximum Likelihood Estimation
Statement of the Problem
Suppose we have a random sample \(X_1, X_2, \cdots, X_n\) whose assumed probability distribution depends on some unknown parameter \(\theta\). Our primary goal here will be to find a point estimator \(u(X_1, X_2, \cdots, X_n)\), such that \(u(x_1, x_2, \cdots, x_n)\) is a "good" point estimate of \(\theta\), where \(x_1, x_2, \cdots, x_n\) are the observed values of the random sample. For example, if we plan to take a random sample \(X_1, X_2, \cdots, X_n\) for which the \(X_i\) are assumed to be normally distributed with mean \(\mu\) and variance \(\sigma^2\), then our goal will be to find a good estimate of \(\mu\), say, using the data \(x_1, x_2, \cdots, x_n\) that we obtained from our specific random sample.
The Basic Idea
It seems reasonable that a good estimate of the unknown parameter \(\theta\) would be the value of \(\theta\) that maximizes the probability, errrr... that is, the likelihood... of getting the data we observed. (So, do you see from where the name "maximum likelihood" comes?) So, that is, in a nutshell, the idea behind the method of maximum likelihood estimation. But how would we implement the method in practice? Well, suppose we have a random sample \(X_1, X_2, \cdots, X_n\) for which the probability density (or mass) function of each \(X_i\) is \(f(x_i;\theta)\). Then, the joint probability mass (or density) function of \(X_1, X_2, \cdots, X_n\), which we'll (not so arbitrarily) call \(L(\theta)\) is:
\(L(\theta)=P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n)=f(x_1;\theta)\cdot f(x_2;\theta)\cdots f(x_n;\theta)=\prod\limits_{i=1}^n f(x_i;\theta)\)
The first equality is of course just the definition of the joint probability mass function. The second equality comes from the fact that we have a random sample, which implies by definition that the \(X_i\) are independent. And, the last equality just uses the shorthand mathematical notation of a product of indexed terms. Now, in light of the basic idea of maximum likelihood estimation, one reasonable way to proceed is to treat the "likelihood function" \(L(\theta)\) as a function of \(\theta\), and find the value of \(\theta\) that maximizes it.
Is this still sounding like too much abstract gibberish? Let's take a look at an example to see if we can make it a bit more concrete.
Example 1-1
Suppose we have a random sample \(X_1, X_2, \cdots, X_n\) where:
 \(X_i=0\) if a randomly selected student does not own a sports car, and
 \(X_i=1\) if a randomly selected student does own a sports car.
Assuming that the \(X_i\) are independent Bernoulli random variables with unknown parameter \(p\), find the maximum likelihood estimator of \(p\), the proportion of students who own a sports car.
Answer
If the \(X_i\) are independent Bernoulli random variables with unknown parameter \(p\), then the probability mass function of each \(X_i\) is:
\(f(x_i;p)=p^{x_i}(1-p)^{1-x_i}\)
for \(x_i=0\) or 1 and \(0<p<1\). Therefore, the likelihood function \(L(p)\) is, by definition:
\(L(p)=\prod\limits_{i=1}^n f(x_i;p)=p^{x_1}(1-p)^{1-x_1}\times p^{x_2}(1-p)^{1-x_2}\times \cdots \times p^{x_n}(1-p)^{1-x_n}\)
for \(0<p<1\). Simplifying, by summing up the exponents, we get:
\(L(p)=p^{\sum x_i}(1-p)^{n-\sum x_i}\)
Now, in order to implement the method of maximum likelihood, we need to find the \(p\) that maximizes the likelihood \(L(p)\). We need to put on our calculus hats now since, in order to maximize the function, we are going to need to differentiate the likelihood function with respect to \(p\). In doing so, we'll use a "trick" that often makes the differentiation a bit easier. Note that the natural logarithm is an increasing function of \(x\):
That is, if \(x_1<x_2\), then \(\ln(x_1)<\ln(x_2)\). That means that the value of \(p\) that maximizes the natural logarithm of the likelihood function \(\ln L(p)\) is also the value of \(p\) that maximizes the likelihood function \(L(p)\). So, the "trick" is to take the derivative of \(\ln L(p)\) (with respect to \(p\)) rather than taking the derivative of \(L(p)\). Again, doing so often makes the differentiation much easier. (By the way, throughout the remainder of this course, I will use either \(\ln L(p)\) or \(\log L(p)\) to denote the natural logarithm of the likelihood function.)
In this case, the natural logarithm of the likelihood function is:
\(\text{log}L(p)=(\sum x_i)\text{log}(p)+(n-\sum x_i)\text{log}(1-p)\)
Now, taking the derivative of the log-likelihood, and setting it to 0, we get:
\(\displaystyle{\frac{\partial \log L(p)}{\partial p}=\frac{\sum x_{i}}{p}-\frac{n-\sum x_{i}}{1-p} \stackrel{SET}{\equiv} 0}\)
Now, multiplying through by \(p(1p)\), we get:
\((\sum x_i)(1-p)-(n-\sum x_i)p=0\)
Upon distribution, we see that two of the resulting terms cancel each other out:
\(\sum x_{i} - \color{red}\cancel {\color{black}p \sum x_{i}} \color{black} - n p+ \color{red}\cancel {\color{black} p \sum x_{i}} \color{black} = 0\)
leaving us with:
\(\sum x_i-np=0\)
Now, all we have to do is solve for \(p\). In doing so, you'll want to make sure that you always put a hat ("^") on the parameter, in this case, \(p\), to indicate it is an estimate:
\(\hat{p}=\dfrac{\sum\limits_{i=1}^n x_i}{n}\)
or, alternatively, an estimator:
\(\hat{p}=\dfrac{\sum\limits_{i=1}^n X_i}{n}\)
Oh, and we should technically verify that we indeed did obtain a maximum. We can do that by verifying that the second derivative of the log-likelihood with respect to \(p\) is negative. It is, but you might want to do the work to convince yourself!
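As a sanity check on the derivation, a short numerical sketch (with hypothetical 0/1 data) confirms that the closed-form answer \(\hat{p}=\sum x_i/n\) agrees with a brute-force maximization of \(\log L(p)\) over a fine grid:

```python
import math

# Hypothetical sample: 1 = owns a sports car, 0 = does not (illustrative data)
x = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
n, s = len(x), sum(x)

def log_likelihood(p):
    # log L(p) = (sum x_i) log(p) + (n - sum x_i) log(1 - p)
    return s * math.log(p) + (n - s) * math.log(1 - p)

p_hat = s / n  # closed-form MLE derived above

# Brute force: the grid maximizer should land on the closed-form answer
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)
```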
Now, with that example behind us, let us take a look at formal definitions of the terms:
 Likelihood function
 Maximum likelihood estimators
 Maximum likelihood estimates.
Definition. Let \(X_1, X_2, \cdots, X_n\) be a random sample from a distribution that depends on one or more unknown parameters \(\theta_1, \theta_2, \cdots, \theta_m\) with probability density (or mass) function \(f(x_i; \theta_1, \theta_2, \cdots, \theta_m)\). Suppose that \((\theta_1, \theta_2, \cdots, \theta_m)\) is restricted to a given parameter space \(\Omega\). Then:
1. When regarded as a function of \(\theta_1, \theta_2, \cdots, \theta_m\), the joint probability density (or mass) function of \(X_1, X_2, \cdots, X_n\):
\(L(\theta_1,\theta_2,\ldots,\theta_m)=\prod\limits_{i=1}^n f(x_i;\theta_1,\theta_2,\ldots,\theta_m)\)
(\((\theta_1, \theta_2, \cdots, \theta_m)\) in \(\Omega\)) is called the likelihood function.
2. If:
\([u_1(x_1,x_2,\ldots,x_n),u_2(x_1,x_2,\ldots,x_n),\ldots,u_m(x_1,x_2,\ldots,x_n)]\)
is the \(m\)-tuple that maximizes the likelihood function, then:
\(\hat{\theta}_i=u_i(X_1,X_2,\ldots,X_n)\)
is the maximum likelihood estimator of \(\theta_i\), for \(i=1, 2, \cdots, m\).
3. The corresponding observed values of the statistics in (2), namely:
\([u_1(x_1,x_2,\ldots,x_n),u_2(x_1,x_2,\ldots,x_n),\ldots,u_m(x_1,x_2,\ldots,x_n)]\)
are called the maximum likelihood estimates of \(\theta_i\), for \(i=1, 2, \cdots, m\).
Example 1-2
Suppose the weights of randomly selected American female college students are normally distributed with unknown mean \(\mu\) and standard deviation \(\sigma\). A random sample of 10 American female college students yielded the following weights (in pounds):
115 122 130 127 149 160 152 138 149 180
Based on the definitions given above, identify the likelihood function and the maximum likelihood estimator of \(\mu\), the mean weight of all American female college students. Using the given sample, find a maximum likelihood estimate of \(\mu\) as well.
Answer
The probability density function of \(X_i\) is:
\(f(x_i;\mu,\sigma^2)=\dfrac{1}{\sigma \sqrt{2\pi}}\text{exp}\left[-\dfrac{(x_i-\mu)^2}{2\sigma^2}\right]\)
for \(-\infty<x_i<\infty\). The parameter space is \(\Omega=\{(\mu, \sigma):-\infty<\mu<\infty \text{ and }0<\sigma<\infty\}\). Therefore, (you might want to convince yourself that) the likelihood function is:
\(L(\mu,\sigma)=\sigma^{-n}(2\pi)^{-n/2}\text{exp}\left[-\dfrac{1}{2\sigma^2}\sum\limits_{i=1}^n(x_i-\mu)^2\right]\)
for \(-\infty<\mu<\infty \text{ and }0<\sigma<\infty\). It can be shown (we'll do so in the next example!), upon maximizing the likelihood function with respect to \(\mu\), that the maximum likelihood estimator of \(\mu\) is:
\(\hat{\mu}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\)
Based on the given sample, a maximum likelihood estimate of \(\mu\) is:
\(\hat{\mu}=\dfrac{1}{n}\sum\limits_{i=1}^n x_i=\dfrac{1}{10}(115+\cdots+180)=142.2\)
pounds. Note that the only difference between the formulas for the maximum likelihood estimator and the maximum likelihood estimate is that:
 the estimator is defined using capital letters (to denote that its value is random), and
 the estimate is defined using lowercase letters (to denote that its value is fixed and based on an obtained sample)
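The arithmetic behind the estimate is a one-liner; this sketch reproduces it from the ten observed weights:

```python
# Observed weights (in pounds) from the sample of 10 students
weights = [115, 122, 130, 127, 149, 160, 152, 138, 149, 180]

mu_hat = sum(weights) / len(weights)  # maximum likelihood estimate of mu
print(mu_hat)  # → 142.2
```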
Okay, so now we have the formal definitions out of the way. The first example on this page involved a joint probability mass function that depends on only one parameter, namely \(p\), the proportion of successes. Now, let's take a look at an example that involves a joint probability density function that depends on two parameters.
Example 1-3
Let \(X_1, X_2, \cdots, X_n\) be a random sample from a normal distribution with unknown mean \(\mu\) and variance \(\sigma^2\). Find maximum likelihood estimators of mean \(\mu\) and variance \(\sigma^2\).
Answer
In finding the estimators, the first thing we'll do is write the probability density function as a function of \(\theta_1=\mu\) and \(\theta_2=\sigma^2\):
\(f(x_i;\theta_1,\theta_2)=\dfrac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\text{exp}\left[-\dfrac{(x_i-\theta_1)^2}{2\theta_2}\right]\)
for \(-\infty<\theta_1<\infty \text{ and }0<\theta_2<\infty\). We do this so as not to cause confusion when taking the derivative of the likelihood with respect to \(\sigma^2\). Now, that makes the likelihood function:
\( L(\theta_1,\theta_2)=\prod\limits_{i=1}^n f(x_i;\theta_1,\theta_2)=\theta^{-n/2}_2(2\pi)^{-n/2}\text{exp}\left[-\dfrac{1}{2\theta_2}\sum\limits_{i=1}^n(x_i-\theta_1)^2\right]\)
and therefore the log of the likelihood function:
\(\text{log} L(\theta_1,\theta_2)=-\dfrac{n}{2}\text{log}\theta_2-\dfrac{n}{2}\text{log}(2\pi)-\dfrac{\sum(x_i-\theta_1)^2}{2\theta_2}\)
Now, upon taking the partial derivative of the log likelihood with respect to \(\theta_1\), and setting to 0, we see that a few things cancel each other out, leaving us with:
\(\displaystyle{\frac{\partial \log L\left(\theta_{1}, \theta_{2}\right)}{\partial \theta_{1}}=\frac{\color{red}\cancel{\color{black}(-1)}\color{red} \cancel {\color{black}2} \color{black}\sum\left(x_{i}-\theta_{1}\right)\color{red}\cancel{\color{black}(-1)}\color{black}}{\color{red}\cancel{\color{black}2} \color{black} \theta_{2}} \stackrel{\text { SET }}{\equiv} 0}\)
Now, multiplying through by \(\theta_2\), and distributing the summation, we get:
\(\sum x_i-n\theta_1=0\)
Now, solving for \(\theta_1\), and putting on its hat, we have shown that the maximum likelihood estimate of \(\theta_1\) is:
\(\hat{\theta}_1=\hat{\mu}=\dfrac{\sum x_i}{n}=\bar{x}\)
Now for \(\theta_2\). Taking the partial derivative of the log likelihood with respect to \(\theta_2\), and setting to 0, we get:
\(\displaystyle{\frac{\partial \log L\left(\theta_{1}, \theta_{2}\right)}{\partial \theta_{2}}=-\frac{n}{2 \theta_{2}}+\frac{\sum\left(x_{i}-\theta_{1}\right)^{2}}{2 \theta_{2}^{2}} \stackrel{\text { SET }}{\equiv} 0}\)
Multiplying through by \(2\theta^2_2\):
\(\displaystyle{\left[-\frac{n}{2 \theta_{2}}+\frac{\sum\left(x_{i}-\theta_{1}\right)^{2}}{2 \theta_{2}^{2}} \stackrel{\text { SET }}{\equiv} 0\right] \times 2 \theta_{2}^{2}}\)
we get:
\(-n\theta_2+\sum(x_i-\theta_1)^2=0\)
And, solving for \(\theta_2\), and putting on its hat, we have shown that the maximum likelihood estimate of \(\theta_2\) is:
\(\hat{\theta}_2=\hat{\sigma}^2=\dfrac{\sum(x_i-\bar{x})^2}{n}\)
(I'll again leave it to you to verify, in each case, that the second partial derivative of the log likelihood is negative, and therefore that we did indeed find maxima.) In summary, we have shown that the maximum likelihood estimators of \(\mu\) and variance \(\sigma^2\) for the normal model are:
\(\hat{\mu}=\dfrac{\sum X_i}{n}=\bar{X}\) and \(\hat{\sigma}^2=\dfrac{\sum(X_i-\bar{X})^2}{n}\)
respectively.
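To double-check the two-parameter result numerically, we can evaluate the normal log-likelihood at \((\bar{x}, \hat{\sigma}^2)\) and at nearby points; nudging either parameter away from its MLE should only lower the log-likelihood. The data below are made up for illustration:

```python
import math

# Hypothetical data, assumed to come from a normal model
x = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9]
n = len(x)

mu_hat = sum(x) / n
var_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # MLE divides by n, not n - 1

def log_likelihood(mu, var):
    # log L = -(n/2) log(var) - (n/2) log(2 pi) - sum (x_i - mu)^2 / (2 var)
    return (-n / 2) * math.log(var) - (n / 2) * math.log(2 * math.pi) \
        - sum((xi - mu) ** 2 for xi in x) / (2 * var)

best = log_likelihood(mu_hat, var_hat)
# Any nudge away from the MLEs lowers the log-likelihood
for d_mu, d_var in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    assert log_likelihood(mu_hat + d_mu, var_hat + d_var) < best
```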
Note that the maximum likelihood estimator of \(\sigma^2\) for the normal model is not the sample variance \(S^2\). They are, in fact, competing estimators. So how do we know which estimator we should use for \(\sigma^2\) ? Well, one way is to choose the estimator that is "unbiased." Let's go learn about unbiased estimators now.
1.3  Unbiased Estimation
On the previous page, we showed that if \(X_i\) are Bernoulli random variables with parameter \(p\), then:
\(\hat{p}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
is the maximum likelihood estimator of \(p\). And, if \(X_i\) are normally distributed random variables with mean \(\mu\) and variance \(\sigma^2\), then:
\(\hat{\mu}=\dfrac{\sum X_i}{n}=\bar{X}\) and \(\hat{\sigma}^2=\dfrac{\sum(X_i-\bar{X})^2}{n}\)
are the maximum likelihood estimators of \(\mu\) and \(\sigma^2\), respectively. A natural question then is whether or not these estimators are "good" in any sense. One measure of "good" is "unbiasedness."
 Biased and Unbiased Estimators

If the following holds:
\(E[u(X_1,X_2,\ldots,X_n)]=\theta\)
then the statistic \(u(X_1,X_2,\ldots,X_n)\) is an unbiased estimator of the parameter \(\theta\). Otherwise, \(u(X_1,X_2,\ldots,X_n)\) is a biased estimator of \(\theta\).
Example 1-4
If \(X_i\) is a Bernoulli random variable with parameter \(p\), then:
\(\hat{p}=\dfrac{1}{n}\sum\limits_{i=1}^nX_i\)
is the maximum likelihood estimator (MLE) of \(p\). Is the MLE of \(p\) an unbiased estimator of \(p\)?
Answer
Recall that if \(X_i\) is a Bernoulli random variable with parameter \(p\), then \(E(X_i)=p\). Therefore:
\(E(\hat{p})=E\left(\dfrac{1}{n}\sum\limits_{i=1}^nX_i\right)=\dfrac{1}{n}\sum\limits_{i=1}^nE(X_i)=\dfrac{1}{n}\sum\limits_{i=1}^np=\dfrac{1}{n}(np)=p\)
The first equality holds because we've merely replaced \(\hat{p}\) with its definition. The second equality holds by the rules of expectation for a linear combination. The third equality holds because \(E(X_i)=p\). The fourth equality holds because when you add the value \(p\) up \(n\) times, you get \(np\). And, of course, the last equality is simple algebra.
In summary, we have shown that:
\(E(\hat{p})=p\)
Therefore, the maximum likelihood estimator is an unbiased estimator of \(p\).
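Unbiasedness is a statement about the average behavior of the estimator over repeated samples, so a quick simulation illustrates it nicely (this is a sketch, not a proof; the choices of \(p\), \(n\), and the number of replications are arbitrary):

```python
import random

random.seed(0)
p, n, reps = 0.3, 25, 20000

# Average p-hat over many simulated Bernoulli samples
total = 0.0
for _ in range(reps):
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    total += sum(sample) / n

mean_phat = total / reps  # should land very close to the true p = 0.3
```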
Example 1-5
If \(X_i\) are normally distributed random variables with mean \(\mu\) and variance \(\sigma^2\), then:
\(\hat{\mu}=\dfrac{\sum X_i}{n}=\bar{X}\) and \(\hat{\sigma}^2=\dfrac{\sum(X_i-\bar{X})^2}{n}\)
are the maximum likelihood estimators of \(\mu\) and \(\sigma^2\), respectively. Are the MLEs unbiased for their respective parameters?
Answer
Recall that if \(X_i\) is a normally distributed random variable with mean \(\mu\) and variance \(\sigma^2\), then \(E(X_i)=\mu\) and \(\text{Var}(X_i)=\sigma^2\). Therefore:
\(E(\bar{X})=E\left(\dfrac{1}{n}\sum\limits_{i=1}^nX_i\right)=\dfrac{1}{n}\sum\limits_{i=1}^nE(X_i)=\dfrac{1}{n}\sum\limits_{i=1}^n\mu=\dfrac{1}{n}(n\mu)=\mu\)
The first equality holds because we've merely replaced \(\bar{X}\) with its definition. Again, the second equality holds by the rules of expectation for a linear combination. The third equality holds because \(E(X_i)=\mu\). The fourth equality holds because when you add the value \(\mu\) up \(n\) times, you get \(n\mu\). And, of course, the last equality is simple algebra.
In summary, we have shown that:
\(E(\bar{X})=\mu\)
Therefore, the maximum likelihood estimator of \(\mu\) is unbiased. Now, let's check the maximum likelihood estimator of \(\sigma^2\). First, note that we can rewrite the formula for the MLE as:
\(\hat{\sigma}^2=\left(\dfrac{1}{n}\sum\limits_{i=1}^nX_i^2\right)-\bar{X}^2\)
because:
\(\displaystyle{\begin{aligned}
\hat{\sigma}^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}^{2}-2 x_{i} \bar{x}+\bar{x}^{2}\right) \\
&=\frac{1}{n} \sum_{i=1}^{n} x_{i}^{2}-2 \bar{x} \cdot \color{blue}\underbrace{\color{black}\frac{1}{n} \sum x_{i}}_{\bar{x}} \color{black} + \frac{1}{\color{blue}\cancel{\color{black} n}}\left(\color{blue}\cancel{\color{black}n} \color{black}\bar{x}^{2}\right) \\
&=\frac{1}{n} \sum_{i=1}^{n} x_{i}^{2}-\bar{x}^{2}
\end{aligned}}\)
Then, taking the expectation of the MLE, we get:
\(E(\hat{\sigma}^2)=\dfrac{(n-1)\sigma^2}{n}\)
as illustrated here:
\begin{align} E(\hat{\sigma}^2) &= E\left[\dfrac{1}{n}\sum\limits_{i=1}^nX_i^2-\bar{X}^2\right]=\left[\dfrac{1}{n}\sum\limits_{i=1}^nE(X_i^2)\right]-E(\bar{X}^2)\\ &= \dfrac{1}{n}\sum\limits_{i=1}^n(\sigma^2+\mu^2)-\left(\dfrac{\sigma^2}{n}+\mu^2\right)\\ &= \dfrac{1}{n}(n\sigma^2+n\mu^2)-\dfrac{\sigma^2}{n}-\mu^2\\ &= \sigma^2-\dfrac{\sigma^2}{n}=\dfrac{n\sigma^2-\sigma^2}{n}=\dfrac{(n-1)\sigma^2}{n}\\ \end{align}
The first equality holds from the rewritten form of the MLE. The second equality holds from the properties of expectation. The third equality holds from manipulating the alternative formulas for the variance, namely:
\(Var(X)=\sigma^2=E(X^2)-\mu^2\) and \(Var(\bar{X})=\dfrac{\sigma^2}{n}=E(\bar{X}^2)-\mu^2\)
The remaining equalities hold from simple algebraic manipulation. Now, because we have shown:
\(E(\hat{\sigma}^2) \neq \sigma^2\)
the maximum likelihood estimator of \(\sigma^2\) is a biased estimator.
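The bias is easy to see in simulation. With \(\sigma^2=4\) and \(n=5\), the result above says the MLE averages to \((n-1)\sigma^2/n=3.2\) rather than 4; a sketch (with arbitrary simulation settings) bears this out:

```python
import random

random.seed(1)
mu, sigma2, n, reps = 0.0, 4.0, 5, 40000
sigma = sigma2 ** 0.5

total = 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    total += sum((xi - xbar) ** 2 for xi in x) / n  # MLE of sigma^2 (divides by n)

mean_mle = total / reps  # close to (n - 1) * sigma2 / n = 3.2, not 4.0
```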
Example 1-6
If \(X_i\) are normally distributed random variables with mean \(\mu\) and variance \(\sigma^2\), what is an unbiased estimator of \(\sigma^2\)? Is \(S^2\) unbiased?
Answer
Recall that if \(X_i\) is a normally distributed random variable with mean \(\mu\) and variance \(\sigma^2\), then:
\(\dfrac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1}\)
Also, recall that the expected value of a chi-square random variable is its degrees of freedom. That is, if:
\(X \sim \chi^2_{(r)}\)
then \(E(X)=r\). Therefore:
\(E(S^2)=E\left[\dfrac{\sigma^2}{n-1}\cdot \dfrac{(n-1)S^2}{\sigma^2}\right]=\dfrac{\sigma^2}{n-1} E\left[\dfrac{(n-1)S^2}{\sigma^2}\right]=\dfrac{\sigma^2}{n-1}\cdot (n-1)=\sigma^2\)
The first equality holds because we effectively multiplied the sample variance by 1. The second equality holds by the law of expectation that tells us we can pull a constant through the expectation. The third equality holds because of the two facts we recalled above. That is:
\(E\left[\dfrac{(n-1)S^2}{\sigma^2}\right]=n-1\)
And, the last equality is again simple algebra.
In summary, we have shown that, if \(X_i\) is a normally distributed random variable with mean \(\mu\) and variance \(\sigma^2\), then \(S^2\) is an unbiased estimator of \(\sigma^2\). It turns out, however, that \(S^2\) is always an unbiased estimator of \(\sigma^2\), that is, for any model, not just the normal model. (You'll be asked to show this in the homework.) And, although \(S^2\) is always an unbiased estimator of \(\sigma^2\), \(S\) is not an unbiased estimator of \(\sigma\). (You'll be asked to show this in the homework, too.)
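A simulation sketch (again with arbitrary settings, and no substitute for the homework proofs) shows both behaviors at once: the average of \(S^2\) lands near \(\sigma^2\), while the average of \(S\) falls short of \(\sigma\):

```python
import random

random.seed(2)
mu, sigma, n, reps = 10.0, 3.0, 8, 30000   # so sigma^2 = 9

sum_s2, sum_s = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # S^2 divides by n - 1
    sum_s2 += s2
    sum_s += s2 ** 0.5                                 # S = sqrt(S^2)

mean_s2 = sum_s2 / reps  # approximately sigma^2 = 9 (unbiased)
mean_s = sum_s / reps    # noticeably below sigma = 3 (biased)
```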
Sometimes it is impossible to find maximum likelihood estimators in a convenient closed form. Instead, numerical methods must be used to maximize the likelihood function. In such cases, we might consider using an alternative method of finding estimators, such as the "method of moments." Let's go take a look at that method now.
1.4  Method of Moments
In short, the method of moments involves equating sample moments with theoretical moments. So, let's start by making sure we recall the definitions of theoretical moments, as well as learn the definitions of sample moments.
Definitions.
 \(E(X^k)\) is the \(k^{th}\) (theoretical) moment of the distribution (about the origin), for \(k=1, 2, \ldots\)
 \(E\left[(X-\mu)^k\right]\) is the \(k^{th}\) (theoretical) moment of the distribution (about the mean), for \(k=1, 2, \ldots\)
 \(M_k=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^k\) is the \(k^{th}\) sample moment, for \(k=1, 2, \ldots\)
 \(M_k^\ast =\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^k\) is the \(k^{th}\) sample moment about the mean, for \(k=1, 2, \ldots\)
One Form of the Method
The basic idea behind this form of the method is to:
 Equate the first sample moment about the origin \(M_1=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\) to the first theoretical moment \(E(X)\).
 Equate the second sample moment about the origin \(M_2=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2\) to the second theoretical moment \(E(X^2)\).
 Continue equating sample moments about the origin, \(M_k\), with the corresponding theoretical moments \(E(X^k), \; k=3, 4, \ldots\) until you have as many equations as you have parameters.
 Solve for the parameters.
The resulting values are called method of moments estimators. It seems reasonable that this method would provide good estimates, since the empirical distribution converges in some sense to the probability distribution. Therefore, the corresponding moments should be about equal.
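As a concrete sketch of the steps for a one-parameter model not among this lesson's examples: suppose we assume an exponential model with rate \(\lambda\), whose first theoretical moment is \(E(X)=1/\lambda\). Equating that to \(\bar{X}\) and solving gives \(\hat{\lambda}_{MM}=1/\bar{X}\). (The data below are hypothetical.)

```python
# Step 1: compute the first sample moment M_1 = x-bar
x = [0.8, 2.1, 0.4, 1.6, 1.1]   # hypothetical observed sample
xbar = sum(x) / len(x)

# Step 2: equate E(X) = 1 / lambda to M_1 and solve for the parameter
lam_hat = 1 / xbar              # method of moments estimator of lambda
```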
Example 1-7
Let \(X_1, X_2, \ldots, X_n\) be Bernoulli random variables with parameter \(p\). What is the method of moments estimator of \(p\)?
Answer
Here, the first theoretical moment about the origin is:
\(E(X_i)=p\)
We have just one parameter for which we are trying to derive the method of moments estimator. Therefore, we need just one equation. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\(p=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
Now, we just have to solve for \(p\). Whoops! In this case, the equation is already solved for \(p\). Our work is done! We just need to put a hat (^) on the parameter to make it clear that it is an estimator. We can also subscript the estimator with an "MM" to indicate that the estimator is the method of moments estimator:
\(\hat{p}_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
So, in this case, the method of moments estimator is the same as the maximum likelihood estimator, namely, the sample proportion.
Example 1-8
Let \(X_1, X_2, \ldots, X_n\) be normal random variables with mean \(\mu\) and variance \(\sigma^2\). What are the method of moments estimators of the mean \(\mu\) and variance \(\sigma^2\)?
Answer
The first and second theoretical moments about the origin are:
\(E(X_i)=\mu\qquad E(X_i^2)=\sigma^2+\mu^2\)
(Incidentally, in case it's not obvious, that second moment can be derived from manipulating the shortcut formula for the variance.) In this case, we have two parameters for which we are trying to derive method of moments estimators. Therefore, we need two equations here. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\(E(X)=\mu=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
And, equating the second theoretical moment about the origin with the corresponding sample moment, we get:
\(E(X^2)=\sigma^2+\mu^2=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2\)
Now, the first equation tells us that the method of moments estimator for the mean \(\mu\) is the sample mean:
\(\hat{\mu}_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\)
And, substituting the sample mean in for \(\mu\) in the second equation and solving for \(\sigma^2\), we get that the method of moments estimator for the variance \(\sigma^2\) is:
\(\hat{\sigma}^2_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2-\mu^2=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2-\bar{X}^2\)
which can be rewritten as:
\(\hat{\sigma}^2_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n( X_i-\bar{X})^2\)
Again, for this example, the method of moments estimators are the same as the maximum likelihood estimators.
In some cases, rather than using the sample moments about the origin, it is easier to use the sample moments about the mean. Doing so provides us with an alternative form of the method of moments.
Another Form of the Method
The basic idea behind this form of the method is to:
 Equate the first sample moment about the origin \(M_1=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\) to the first theoretical moment \(E(X)\).
 Equate the second sample moment about the mean \(M_2^\ast=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\) to the second theoretical moment about the mean \(E[(X-\mu)^2]\).
 Continue equating sample moments about the mean \(M^\ast_k\) with the corresponding theoretical moments about the mean \(E[(X-\mu)^k]\), \(k=3, 4, \ldots\) until you have as many equations as you have parameters.
 Solve for the parameters.
Again, the resulting values are called method of moments estimators.
Example 1-9
Let \(X_1, X_2, \dots, X_n\) be gamma random variables with parameters \(\alpha\) and \(\theta\), so that the probability density function is:
\(f(x_i)=\dfrac{1}{\Gamma(\alpha) \theta^\alpha}x_i^{\alpha-1}e^{-x_i/\theta}\)
for \(x_i>0\). Therefore, the likelihood function:
\(L(\alpha,\theta)=\left(\dfrac{1}{\Gamma(\alpha) \theta^\alpha}\right)^n (x_1x_2\ldots x_n)^{\alpha-1}\text{exp}\left[-\dfrac{1}{\theta}\sum x_i\right]\)
is difficult to differentiate because of the gamma function \(\Gamma(\alpha)\). So, rather than finding the maximum likelihood estimators, what are the method of moments estimators of \(\alpha\) and \(\theta\)?
Answer
The first theoretical moment about the origin is:
\(E(X_i)=\alpha\theta\)
And the second theoretical moment about the mean is:
\(\text{Var}(X_i)=E\left[(X_i-\mu)^2\right]=\alpha\theta^2\)
Again, since we have two parameters for which we are trying to derive method of moments estimators, we need two equations. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\(E(X)=\alpha\theta=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\)
And, equating the second theoretical moment about the mean with the corresponding sample moment, we get:
\(\text{Var}(X)=\alpha\theta^2=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\)
Now, we just have to solve for the two parameters \(\alpha\) and \(\theta\). Let's start by solving for \(\alpha\) in the first equation \((E(X))\). Doing so, we get:
\(\alpha=\dfrac{\bar{X}}{\theta}\)
Now, substituting \(\alpha=\dfrac{\bar{X}}{\theta}\) into the second equation (\(\text{Var}(X)\)), we get:
\(\alpha\theta^2=\left(\dfrac{\bar{X}}{\theta}\right)\theta^2=\bar{X}\theta=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\)
Now, solving for \(\theta\) in that last equation, and putting on its hat, we get that the method of moments estimator for \(\theta\) is:
\(\hat{\theta}_{MM}=\dfrac{1}{n\bar{X}}\sum\limits_{i=1}^n (X_i-\bar{X})^2\)
And, substituting that value of \(\theta\) back into the equation we have for \(\alpha\), and putting on its hat, we get that the method of moments estimator for \(\alpha\) is:
\(\hat{\alpha}_{MM}=\dfrac{\bar{X}}{\hat{\theta}_{MM}}=\dfrac{\bar{X}}{(1/n\bar{X})\sum\limits_{i=1}^n (X_i-\bar{X})^2}=\dfrac{n\bar{X}^2}{\sum\limits_{i=1}^n (X_i-\bar{X})^2}\)
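Putting the two gamma estimators to work on data requires nothing more than the two sample moments; here is a sketch with hypothetical positive-valued observations:

```python
# Hypothetical positive observations, assumed to follow a gamma model
x = [2.3, 5.1, 3.8, 4.4, 1.9, 6.2, 3.0, 4.7]
n = len(x)

xbar = sum(x) / n
m2_star = sum((xi - xbar) ** 2 for xi in x) / n  # 2nd sample moment about the mean

theta_hat = m2_star / xbar    # from alpha * theta^2 = m2* and alpha * theta = x-bar
alpha_hat = xbar / theta_hat  # equals n * x-bar^2 / sum (x_i - x-bar)^2
```

By construction, the fitted pair reproduces both matched moments: \(\hat{\alpha}\hat{\theta}=\bar{x}\) and \(\hat{\alpha}\hat{\theta}^2=M_2^\ast\).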
Example 1-10
Let's return to the example in which \(X_1, X_2, \ldots, X_n\) are normal random variables with mean \(\mu\) and variance \(\sigma^2\). What are the method of moments estimators of the mean \(\mu\) and variance \(\sigma^2\)?
Answer
The first theoretical moment about the origin is:
\(E(X_i)=\mu\)
And, the second theoretical moment about the mean is:
\(\text{Var}(X_i)=E\left[(X_i-\mu)^2\right]=\sigma^2\)
Again, since we have two parameters for which we are trying to derive method of moments estimators, we need two equations. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\(E(X)=\mu=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\)
And, equating the second theoretical moment about the mean with the corresponding sample moment, we get:
\(\sigma^2=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\)
Now, we just have to solve for the two parameters. Oh! Well, in this case, the equations are already solved for \(\mu\) and \(\sigma^2\). Our work is done! We just need to put a hat (^) on the parameters to make it clear that they are estimators. Doing so, we get that the method of moments estimator of \(\mu\) is:
\(\hat{\mu}_{MM}=\bar{X}\)
(which we know, from our previous work, is unbiased). The method of moments estimator of \(\sigma^2\) is:
\(\hat{\sigma}^2_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\)
(which we know, from our previous work, is biased). This example, in conjunction with the second example, illustrates how the two different forms of the method can require varying amounts of work depending on the situation.