Section 5: Distributions of Functions of Random Variables


As the name of this section suggests, we will now spend some time learning how to find the probability distribution of functions of random variables. For example, we might know the probability density function of X, but want to know instead the probability density function of u(X)=X2. We'll learn several different techniques for finding the distribution of functions of random variables, including the distribution function technique, the change-of-variable technique and the moment-generating function technique.

The more important functions of random variables that we'll explore will be those involving random variables that are independent and identically distributed. For example, if X1 is the weight of a randomly selected individual from the population of males, X2 is the weight of another randomly selected individual from the population of males, ..., and Xn is the weight of yet another randomly selected individual from the population of males, then we might be interested in learning how the random function:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

is distributed. We'll first learn how X¯ is distributed assuming that the Xi's are normally distributed. Then, we'll strip away the assumption of normality, and use a classic theorem, called the Central Limit Theorem, to show that, for large n, the function:

\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}

approximately follows the standard normal distribution. Finally, we'll use the Central Limit Theorem and the normal distribution to approximate discrete distributions, such as the binomial distribution and the Poisson distribution.


Lesson 22: Functions of One Random Variable


Overview

We'll begin our exploration of the distributions of functions of random variables, by focusing on simple functions of one random variable. For example, if X is a continuous random variable, and we take a function of X, say:

Y=u(X)

then Y is also a continuous random variable that has its own probability distribution. We'll learn how to find the probability density function of Y, using two different techniques, namely the distribution function technique and the change-of-variable technique. At first, we'll focus only on one-to-one functions. Then, once we have that mastered, we'll learn how to modify the change-of-variable technique to find the probability distribution of a random variable that is derived from a two-to-one function. Finally, we'll learn how the inverse of a cumulative distribution function can help us simulate random numbers that follow a particular probability distribution.

Objectives

Upon completion of this lesson, you should be able to:

  • To learn how to use the distribution function technique to find the probability distribution of Y=u(X), a one-to-one transformation of a random variable X.
  • To learn how to use the change-of-variable technique to find the probability distribution of Y=u(X), a one-to-one transformation of a random variable X.
  • To learn how to use the change-of-variable technique to find the probability distribution of Y=u(X), a two-to-one transformation of a random variable X.
  • To learn how to use a cumulative distribution function to simulate random numbers that follow a particular probability distribution.
  • To understand all of the proofs in the lesson.
  • To be able to apply the methods learned in the lesson to new problems.

22.1 - Distribution Function Technique


You might not have been aware of it at the time, but we have already used the distribution function technique at least twice in this course to find the probability density function of a function of a random variable. For example, we used the distribution function technique to show that:

Z = \frac{X - \mu}{\sigma}

follows a standard normal distribution when X is normally distributed with mean μ and standard deviation σ. And, we used the distribution function technique to show that, when Z follows the standard normal distribution:

Z^2

follows the chi-square distribution with 1 degree of freedom. In summary, we used the distribution function technique to find the p.d.f. of the random function Y=u(X) by:

  1. First, finding the cumulative distribution function:

    F_Y(y) = P(Y \le y)

  2. Then, differentiating the cumulative distribution function F(y) to get the probability density function f(y). That is:

    f_Y(y) = F_Y'(y)

Now that we've officially stated the distribution function technique, let's take a look at a few more examples.

Example 22-1

Let X be a continuous random variable with the following probability density function:

[Figure: graph of the function y = x² over 0 < x < 1]

 

f(x) = 3x^2

for 0<x<1. What is the probability density function of Y=X2?

Solution

If you look at the graph of the function (above and to the right) of Y = X^2, you might note that (1) the function is an increasing function of X, and (2) 0<y<1. That noted, let's now use the distribution function technique to find the p.d.f. of Y. First, we find the cumulative distribution function of Y:

F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(X \le \sqrt{y}) = \int_0^{\sqrt{y}} 3x^2\, dx = \left[x^3\right]_0^{\sqrt{y}} = y^{3/2}
Having shown that the cumulative distribution function of Y is:

F_Y(y) = y^{3/2}

for 0<y<1, we now just need to differentiate F(y) to get the probability density function f(y). Doing so, we get:

f_Y(y) = F_Y'(y) = \frac{3}{2} y^{1/2}

for 0<y<1. Our calculation is complete! We have successfully used the distribution function technique to find the p.d.f of Y, when Y was an increasing function of X. (By the way, you might find it reassuring to verify that f(y) does indeed integrate to 1 over the support of y. In general, that's not a bad thing to check.)
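
If you'd like to check the result numerically rather than by integration, here is a minimal Python sketch (not part of the original example): it simulates X from f(x) = 3x², squares it, and compares the empirical c.d.f. of Y with the derived F_Y(y) = y^{3/2}. The sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw X from f(x) = 3x^2 on (0,1) via the inverse CDF: F_X(x) = x^3, so X = U^(1/3)
u = rng.uniform(size=100_000)
x = u ** (1 / 3)
y = x ** 2  # the transformed variable Y = X^2

# Compare the empirical CDF of Y with the derived F_Y(y) = y^(3/2)
for q in [0.2, 0.5, 0.8]:
    print(q, np.mean(y <= q), q ** 1.5)

# Check that f_Y(y) = (3/2) y^(1/2) integrates to 1 over (0, 1)
grid = np.linspace(0, 1, 10_001)
print(np.trapz(1.5 * np.sqrt(grid), grid))
```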

One thing you might note in the last example is that great care was used to subscript the cumulative distribution functions and probability density functions with either an X or a Y to indicate to which random variable the functions belonged. For example, in finding the cumulative distribution function of Y, we started with the cumulative distribution function of Y, and ended up with a cumulative distribution function of X! If we didn't use the subscripts, we would have had a good chance of throwing up our hands and botching the calculation. In short, using subscripts is a good habit to follow!

Example 22-2

 

[Figure: graph of the function y = (1 − x)³ over 0 < x < 1]

Let X be a continuous random variable with the following probability density function:

 

f(x) = 3(1-x)^2

for 0 < x < 1. What is the probability density function of Y = (1 − X)^3?

Solution

If you look at the graph of the function (above and to the right) of:

Y = (1 - X)^3

you might note that the function is a decreasing function of X, and 0<y<1. That noted, let's now use the distribution function technique to find the p.d.f. of Y. First, we find the cumulative distribution function of Y:

F_Y(y) = P(Y \le y) = P((1-X)^3 \le y) = P(X \ge 1 - y^{1/3}) = \int_{1-y^{1/3}}^{1} 3(1-x)^2\, dx = \left[-(1-x)^3\right]_{1-y^{1/3}}^{1} = y
Having shown that the cumulative distribution function of Y is:

F_Y(y) = y

for 0<y<1, we now just need to differentiate F(y) to get the probability density function f(y). Doing so, we get:

f_Y(y) = F_Y'(y) = 1

for 0<y<1. That is, Y is a U(0,1) random variable. (Again, you might find it reassuring to verify that f(y) does indeed integrate to 1 over the support of y.)


22.2 - Change-of-Variable Technique


On the last page, we used the distribution function technique in two different examples. In the first example, the transformation of X involved an increasing function, while in the second example, the transformation of X involved a decreasing function. On this page, we'll generalize what we did there first for an increasing function and then for a decreasing function. The generalizations lead to what is called the change-of-variable technique.

Generalization for an Increasing Function

Let X be a continuous random variable with a generic p.d.f. f(x) defined over the support c1<x<c2. And, let Y=u(X) be a continuous, increasing function of X with inverse function X=v(Y). Here's a picture of what the continuous, increasing function might look like:

The blue curve, of course, represents the continuous and increasing function Y=u(X). If you put an x-value, such as c1 and c2, into the function Y=u(X), you get a y-value, such as u(c1) and u(c2). But, because the function is continuous and increasing, an inverse function X=v(Y) exists. In that case, if you put a y-value into the function X=v(Y), you get an x-value, such as v(y).

Okay, now that we have described the scenario, let's derive the distribution function of Y. It is:

F_Y(y) = P(Y \le y) = P(u(X) \le y) = P(X \le v(y)) = \int_{c_1}^{v(y)} f(x)\, dx

for d_1 = u(c_1) < y < u(c_2) = d_2. The first equality holds from the definition of the cumulative distribution function of Y. The second equality holds because Y = u(X). The third equality holds because, as shown in red on the following graph, for the portion of the function for which u(X) ≤ y, it is also true that X ≤ v(y):

[Figure: graph of the continuous, increasing function Y = u(X) with inverse X = v(Y); the values u(c_1), u(c_2), y, and v(y) are marked on the axes]

And, the last equality holds from the definition of probability for a continuous random variable X. Now, we just have to take the derivative of FY(y), the cumulative distribution function of Y, to get fY(y), the probability density function of Y. The Fundamental Theorem of Calculus, in conjunction with the Chain Rule, tells us that the derivative is:

f_Y(y) = F_Y'(y) = f_X(v(y)) \cdot v'(y)

for d1=u(c1)<y<u(c2)=d2.

Generalization for a Decreasing Function

Let X be a continuous random variable with a generic p.d.f. f(x) defined over the support c1<x<c2. And, let Y=u(X) be a continuous, decreasing function of X with inverse function X=v(Y). Here's a picture of what the continuous, decreasing function might look like:

[Figure: graph of the continuous, decreasing function Y = u(X) with inverse X = v(Y); the values u(c_1), u(c_2), y, and v(y) are marked on the axes]

The blue curve, of course, represents the continuous and decreasing function Y=u(X). Again, if you put an x-value, such as c1 and c2, into the function Y=u(X), you get a y-value, such as u(c1) and u(c2). But, because the function is continuous and decreasing, an inverse function X=v(Y) exists. In that case, if you put a y-value into the function X=v(Y), you get an x-value, such as v(y).

That said, the distribution function of Y is then:

F_Y(y) = P(Y \le y) = P(u(X) \le y) = P(X \ge v(y)) = 1 - P(X \le v(y)) = 1 - \int_{c_1}^{v(y)} f(x)\, dx

for d_2 = u(c_2) < y < u(c_1) = d_1. The first equality holds from the definition of the cumulative distribution function of Y. The second equality holds because Y = u(X). The third equality holds because, as shown in red on the following graph, for the portion of the function for which u(X) ≤ y, it is also true that X ≥ v(y):

[Figure: graph of the decreasing function Y = u(X), with the region for which u(X) ≤ y, that is X ≥ v(y), shown in red]

The fourth equality holds from the rule of complementary events. And, the last equality holds from the definition of probability for a continuous random variable X. Now, we just have to take the derivative of FY(y), the cumulative distribution function of Y, to get fY(y), the probability density function of Y. Again, the Fundamental Theorem of Calculus, in conjunction with the Chain Rule, tells us that the derivative is:

f_Y(y) = F_Y'(y) = -f_X(v(y)) \cdot v'(y)

for d_2 = u(c_2) < y < u(c_1) = d_1. You might be alarmed that the p.d.f. f(y) seems to be negative, but note that the derivative v'(y) is negative, because X = v(Y) is a decreasing function of Y. Therefore, the two negatives cancel each other out, making f(y) positive.

Phew! We have now derived what is called the change-of-variable technique first for an increasing function and then for a decreasing function. But, continuous, increasing functions and continuous, decreasing functions, by their one-to-one nature, are both invertible functions. Let's, once and for all, then write the change-of-variable technique for any generic invertible function.

Definition. Let X be a continuous random variable with generic probability density function f(x) defined over the support c1<x<c2. And, let Y=u(X) be an invertible function of X with inverse function X=v(Y). Then, using the change-of-variable technique, the probability density function of Y is:

f_Y(y) = f_X(v(y)) \times |v'(y)|

defined over the support u(c1)<y<u(c2).

Having summarized the change-of-variable technique, once and for all, let's revisit an example.

Example 22-1 Continued

Let's return to our example in which X is a continuous random variable with the following probability density function:

f(x)=3x2

for 0<x<1. Use the change-of-variable technique to find the probability density function of Y=X2.

Solution

Note that the function:

Y=X2

defined over the interval 0<x<1 is an invertible function. The inverse function is:

x = v(y) = \sqrt{y} = y^{1/2}

for 0<y<1. (That range is because, when x=0,y=0; and when x=1,y=1). Now, taking the derivative of v(y), we get:

v'(y) = \frac{1}{2} y^{-1/2}

Therefore, the change-of-variable technique:

f_Y(y) = f_X(v(y)) \times |v'(y)|

tells us that the probability density function of Y is:

f_Y(y) = 3\left[y^{1/2}\right]^2 \cdot \frac{1}{2} y^{-1/2}

And, simplifying we get that the probability density function of Y is:

f_Y(y) = \frac{3}{2} y^{1/2}

for 0<y<1. We shouldn't be surprised by this result, as it is the same result that we obtained using the distribution function technique.

Example 22-2 continued

Let's return to our example in which X is a continuous random variable with the following probability density function:

f(x) = 3(1-x)^2

for 0 < x < 1. Use the change-of-variable technique to find the probability density function of Y = (1 − X)^3.

Solution

Note that the function:

Y = (1 - X)^3

defined over the interval 0<x<1 is an invertible function. The inverse function is:

x = v(y) = 1 - y^{1/3}

for 0<y<1. (That range is because, when x=0,y=1; and when x=1,y=0). Now, taking the derivative of v(y), we get:

v'(y) = -\frac{1}{3} y^{-2/3}

Therefore, the change-of-variable technique:

f_Y(y) = f_X(v(y)) \times |v'(y)|

tells us that the probability density function of Y is:

f_Y(y) = 3\left[1 - (1 - y^{1/3})\right]^2 \left| -\frac{1}{3} y^{-2/3} \right| = 3 y^{2/3} \cdot \frac{1}{3} y^{-2/3}

And, simplifying we get that the probability density function of Y is:

f_Y(y) = 1

for 0<y<1. Again, we shouldn't be surprised by this result, as it is the same result that we obtained using the distribution function technique.
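
As a quick numerical cross-check (a sketch, not part of the original solution), we can simulate X from f(x) = 3(1 − x)² and confirm that Y = (1 − X)³ behaves like a U(0,1) random variable; the inverse-c.d.f. sampler used for X and the sample size are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw X from f(x) = 3(1-x)^2 on (0,1): F_X(x) = 1 - (1-x)^3, so X = 1 - (1-U)^(1/3)
u = rng.uniform(size=100_000)
x = 1 - (1 - u) ** (1 / 3)
y = (1 - x) ** 3  # the transformed variable

# If the derivation is right, Y should look U(0,1): mean ~ 1/2, variance ~ 1/12
print(y.mean(), y.var())
print([np.mean(y <= q) for q in (0.25, 0.5, 0.75)])  # should be close to 0.25, 0.5, 0.75
```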


22.3 - Two-to-One Functions


You might have noticed that all of the examples we have looked at so far involved monotonic functions that, because of their one-to-one nature, could therefore be inverted. The question naturally arises then as to how we modify the change-of-variable technique in the situation in which the transformation is not monotonic, and therefore not one-to-one. That's what we'll explore on this page! We'll start with an example in which the transformation is two-to-one. We'll use the distribution function technique to find the p.d.f of the transformed random variable. In so doing, we'll take note of how the change-of-variable technique must be modified to handle the two-to-one portion of the transformation. After summarizing the necessary modification to the change-of-variable technique, we'll take a look at another example using the change-of-variable technique.

Example 22-3

Suppose X is a continuous random variable with probability density function:

f(x) = \frac{x^2}{3}

for −1 < x < 2. What is the p.d.f. of Y = X^2?

Solution

First, note that the transformation:

Y=X2

is not one-to-one over the interval −1 < x < 2:

[Figure: graph of y = x² over −1 < x < 2, showing the two inverse branches X_1 = −√Y = v_1(Y) and X_2 = +√Y = v_2(Y)]

For example, in the interval −1 < x < 1, if we take the inverse of Y = X^2, we get:

X_1 = -\sqrt{Y} = v_1(Y)

for −1 < x < 0, and:

X_2 = +\sqrt{Y} = v_2(Y)

for 0<x<1.

As the graph suggests, the transformation is two-to-one between when 0<y<1, and one-to-one when 1<y<4. So, let's use the distribution function technique, separately, over each of these ranges. First, consider when 0<y<1. In that case:

F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y})

The first equality holds by the definition of the cumulative distribution function. The second equality holds because the transformation of interest is Y=X2. The third equality holds, because when X2y, the random variable X is between the positive and negative square roots of y. And, the last equality holds again by the definition of the cumulative distribution function. Now, taking the derivative of the cumulative distribution function F(y), we get (from the Fundamental Theorem of Calculus and the Chain Rule) the probability density function f(y):

f_Y(y) = F_Y'(y) = f_X(\sqrt{y}) \cdot \frac{1}{2} y^{-1/2} + f_X(-\sqrt{y}) \cdot \frac{1}{2} y^{-1/2}

Using what we know about the probability density function of X:

f(x) = \frac{x^2}{3}

we get:

f_Y(y) = \frac{(\sqrt{y})^2}{3} \cdot \frac{1}{2} y^{-1/2} + \frac{(-\sqrt{y})^2}{3} \cdot \frac{1}{2} y^{-1/2}

And, simplifying, we get:

f_Y(y) = \frac{1}{6} y^{1/2} + \frac{1}{6} y^{1/2} = \frac{\sqrt{y}}{3}

for 0<y<1. Note that it readily becomes apparent that in the case of a two-to-one transformation, we need to sum two terms, each of which arises from a one-to-one transformation.

So, we've found the p.d.f. of Y when 0<y<1. Now, we have to find the p.d.f. of Y when 1<y<4. In that case:

F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(X \le \sqrt{y}) = F_X(\sqrt{y})

The first equality holds by the definition of the cumulative distribution function. The second equality holds because Y = X^2. The third equality holds because, when 1 < y < 4, the event X^2 ≤ y is the same as X ≤ √y (the lower limit −√y falls below the support of X). And, the last equality holds again by the definition of the cumulative distribution function. Now, taking the derivative of the cumulative distribution function F(y), we get (from the Fundamental Theorem of Calculus and the Chain Rule) the probability density function f(y):

f_Y(y) = F_Y'(y) = f_X(\sqrt{y}) \cdot \frac{1}{2} y^{-1/2}

Again, using what we know about the probability density function of X, and simplifying, we get:

f_Y(y) = \frac{(\sqrt{y})^2}{3} \cdot \frac{1}{2} y^{-1/2} = \frac{\sqrt{y}}{6}

for 1<y<4.
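
Here is a short simulation sketch (illustrative only, not part of the original example) of the piecewise density we just derived: it draws X from f(x) = x²/3 on (−1, 2) via the inverse c.d.f. F_X(x) = (x³ + 1)/9 and compares the empirical c.d.f. of Y = X² with the c.d.f. implied by the two pieces.

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw X from f(x) = x^2/3 on (-1, 2): F_X(x) = (x^3 + 1)/9, so X = cbrt(9U - 1)
u = rng.uniform(size=200_000)
x = np.cbrt(9 * u - 1)
y = x ** 2

def cdf_y(t):
    # CDF implied by the derived piecewise density: sqrt(y)/3 on (0,1), sqrt(y)/6 on (1,4)
    return 2 * t ** 1.5 / 9 if t < 1 else (t ** 1.5 + 1) / 9

for t in [0.5, 1.5, 3.0]:
    print(t, np.mean(y <= t), cdf_y(t))  # empirical vs. analytic CDF
```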

Now that we've seen how the distribution function technique works when we have a two-to-one function, we should now be able to summarize the necessary modifications to the change-of-variable technique.

Generalization

Let X be a continuous random variable with probability density function f(x) for c1<x<c2.

Let Y=u(X) be a continuous two-to-one function of X, which can be “broken up” into two one-to-one invertible functions with:

X1=v1(Y) and X2=v2(Y)

  1. Then, the probability density function for the two-to-one portion of Y is:

    f_Y(y) = f_X(v_1(y)) \cdot |v_1'(y)| + f_X(v_2(y)) \cdot |v_2'(y)|

    for the “appropriate support” for y. That is, you have to add the one-to-one portions together.

  2. And, the probability density function for the one-to-one portion of Y is, as always:

    f_Y(y) = f_X(v_2(y)) \cdot |v_2'(y)|

    for the “appropriate support” for y.

Example 22-4

Suppose X is a continuous random variable that follows the standard normal distribution with, of course, −∞ < x < ∞. Use the change-of-variable technique to show that the p.d.f. of Y = X^2 is the chi-square distribution with 1 degree of freedom.

Solution

The transformation Y = X^2 is two-to-one over the entire support −∞ < x < ∞:

[Figure: graph of y = x² over the real line, with the two inverse branches X_1 = v_1(Y) = −√Y and X_2 = v_2(Y) = +√Y]

That is, when −∞ < x < 0, we have:

X_1 = -\sqrt{Y} = v_1(Y)

and when 0 < x < ∞, we have:

X_2 = +\sqrt{Y} = v_2(Y)

Then, the change-of-variable technique tells us that, over the two-to-one portion of the transformation, that is, when 0 < y < ∞:

f_Y(y) = f_X(-\sqrt{y}) \left|-\frac{1}{2} y^{-1/2}\right| + f_X(\sqrt{y}) \left|\frac{1}{2} y^{-1/2}\right|

Recalling the p.d.f. of the standard normal distribution:

f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{x^2}{2}\right]

the p.d.f. of Y is then:

f_Y(y) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{(-\sqrt{y})^2}{2}\right] \left|-\frac{1}{2} y^{-1/2}\right| + \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{(\sqrt{y})^2}{2}\right] \left|\frac{1}{2} y^{-1/2}\right|

Adding the terms together, and simplifying a bit, we get:

f_Y(y) = 2 \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{y}{2}\right] \cdot \frac{1}{2} y^{-1/2}

Crossing out the 2s, recalling that Γ(1/2) = √π, and rewriting things just a bit, we should be able to recognize that, with 0 < y < ∞, the probability density function of Y:

f_Y(y) = \frac{1}{\Gamma(1/2)\, 2^{1/2}}\, e^{-y/2}\, y^{-1/2}

is indeed the p.d.f. of a chi-square random variable with 1 degree of freedom!
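
For readers who like to see the algebra verified numerically, the following sketch (ours, not from the text) squares simulated standard normal values and checks the result against the chi-square(1) mean, variance, and c.d.f.; the c.d.f. comparison uses the identity P(Y ≤ t) = P(|Z| ≤ √t) = erf(√(t/2)).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
z = rng.standard_normal(200_000)
y = z ** 2

# A chi-square(1) random variable has mean 1 and variance 2
print(y.mean(), y.var())

# P(Y <= t) = erf(sqrt(t/2)); compare with the empirical CDF at a few points
for t in [0.5, 1.0, 3.84]:
    print(t, np.mean(y <= t), erf(sqrt(t / 2)))
```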


22.4 - Simulating Observations


Now that we've learned the mechanics of the distribution function and change-of-variable techniques to find the p.d.f. of a transformation of a random variable, we'll now turn our attention for a few minutes to an application of the distribution function technique. In doing so, we'll learn how statistical software, such as Minitab or SAS, generates (or "simulates") 1000 random numbers that follow a particular probability distribution. More specifically, we'll explore how statistical software simulates, say, 1000 random numbers from an exponential distribution with mean θ=5.

The Idea

If we take a look at the cumulative distribution function of an exponential random variable with a mean of θ=5:

[Figure: C.D.F. F(x) for an exponential random variable with mean 5, plotted for 0 ≤ x ≤ 15]

the idea might just jump out at us. You might notice that the cumulative distribution function F(x) is a number (a cumulative probability, in fact!) between 0 and 1. So, one strategy we might use to generate 1000 numbers following an exponential distribution with a mean of 5 is:

  1. Generate a Y ∼ U(0,1) random number. That is, generate a number between 0 and 1 such that each number between 0 and 1 is equally likely.
  2. Then, use the inverse of Y = F(x) to get a random number X = F⁻¹(y) whose distribution function is F(x). This is, in fact, illustrated on the graph. If F(x) = 0.8, for example, then the inverse X is about 8.
  3. Repeat steps 1 and 2 one thousand times.

By looking at the graph, you should get the idea, by using this strategy, that the shape of the distribution function dictates the probability distribution of the resulting X values. In this case, the steepness of the curve up to about F(x)=0.8 suggests that most of the X values will be less than 8. That's what the probability density function of an exponential random variable with a mean of 5 suggests should happen:

[Figure: P.D.F. f(x) for an exponential random variable with mean 5, plotted for 0 ≤ x ≤ 15]

We can even do the calculation, of course, to illustrate this point. If X is an exponential random variable with a mean of 5, then:

P(X < 8) = 1 - P(X > 8) = 1 - e^{-8/5} = 0.80

A theorem (naturally!) formalizes our idea of how to simulate random numbers following a particular probability distribution.

Theorem

Let Y ∼ U(0,1). Let F(x) have the properties of a distribution function of the continuous type with F(a) = 0 and F(b) = 1. Suppose that F(x) is strictly increasing on the support a < x < b, where a and b could be −∞ and ∞, respectively. Then, the random variable X defined by:

X = F^{-1}(Y)

is a continuous random variable with cumulative distribution function F(x).

Proof.

In order to prove the theorem, we need to show that the cumulative distribution function of X is F(x). That is, we need to show:

P(X \le x) = F(x)

It turns out that the proof is a one-liner! Here it is:

P(X \le x) = P(F^{-1}(Y) \le x) = P(Y \le F(x)) = F(x)

We've set out to prove what we intended, namely that:

P(X \le x) = F(x)

Well, okay, maybe some explanation is needed! The first equality in the one-line proof holds, because:

X = F^{-1}(Y)

Then, the second equality holds because of the red portion of this graph:

[Figure: graph of y = F(x), showing that the event F⁻¹(Y) ≤ x corresponds to the event Y ≤ F(x)]

That is, when:

F^{-1}(Y) \le x

is true, so is

Y \le F(x)

Finally, the last equality holds because it is assumed that Y is a uniform(0, 1) random variable, and therefore the probability that Y is less than or equal to some y is, in fact, y itself:

P(Y \le y) = F(y) = \int_0^y 1\, dt = y

That means that the probability that Y is less than or equal to some F(x) is, in fact, F(x) itself:

P(Y \le F(x)) = F(x)

Our one-line proof is complete!

Example 22-5

A student randomly draws the following three uniform(0, 1) numbers:

0.2 0.5 0.9

Use the three uniform(0,1) numbers to generate three random numbers that follow an exponential distribution with mean θ=5.

Solution

The cumulative distribution function of an exponential random variable with a mean of 5 is:

y = F(x) = 1 - e^{-x/5}

for 0 ≤ x < ∞. We need to invert the cumulative distribution function, that is, solve for x, in order to be able to determine the exponential(5) random numbers. Manipulating the above equation a bit, we get:

1 - y = e^{-x/5}

Then, taking the natural log of both sides, we get:

\log(1-y) = -\frac{x}{5}

And, multiplying both sides by −5, we get:

x = -5\log(1-y)

for 0<y<1. Now, it's just a matter of inserting the student's three random U(0,1) numbers into the above equation to get our three exponential(5) random numbers:

  • If y=0.2, we get x=1.1
  • If y=0.5, we get x=3.5
  • If y=0.9, we get x=11.5

We would simply continue the same process — that is, generating y, a random U(0,1) number, inserting y into the above equation, and solving for x — 997 more times if we wanted to generate 1000 exponential(5) random numbers. Of course, we wouldn't really do it by hand, but rather let statistical software do it for us. At least we now understand how random number generation works!
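
A minimal Python sketch of this inverse-c.d.f. recipe is shown below; the function name and seed are our own choices, but the idea is exactly the one statistical software uses internally.

```python
import numpy as np

rng = np.random.default_rng(4)

def exponential_via_inverse_cdf(n, theta=5.0):
    """Simulate n exponential(theta) values by inverting F(x) = 1 - exp(-x/theta)."""
    y = rng.uniform(size=n)          # step 1: Y ~ U(0,1)
    return -theta * np.log(1 - y)    # step 2: X = F^{-1}(Y) = -theta * log(1 - Y)

# The student's three U(0,1) draws from Example 22-5
print(np.round(-5 * np.log(1 - np.array([0.2, 0.5, 0.9])), 1))  # about 1.1, 3.5, 11.5

# 1000 simulated values should have mean near 5 and variance near 25
x = exponential_via_inverse_cdf(1000)
print(x.mean(), x.var())
```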


Lesson 23: Transformations of Two Random Variables


Introduction

In this lesson, we consider the situation where we have two random variables and we are interested in the joint distribution of two new random variables which are a transformation of the original one. Such a transformation is called a bivariate transformation. We use a generalization of the change of variables technique which we learned in Lesson 22. We provide examples of random variables whose density functions can be derived through a bivariate transformation.

Objectives

Upon completion of this lesson, you should be able to:

  • To learn how to use the change-of-variable technique to find the probability distribution of Y1=u1(X1,X2),Y2=u2(X1,X2), a one-to-one transformation of the two random variables X1 and X2.

23.1 - Change-of-Variables Technique


Recall, that for the univariate (one random variable) situation: Given X with pdf f(x) and the transformation Y=u(X) with the single-valued inverse X=v(Y), then the pdf of Y is given by

g(y) = |v'(y)| \cdot f[v(y)].

Now, suppose (X_1, X_2) has joint density f(x_1, x_2) and support S_X.

Let (Y1,Y2) be some function of (X1,X2) defined by Y1=u1(X1,X2) and Y2=u2(X1,X2) with the single-valued inverse given by X1=v1(Y1,Y2) and X2=v2(Y1,Y2). Let SY be the support of Y1,Y2.

Then, we usually find S_Y by considering the image of S_X under the transformation (Y_1, Y_2). That is, a point (y_1, y_2) belongs to S_Y whenever the corresponding point (x_1, x_2) belongs to S_X, where

x_1 = v_1(y_1, y_2), \quad x_2 = v_2(y_1, y_2)

The joint pdf of Y_1 and Y_2 is

g(y_1, y_2) = |J|\, f[v_1(y_1, y_2), v_2(y_1, y_2)]

In the above expression, |J| refers to the absolute value of the Jacobian, J. The Jacobian, J, is given by

J = \begin{vmatrix} \frac{\partial v_1(y_1,y_2)}{\partial y_1} & \frac{\partial v_1(y_1,y_2)}{\partial y_2} \\ \frac{\partial v_2(y_1,y_2)}{\partial y_1} & \frac{\partial v_2(y_1,y_2)}{\partial y_2} \end{vmatrix}

i.e. it is the determinant of the matrix

\begin{pmatrix} \frac{\partial v_1(y_1,y_2)}{\partial y_1} & \frac{\partial v_1(y_1,y_2)}{\partial y_2} \\ \frac{\partial v_2(y_1,y_2)}{\partial y_1} & \frac{\partial v_2(y_1,y_2)}{\partial y_2} \end{pmatrix}

Example 23-1

Suppose X1 and X2 are independent exponential random variables with parameter λ=1 so that

f_{X_1}(x_1) = e^{-x_1}, \quad 0 < x_1 < \infty \qquad f_{X_2}(x_2) = e^{-x_2}, \quad 0 < x_2 < \infty

The joint pdf is given by

f(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2) = e^{-x_1 - x_2}, \quad 0 < x_1 < \infty, \; 0 < x_2 < \infty

Consider the transformation: Y_1 = X_1 − X_2, Y_2 = X_1 + X_2. We wish to find the joint distribution of Y_1 and Y_2.

We have

x_1 = \frac{y_1 + y_2}{2}, \quad x_2 = \frac{y_2 - y_1}{2}

OR

v_1(y_1, y_2) = \frac{y_1 + y_2}{2}, \quad v_2(y_1, y_2) = \frac{y_2 - y_1}{2}

The Jacobian, J is

J = \begin{vmatrix} \frac{\partial}{\partial y_1}\left(\frac{y_1+y_2}{2}\right) & \frac{\partial}{\partial y_2}\left(\frac{y_1+y_2}{2}\right) \\ \frac{\partial}{\partial y_1}\left(\frac{y_2-y_1}{2}\right) & \frac{\partial}{\partial y_2}\left(\frac{y_2-y_1}{2}\right) \end{vmatrix} = \begin{vmatrix} \frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{vmatrix} = \frac{1}{2}

So,

g(y_1, y_2) = e^{-v_1(y_1,y_2) - v_2(y_1,y_2)} \left|\frac{1}{2}\right| = e^{-\left[\frac{y_1+y_2}{2}\right] - \left[\frac{y_2-y_1}{2}\right]} \left|\frac{1}{2}\right| = \frac{e^{-y_2}}{2}

Now, we determine the support of (Y_1, Y_2). Since 0 < x_1 < ∞ and 0 < x_2 < ∞, we have 0 < (y_1 + y_2)/2 < ∞ and 0 < (y_2 − y_1)/2 < ∞, or 0 < y_1 + y_2 < ∞ and 0 < y_2 − y_1 < ∞. This may be rewritten as −y_2 < y_1 < y_2, 0 < y_2 < ∞.

Using the joint pdf, we may find the marginal pdf of Y2 as

g(y_2) = \int_{-\infty}^{\infty} g(y_1, y_2)\, dy_1 = \int_{-y_2}^{y_2} \frac{1}{2} e^{-y_2}\, dy_1 = \frac{1}{2}\left[ e^{-y_2}\, y_1 \right]_{y_1 = -y_2}^{y_1 = y_2} = \frac{1}{2} e^{-y_2}(y_2 + y_2) = y_2 e^{-y_2}, \quad 0 < y_2 < \infty

Similarly, we may find the marginal pdf of Y1 as

g(y_1) = \begin{cases} \int_{-y_1}^{\infty} \frac{1}{2} e^{-y_2}\, dy_2 = \frac{1}{2} e^{y_1}, & -\infty < y_1 < 0 \\ \int_{y_1}^{\infty} \frac{1}{2} e^{-y_2}\, dy_2 = \frac{1}{2} e^{-y_1}, & 0 < y_1 < \infty \end{cases}

Equivalently,

g(y_1) = \frac{1}{2} e^{-|y_1|}, \quad -\infty < y_1 < \infty

This pdf is known as the double exponential or Laplace pdf.
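
A brief simulation sketch (our own illustration, not part of the derivation): differences of two independent exponential(1) variables should match the Laplace c.d.f. derived above, and the sums should match the gamma density y₂e^{−y₂}.

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.exponential(1.0, size=200_000)
x2 = rng.exponential(1.0, size=200_000)

y1 = x1 - x2  # should follow the Laplace (double exponential) distribution
y2 = x1 + x2  # should follow the gamma density y2 * exp(-y2)

# Laplace CDF: 0.5*exp(t) for t < 0, and 1 - 0.5*exp(-t) for t >= 0
for t in [-1.0, 0.0, 1.5]:
    analytic = 0.5 * np.exp(t) if t < 0 else 1 - 0.5 * np.exp(-t)
    print(t, np.mean(y1 <= t), analytic)

print(y2.mean(), y2.var())  # gamma with shape 2, scale 1: mean 2, variance 2
```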


23.2 - Beta Distribution


Let X_1 and X_2 have independent gamma distributions with parameters (α, θ) and (β, θ), respectively. Therefore, the joint pdf of X_1 and X_2 is given by

f(x_1, x_2) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)\,\theta^{\alpha+\beta}}\, x_1^{\alpha-1} x_2^{\beta-1} \exp\left(-\frac{x_1 + x_2}{\theta}\right), \quad 0 < x_1 < \infty, \; 0 < x_2 < \infty.

We make the following transformation:

Y_1 = \frac{X_1}{X_1 + X_2}, \quad Y_2 = X_1 + X_2

The inverse transformation is given by

X_1 = Y_1 Y_2, \quad X_2 = Y_2 - Y_1 Y_2

The Jacobian is

J = \begin{vmatrix} y_2 & y_1 \\ -y_2 & 1 - y_1 \end{vmatrix} = y_2(1 - y_1) + y_1 y_2 = y_2

The joint pdf g(y1,y2) is

g(y_1, y_2) = |y_2|\, \frac{1}{\Gamma(\alpha)\Gamma(\beta)\,\theta^{\alpha+\beta}} (y_1 y_2)^{\alpha-1} (y_2 - y_1 y_2)^{\beta-1} e^{-y_2/\theta}

with support 0 < y_1 < 1, 0 < y_2 < ∞.

It may be shown that the marginal pdf of Y1 is

g(y_1) = \frac{y_1^{\alpha-1}(1-y_1)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)} \int_0^{\infty} \frac{y_2^{\alpha+\beta-1}}{\theta^{\alpha+\beta}}\, e^{-y_2/\theta}\, dy_2 = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y_1^{\alpha-1}(1-y_1)^{\beta-1}, \quad 0 < y_1 < 1.

Y1 is said to have a beta pdf with parameters α and β.
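
As an illustrative check (not part of the derivation), the sketch below simulates Y₁ = X₁/(X₁ + X₂) for independent gamma variables and compares its sample mean and variance with the beta(α, β) moments α/(α + β) and αβ/((α + β)²(α + β + 1)); the particular parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta_, theta = 2.0, 3.0, 5.0  # arbitrary parameter values for illustration
x1 = rng.gamma(shape=alpha, scale=theta, size=200_000)
x2 = rng.gamma(shape=beta_, scale=theta, size=200_000)

y1 = x1 / (x1 + x2)  # should be beta(alpha, beta), free of theta

print(y1.mean(), alpha / (alpha + beta_))
print(y1.var(), alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)))
```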


23.3 - F Distribution


We describe a very useful distribution in Statistics known as the F distribution.

Let U and V be independent chi-square variables with r1 and r2 degrees of freedom, respectively. The joint pdf is

g(u, v) = \frac{u^{r_1/2 - 1} e^{-u/2}\, v^{r_2/2 - 1} e^{-v/2}}{\Gamma(r_1/2)\, 2^{r_1/2}\, \Gamma(r_2/2)\, 2^{r_2/2}}, \quad 0 < u < \infty, \; 0 < v < \infty

Define the random variable W = \frac{U/r_1}{V/r_2}.

This time we use the distribution function technique described in lesson 22,

F(w) = P(W \le w) = P\left(\frac{U/r_1}{V/r_2} \le w\right) = P\left(U \le \frac{r_1}{r_2}\, w V\right) = \int_0^{\infty}\int_0^{(r_1/r_2)wv} g(u, v)\, du\, dv

F(w) = \frac{1}{\Gamma(r_1/2)\Gamma(r_2/2)} \int_0^{\infty}\left[\int_0^{(r_1/r_2)wv} \frac{u^{r_1/2 - 1} e^{-u/2}}{2^{(r_1+r_2)/2}}\, du\right] v^{r_2/2 - 1} e^{-v/2}\, dv

By differentiating the cdf , it can be shown that f(w)=F(w) is given by

f(w) = \frac{(r_1/r_2)^{r_1/2}\, \Gamma[(r_1 + r_2)/2]\, w^{r_1/2 - 1}}{\Gamma(r_1/2)\, \Gamma(r_2/2)\, [1 + (r_1 w / r_2)]^{(r_1 + r_2)/2}}, \quad w > 0

A random variable with the pdf f(w) is said to have an F distribution with r1 and r2 degrees of freedom. We write this as F(r1,r2). Table VII in Appendix B of the textbook can be used to find probabilities for a random variable with the F(r1,r2) distribution.

It contains the F-values for various cumulative probabilities (0.95,0.975,0.99) (or the equivalent upper − αth probabilities (0.05,0.025,0.01)) of various F(r1,r2) distributions.

When using this table, it is helpful to note that if a random variable (say, W) has the F(r_1, r_2) distribution, then its reciprocal 1/W has the F(r_2, r_1) distribution.

Illustration

The shape of the F distribution is determined by the degrees of freedom r_1 and r_2. The histogram below shows how an F random variable is generated using 1000 observations each from two chi-square random variables (U and V) with degrees of freedom 4 and 8, respectively, and forming the ratio (U/4)/(V/8).

The lower plot (below histogram) illustrates how the shape of an F distribution changes with the degrees of freedom r1 and r2.

[Figure: histogram of the 1000 simulated F(4, 8) values with the F(4, 8) density overlaid; a second plot compares the F(2, 4), F(4, 6), and F(12, 12) density curves]
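
The illustration above can be reproduced with a few lines of code; here is a sketch (ours, with an arbitrary seed) that forms the ratio of two mean-scaled chi-squares and checks it against the F(4, 8) mean r₂/(r₂ − 2).

```python
import numpy as np

rng = np.random.default_rng(7)
r1, r2 = 4, 8

# Mimic the illustration: 1000 draws each from chi-square(4) and chi-square(8),
# then form the ratio of the two mean-scaled chi-squares
u = rng.chisquare(r1, size=1000)
v = rng.chisquare(r2, size=1000)
w = (u / r1) / (v / r2)  # 1000 simulated values that should follow F(4, 8)

# F(r1, r2) has mean r2/(r2 - 2) whenever r2 > 2
print(w.mean(), r2 / (r2 - 2))

# Histogram counts give a rough picture of the right-skewed F(4, 8) shape
counts, edges = np.histogram(w, bins=np.arange(0, 13))
print(counts)
```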

Lesson 24: Several Independent Random Variables


Introduction


In the previous lessons, we explored functions of random variables. We'll do the same in this lesson, too, except here we'll add the requirement that the random variables be independent, and in some cases, identically distributed. Suppose, for example, that we were interested in determining the average weight of the thousands of pumpkins grown on a pumpkin farm. Since we couldn't possibly weigh all of the pumpkins on the farm, we'd want to weigh just a small random sample of pumpkins. If we let:

  • X1 denote the weight of the first pumpkin sampled
  • X2 denote the weight of the second pumpkin sampled
  • ...
  • Xn denote the weight of the nth pumpkin sampled

then we could imagine calculating the average weight of the sampled pumpkins as:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

Now, because the pumpkins were randomly sampled, we wouldn't expect the weight of one pumpkin, say X_1, to affect the weight of another pumpkin, say X_2. Therefore, X_1, X_2, ..., X_n can be assumed to be independent random variables. And, since X̄, as defined above, is a function of those independent random variables, it too must be a random variable with a certain probability distribution, a certain mean and a certain variance. Our work in this lesson will all be directed towards the end goal of being able to calculate the mean and variance of the random variable X̄. We'll learn a number of things along the way, of course, including a formal definition of a random sample, the expectation of a product of independent variables, and the mean and variance of a linear combination of independent random variables.

Objectives

Upon completion of this lesson, you should be able to:

  • To get the big picture for the remainder of the course.
  • To learn a formal definition of a random sample.
  • To learn what i.i.d. means.
  • To learn how to find the expectation of a function of n independent random variables.
  • To learn how to find the expectation of a product of functions of n independent random variables.
  • To learn how to find the mean and variance of a linear combination of random variables.
  • To learn that the expected value of the sample mean is μ.
  • To learn that the variance of the sample mean is σ2n.
  • To understand all of the proofs presented in the lesson.
  • To be able to apply the methods learned in this lesson to new problems.

24.1 - Some Motivation


Consider the population of 8 million college students. Suppose we are interested in determining μ, the unknown mean distance (in miles) from the students' schools to their hometowns. We can't possibly determine the distance for each of the 8 million students in order to calculate the population mean μ and the population variance σ2. We could, however, take a random sample of, say, 100 college students, determine:

Xi= the distance (in miles) from the home of student i for i=1,2,,100

and use the resulting data to learn about the population of college students. How could we obtain that random sample though? Would it be okay to stand outside a major classroom building on the Penn State campus, such as the Willard Building, and ask random students how far they are from their hometown? Probably not! The average distance for Penn State students probably differs greatly from that of college students attending a school in a major city, such as, say The University of California in Los Angeles (UCLA). We need to use a method that ensures that the sample is representative of all college students in the population, not just a subset of the students. Any method that ensures that our sample is truly random will suffice. The following definition formalizes what makes a sample truly random.

Definition. The random variables Xi constitute a random sample of size n if and only if:

  1. the Xi are independent, and

  2. the Xi are identically distributed, that is, each Xi comes from the same distribution f(x) with mean μ and variance σ2.

We say that the Xi are "i.i.d." (The first i. stands for independent, and the i.d. stands for identically distributed.)

Now, once we've obtained our (truly) random sample, we'll probably want to use the resulting data to calculate the sample mean:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_{100}}{100}

and sample variance:

S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{(X_1 - \bar{X})^2 + \cdots + (X_{100} - \bar{X})^2}{99}

In Stat 415, we'll learn that the sample mean X¯ is the "best" estimate of the population mean μ and the sample variance S2 is the "best" estimate of the population variance σ2. (We'll also learn in what sense the estimates are "best.") Now, before we can use the sample mean and sample variance to draw conclusions about the possible values of the unknown population mean μ and unknown population variance σ2, we need to know how X¯ and S2 behave. That is, we need to know:

  • the probability distribution of X¯ and S2
  • the theoretical mean of X¯ and S2
  • the theoretical variance of X¯ and S2

Now, note that X¯ and S2 are sums of independent random variables. That's why we are working in a lesson right now called Several Independent Random Variables. In this lesson, we'll learn about the mean and variance of the random variable X¯. Then, in the lesson called Random Functions Associated with Normal Distributions, we'll add the assumption that the Xi are measurements from a normal distribution with mean μ and variance σ2 to see what we can learn about the probability distribution of X¯ and S2. In the lesson called The Central Limit Theorem, we'll learn that those results still hold even if our measurements aren't from a normal distribution, providing we have a large enough sample. Along the way, we'll pick up a new tool for our toolbox, namely The Moment-Generating Function Technique. And in the final lesson for the Section (and Course!), we'll see another application of the Central Limit Theorem, namely using the normal distribution to approximate discrete distributions, such as the binomial and Poisson distributions. With our motivation presented, and our curiosity now piqued, let's jump right in and get going!


24.2 - Expectations of Functions of Independent Random Variables


One of our primary goals of this lesson is to determine the theoretical mean and variance of the sample mean:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

Now, assume the Xi are independent, as they should be if they come from a random sample. Then, finding the theoretical mean of the sample mean involves taking the expectation of a sum of independent random variables:

E(\bar{X}) = \frac{1}{n} E(X_1 + X_2 + \cdots + X_n)

That's why we'll spend some time on this page learning how to take expectations of functions of independent random variables! A simple example illustrates that we already have a number of techniques sitting in our toolbox ready to help us find the expectation of a sum of independent random variables.

Example 24-1


Suppose we toss a penny three times. Let X1 denote the number of heads that we get in the three tosses. And, suppose we toss a second penny two times. Let X2 denote the number of heads we get in those two tosses. If we let:

Y=X1+X2

then Y denotes the number of heads in five tosses. Note that the random variables X1 and X2 are independent and therefore Y is the sum of independent random variables. Furthermore, we know that:

  • X1 is a binomial random variable with n=3 and p=1/2
  • X2 is a binomial random variable with n=2 and p=1/2
  • Y is a binomial random variable with n=5 and p=1/2

What is the mean of Y, the sum of two independent random variables? And, what is the variance of Y?

Solution

We can calculate the mean and variance of Y in three different ways.

  1. By recognizing that Y is a binomial random variable with n=5 and p=1/2, we can use what we know about the mean and variance of a binomial random variable, namely that the mean of Y is:

    E(Y) = np = 5\left(\frac{1}{2}\right) = \frac{5}{2}

    and the variance of Y is:

    Var(Y) = np(1-p) = 5\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = \frac{5}{4}

    Since sums of independent random variables are not always going to be binomial, this approach won't always work, of course. It would be good to have alternative methods in hand!

  2. We could use the linear operator property of expectation. Before doing so, it would be helpful to note that the mean of X1 is:

    E(X_1) = np = 3\left(\frac{1}{2}\right) = \frac{3}{2}

    and the mean of X2 is:

    E(X_2) = np = 2\left(\frac{1}{2}\right) = 1

    Now, using the property, we get that the mean of Y is (thankfully) again 5/2:

    E(Y) = E(X_1 + X_2) = E(X_1) + E(X_2) = \frac{3}{2} + 1 = \frac{5}{2}

    Recall that the second equality comes from the linear operator property of expectation. Now, using the linear operator property of expectation to find the variance of Y takes a bit more work. First, we should note that the variance of X1 is:

    Var(X_1) = np(1-p) = 3\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = \frac{3}{4}

    and the variance of X2 is:

    Var(X_2) = np(1-p) = 2\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = \frac{1}{2}

    Now, we can (thankfully) show again that the variance of Y is 5/4. Because X_1 and X_2 are independent, their variances add:

    Var(Y) = Var(X_1 + X_2) = Var(X_1) + Var(X_2) = \frac{3}{4} + \frac{1}{2} = \frac{5}{4}
    Okay, as if two methods aren't enough, we still have one more method we could use.

  3. We could use the independence of the two random variables X1 and X2, in conjunction with the definition of expected value of Y as we know it. First, using the binomial formula, note that we can present the probability mass function of X1 in tabular form as:

    x_1:     0    1    2    3
    f(x_1):  1/8  3/8  3/8  1/8

    And, we can present the probability mass function of X2 in tabular form as well:

    x_2:     0    1    2
    f(x_2):  1/4  2/4  1/4

    Now, recall that if X1 and X2 are independent random variables, then:

    f(x1,x2)=f(x1)f(x2)

    We can use this result to help determine g(y), the probability mass function of Y. First note that, since Y is the sum of X1 and X2, the support of Y is {0, 1, 2, 3, 4 and 5}. Now, by brute force, we get:

    g(0) = P(Y=0) = P(X_1 = 0, X_2 = 0) = f(0,0) = f_{X_1}(0)\, f_{X_2}(0) = \frac{1}{8} \cdot \frac{1}{4} = \frac{1}{32}

    The second equality comes from the fact that the only way that Y can equal 0 is if X1=0 and X2=0, and the fourth equality comes from the independence of X1and X2. We can make a similar calculation to find the probability that Y=1:

    g(1) = P(X_1 = 0, X_2 = 1) + P(X_1 = 1, X_2 = 0) = f_{X_1}(0)\, f_{X_2}(1) + f_{X_1}(1)\, f_{X_2}(0) = \frac{1}{8} \cdot \frac{2}{4} + \frac{3}{8} \cdot \frac{1}{4} = \frac{5}{32}

    The first equality comes from the fact that there are two (mutually exclusive) ways that Y can equal 1, namely if X1=0 and X2=1 or if X1=1 and X2=0. The second equality comes from the independence of X1 and X2. We can make similar calculations to find g(2),g(3),g(4), and g(5). Once we've done that, we can present the p.m.f. of Y in tabular form as:

    y = x_1 + x_2:  0     1     2      3      4     5
    g(y):           1/32  5/32  10/32  10/32  5/32  1/32

    Then, it is a straightforward calculation to use the definition of the expected value of a discrete random variable to determine that (again!) the expected value of Y is 5/2:

    E(Y) = 0\left(\frac{1}{32}\right) + 1\left(\frac{5}{32}\right) + 2\left(\frac{10}{32}\right) + \cdots + 5\left(\frac{1}{32}\right) = \frac{80}{32} = \frac{5}{2}

    The variance of Y can be calculated similarly. (Do you want to calculate it one more time?!)

    The following summarizes the method we've used here in calculating the expected value of Y:

    E(Y) = E(X_1 + X_2) = \sum_{x_1 \in S_1} \sum_{x_2 \in S_2} (x_1 + x_2) f(x_1, x_2) = \sum_{x_1 \in S_1} \sum_{x_2 \in S_2} (x_1 + x_2) f(x_1) f(x_2) = \sum_{y \in S_Y} y\, g(y)

    The first equality comes, of course, from the definition of Y. The second equality comes from the definition of the expectation of a function of discrete random variables. The third equality comes from the independence of the random variables X1 and X2. And, the fourth equality comes from the definition of the expected value of Y, as well as the fact that g(y) can be determined by summing the appropriate joint probabilities of X1 and X2.
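
The brute-force convolution in the third method is easy to automate; the following sketch (our own, using exact fractions) rebuilds the table of g(y) and recovers E(Y) = 5/2.

```python
from math import comb
from fractions import Fraction

def binom_pmf(n, x, p=Fraction(1, 2)):
    """Binomial probability mass function, kept as an exact fraction."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Convolve the two pmfs to get g(y) for Y = X1 + X2
g = {y: Fraction(0) for y in range(6)}
for x1 in range(4):        # X1 takes values 0, 1, 2, 3
    for x2 in range(3):    # X2 takes values 0, 1, 2
        g[x1 + x2] += binom_pmf(3, x1) * binom_pmf(2, x2)

print(g)                                        # matches the tabulated g(y) above
print(sum(y * prob for y, prob in g.items()))   # E(Y) = 5/2
```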

The following theorem formally states the third method we used in determining the expected value of Y, the function of two independent random variables. We state the theorem without proof. (If you're interested, you can find a proof of it in Hogg, McKean and Craig, 2005.)

Theorem

Let X1,X2,,Xn be n independent random variables that, by their independence, have the joint probability mass function:

f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)

Let the random variable Y=u(X1,X2,,Xn) have the probability mass function g(y). Then, in the discrete case:

E(Y) = \sum_y y\, g(y) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} u(x_1, x_2, \ldots, x_n)\, f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)

provided that these summations exist. For continuous random variables, integrals replace the summations.

In the special case that we are looking for the expectation of the product of functions of n independent random variables, the following theorem will help us out.

Theorem
If X1,X2,,Xn are independent random variables and, for i=1,2,,n, the expectation E[ui(Xi)] exists, then:

E[u_1(X_1)\, u_2(X_2) \cdots u_n(X_n)] = E[u_1(X_1)]\, E[u_2(X_2)] \cdots E[u_n(X_n)]

That is, the expectation of the product is the product of the expectations.

Proof

For the sake of concreteness, let's assume that the random variables are discrete. Then, the definition of expectation gives us:

E[u_1(X_1)\, u_2(X_2) \cdots u_n(X_n)] = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} u_1(x_1)\, u_2(x_2) \cdots u_n(x_n)\, f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)

Then, since functions that don't depend on the index of the summation signs can get pulled through the summation signs, we have:

E[u_1(X_1)\, u_2(X_2) \cdots u_n(X_n)] = \left[\sum_{x_1} u_1(x_1) f_1(x_1)\right]\left[\sum_{x_2} u_2(x_2) f_2(x_2)\right] \cdots \left[\sum_{x_n} u_n(x_n) f_n(x_n)\right]

Then, by the definition, in the discrete case, of the expected value of ui(Xi), our expectation reduces to:

E[u_1(X_1)\, u_2(X_2) \cdots u_n(X_n)] = E[u_1(X_1)]\, E[u_2(X_2)] \cdots E[u_n(X_n)]

Our proof is complete. If our random variables are instead continuous, the proof would be similar. We would just need to make the obvious change of replacing the summation signs with integrals.

Let's return to our example in which we toss a penny three times, and let X1 denote the number of heads that we get in the three tosses. And, again toss a second penny two times, and let X2 denote the number of heads we get in those two tosses. In our previous work, we learned that:

  • E(X_1) = 3/2 and Var(X_1) = 3/4
  • E(X_2) = 1 and Var(X_2) = 1/2

What is the expected value of X_1^2 X_2?

Solution

We'll use the fact that the expectation of the product is the product of the expectations. Since E(X_1^2) = Var(X_1) + [E(X_1)]^2 = \frac{3}{4} + \frac{9}{4} = 3, we have:

E(X_1^2 X_2) = E(X_1^2)\, E(X_2) = 3 \times 1 = 3

24.3 - Mean and Variance of Linear Combinations


We are still working towards finding the theoretical mean and variance of the sample mean:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

If we re-write the formula for the sample mean just a bit:

\bar{X} = \frac{1}{n} X_1 + \frac{1}{n} X_2 + \cdots + \frac{1}{n} X_n

we can see more clearly that the sample mean is a linear combination of the random variables X_1, X_2, ..., X_n. Hence the title and subject of this page! That is, here on this page, we'll add a few more tools to our toolbox, namely determining the mean and variance of a linear combination of random variables X_1, X_2, ..., X_n. Before presenting and proving the major theorem on this page, let's revisit again, by way of example, why we would expect the sample mean and sample variance to have a theoretical mean and variance.

Example 24-2

A statistics instructor conducted a survey in her class. The instructor was interested in learning how many siblings, on average, the students at Penn State University have? She took a random sample of n=4 students, and asked each student how many siblings he/she has. The resulting data were: 0, 2, 1, 1. In an attempt to summarize the data she collected, the instructor calculated the sample mean and sample variance, getting:

\bar{X} = \frac{4}{4} = 1 \quad \text{and} \quad S^2 = \frac{(0-1)^2 + (2-1)^2 + (1-1)^2 + (1-1)^2}{3} = \frac{2}{3}

The instructor realized though, that if she had asked a different sample of n=4 students how many siblings they have, she'd probably get different results. So, she took a different random sample of n=4 students. The resulting data were: 4, 1, 2, 1. Calculating the sample mean and variance once again, she determined:

\bar{X} = \frac{8}{4} = 2 \quad \text{and} \quad S^2 = \frac{(4-2)^2 + (1-2)^2 + (2-2)^2 + (1-2)^2}{3} = \frac{6}{3} = 2

Hmmm, the instructor thought that was quite a different result from the first sample, so she decided to take yet another sample of n=4 students. Doing so, the resulting data were: 5, 3, 2, 2. Calculating the sample mean and variance yet again, she determined:

\bar{X} = \frac{12}{4} = 3 \quad \text{and} \quad S^2 = \frac{(5-3)^2 + (3-3)^2 + (2-3)^2 + (2-3)^2}{3} = \frac{6}{3} = 2

That's enough of this! I think you can probably see where we are going with this example. It is very clear that the values of the sample mean X¯and the sample variance S2 depend on the selected random sample. That is, X¯ and S2 are continuous random variables in their own right. Therefore, they themselves should each have a particular:

  1. probability distribution (called a "sampling distribution"),
  2. mean, and
  3. variance.

We are still in the hunt for all three of these items. The next theorem will help move us closer towards finding the mean and variance of the sample mean X¯.

Theorem

Suppose X_1, X_2, \ldots, X_n are n independent random variables with means \mu_1, \mu_2, \ldots, \mu_n and variances \sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2.

Then, the mean and variance of the linear combination Y = \sum_{i=1}^{n} a_i X_i, where a_1, a_2, \ldots, a_n are real constants, are:

\mu_Y = \sum_{i=1}^{n} a_i \mu_i

and:

\sigma_Y^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2

respectively.

Proof

Let's start with the proof for the mean first. Using the linear operator property of expectation:

\mu_Y = E(Y) = E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i) = \sum_{i=1}^{n} a_i \mu_i
Now for the proof for the variance. Starting with the definition of the variance of Y, we have:

\sigma_Y^2 = Var(Y) = E[(Y - \mu_Y)^2]

Now, substituting in what we know about Y and the mean of Y, we have:

\sigma_Y^2 = E\left[\left(\sum_{i=1}^{n} a_i X_i - \sum_{i=1}^{n} a_i \mu_i\right)^2\right]

Because the summation signs have the same index (i=1 to n), we can replace the two summation signs with one summation sign:

\sigma_Y^2 = E\left[\left(\sum_{i=1}^{n} (a_i X_i - a_i \mu_i)\right)^2\right]

And, we can factor out the constants ai:

\sigma_Y^2 = E\left[\left(\sum_{i=1}^{n} a_i (X_i - \mu_i)\right)^2\right]

Now, let's rewrite the squared term as the product of two terms. In doing so, use an index of i on the first summation sign, and an index of j on the second summation sign:

\sigma_Y^2 = E\left[\left(\sum_{i=1}^{n} a_i (X_i - \mu_i)\right)\left(\sum_{j=1}^{n} a_j (X_j - \mu_j)\right)\right]

Now, let's pull the summation signs together:

\sigma_Y^2 = E\left[\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j (X_i - \mu_i)(X_j - \mu_j)\right]

Then, by the linear operator property of expectation, we can distribute the expectation:

\sigma_Y^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j E[(X_i - \mu_i)(X_j - \mu_j)]

Now, let's rewrite the variance of Y by evaluating each of the terms from i=1 to n and j=1 to n. In doing so, recognize that when i = j, the expectation term is the variance of X_i, and when i ≠ j, the expectation term is the covariance between X_i and X_j, which by the assumed independence, is 0:

\sigma_Y^2 = \sum_{i=1}^{n} a_i^2 E[(X_i - \mu_i)^2] + \sum_{i \ne j} a_i a_j \cdot 0

Simplifying then, we get:

\sigma_Y^2 = a_1^2 E[(X_1 - \mu_1)^2] + a_2^2 E[(X_2 - \mu_2)^2] + \cdots + a_n^2 E[(X_n - \mu_n)^2]

And, simplifying yet more using variance notation:

\sigma_Y^2 = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_n^2 \sigma_n^2

Finally, we have:

\sigma_Y^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2

as was to be proved.

Example 24-3

Let X1 and X2 be independent random variables. Suppose the mean and variance of X1 are 2 and 4, respectively. Suppose, the mean and variance of X2 are 3 and 5 respectively. What is the mean and variance of X1+X2?

Solution

The mean of the sum is:

E(X1+X2)=E(X1)+E(X2)=2+3=5

and the variance of the sum is:

Var(X_1 + X_2) = (1)^2 Var(X_1) + (1)^2 Var(X_2) = 4 + 5 = 9

What is the mean and variance of X_1 − X_2?

Solution

The mean of the difference is:

E(X_1 - X_2) = E(X_1) - E(X_2) = 2 - 3 = -1

and the variance of the difference is:

Var(X_1 - X_2) = Var(X_1 + (-1)X_2) = (1)^2 Var(X_1) + (-1)^2 Var(X_2) = 4 + 5 = 9

That is, the variance of the difference in the two random variables is the same as the variance of the sum of the two random variables.

What is the mean and variance of 3X1+4X2?

Solution

The mean of the linear combination is:

E(3X_1 + 4X_2) = 3E(X_1) + 4E(X_2) = 3(2) + 4(3) = 18

and the variance of the linear combination is:

Var(3X_1 + 4X_2) = (3)^2 Var(X_1) + (4)^2 Var(X_2) = 9(4) + 16(5) = 116
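
A quick simulation sketch of Example 24-3 (illustrative only; normal distributions are assumed here simply to supply the stated means and variances, since any distributions with those moments would do):

```python
import numpy as np

rng = np.random.default_rng(8)

# Independent X1 and X2 with mean 2, variance 4 and mean 3, variance 5
x1 = rng.normal(2, np.sqrt(4), size=500_000)
x2 = rng.normal(3, np.sqrt(5), size=500_000)

for label, y in [("X1 + X2", x1 + x2), ("X1 - X2", x1 - x2), ("3X1 + 4X2", 3 * x1 + 4 * x2)]:
    print(label, round(y.mean(), 2), round(y.var(), 2))
# Expect means near 5, -1, 18 and variances near 9, 9, 116
```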


24.4 - Mean and Variance of Sample Mean


We'll finally accomplish what we set out to do in this lesson, namely to determine the theoretical mean and variance of the continuous random variable X¯. In doing so, we'll discover the major implications of the theorem that we learned on the previous page.

Let X1,X2,,Xn be a random sample of size n from a distribution (population) with mean μ and variance σ2. What is the mean, that is, the expected value, of the sample mean X¯?

Solution

Starting with the definition of the sample mean, we have:

E(\bar{X}) = E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right)

Then, using the linear operator property of expectation, we get:

E(\bar{X}) = \frac{1}{n}\left[E(X_1) + E(X_2) + \cdots + E(X_n)\right]

Now, the Xi are identically distributed, which means they have the same mean μ. Therefore, replacing E(Xi) with the alternative notation μ, we get:

E(\bar{X}) = \frac{1}{n}[\mu + \mu + \cdots + \mu]

Now, because there are n μ's in the above formula, we can rewrite the expected value as:

E(\bar{X}) = \frac{1}{n}[n\mu] = \mu

We have shown that the mean (or expected value, if you prefer) of the sample mean X¯ is μ. That is, we have shown that the mean of X¯ is the same as the mean of the individual Xi.

Let X1,X2,,Xn be a random sample of size n from a distribution (population) with mean μ and variance σ2. What is the variance of X¯?

Solution

Starting with the definition of the sample mean, we have:

Var(\bar{X}) = Var\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right)

Rewriting the term on the right so that it is clear that we have a linear combination of Xi's, we get:

Var(\bar{X}) = Var\left(\frac{1}{n} X_1 + \frac{1}{n} X_2 + \cdots + \frac{1}{n} X_n\right)

Then, applying the theorem on the last page, we get:

Var(\bar{X}) = \frac{1}{n^2} Var(X_1) + \frac{1}{n^2} Var(X_2) + \cdots + \frac{1}{n^2} Var(X_n)

Now, the Xi are identically distributed, which means they have the same variance σ2. Therefore, replacing Var(Xi) with the alternative notation σ2, we get:

Var(\bar{X}) = \frac{1}{n^2}[\sigma^2 + \sigma^2 + \cdots + \sigma^2]

Now, because there are n σ²'s in the above formula, we can rewrite the variance as:

Var(\bar{X}) = \frac{1}{n^2}[n\sigma^2] = \frac{\sigma^2}{n}

Our result indicates that as the sample size n increases, the variance of the sample mean decreases. That suggests that on the previous page, if the instructor had taken larger samples of students, she would have seen less variability in the sample means that she was obtaining. This is a good thing, but of course, in general, the costs of research studies no doubt increase as the sample size n increases. There is always a trade-off!
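
Here is a small simulation sketch (ours) of this result: repeated samples of size n = 16 from a population with mean 5 and variance 25 should produce sample means whose average is near 5 and whose variance is near 25/16. The exponential population is an arbitrary choice; any population with these moments would behave the same way.

```python
import numpy as np

rng = np.random.default_rng(9)

mu, sigma2, n = 5.0, 25.0, 16
samples = rng.exponential(mu, size=(100_000, n))  # 100,000 samples of size 16
xbar = samples.mean(axis=1)                        # one sample mean per row

print(xbar.mean(), mu)          # E(X-bar) = mu
print(xbar.var(), sigma2 / n)   # Var(X-bar) = sigma^2 / n = 25/16
```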


24.5 - More Examples


On this page, we'll just take a look at a few examples that use the material and methods we learned about in this lesson.

Example 24-4

If X1,X2,,Xn are a random sample from a population with mean μ and variance σ2, then what is:

E[(X_i - \mu)(X_j - \mu)]

for i ≠ j, i = 1, 2, ..., n?

Solution

The fact that X_1, X_2, ..., X_n constitute a random sample tells us that (1) X_i is independent of X_j, for all i ≠ j, and (2) the X_i are identically distributed. Now, we know from our previous work that if X_i is independent of X_j, for i ≠ j, then the covariance between X_i and X_j is 0. That is:

E[(X_i - \mu)(X_j - \mu)] = Cov(X_i, X_j) = 0

Example 24-5

Let X1,X2,X3 be a random sample of size n=3 from a distribution with the geometric probability mass function:

f(x) = \left(\frac{3}{4}\right)\left(\frac{1}{4}\right)^{x-1}

for x = 1, 2, 3, .... What is P(max X_i ≤ 2)?

Solution

The only way that the maximum of the Xi will be less than or equal to 2 is if all of the Xi are less than or equal to 2. That is:

P(\max X_i \le 2) = P(X_1 \le 2, X_2 \le 2, X_3 \le 2)

Now, because X1,X2,X3 are a random sample, we know that (1) Xi is independent of Xj, for all ij, and (2) the Xi are identically distributed. Therefore:

P(\max X_i \le 2) = P(X_1 \le 2)\, P(X_2 \le 2)\, P(X_3 \le 2) = [P(X_1 \le 2)]^3

The first equality comes from the independence of the Xi, and the second equality comes from the fact that the Xi are identically distributed. Now, the probability that X1 is less than or equal to 2 is:

P(X_1 \le 2) = P(X_1 = 1) + P(X_1 = 2) = \left(\frac{3}{4}\right)\left(\frac{1}{4}\right)^{1-1} + \left(\frac{3}{4}\right)\left(\frac{1}{4}\right)^{2-1} = \frac{3}{4} + \frac{3}{16} = \frac{15}{16}

Therefore, the probability that the maximum of the Xi is less than or equal to 2 is:

P(\max X_i \le 2) = [P(X_1 \le 2)]^3 = \left(\frac{15}{16}\right)^3 = 0.824
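
A simulation sketch (illustrative, not from the text) of this example: draw many samples of size 3 from the geometric distribution with success probability 3/4 and estimate P(max X_i ≤ 2) directly.

```python
import numpy as np

rng = np.random.default_rng(10)

# Geometric with success probability 3/4; numpy's geometric has support 1, 2, 3, ...
samples = rng.geometric(0.75, size=(200_000, 3))
estimate = np.mean(samples.max(axis=1) <= 2)

print(estimate, (15 / 16) ** 3)  # both should be about 0.824
```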


Lesson 25: The Moment-Generating Function Technique


Overview

In the previous lesson, we learned that the expected value of the sample mean X̄ is the population mean μ. We also learned that the variance of the sample mean X̄ is σ²/n, that is, the population variance divided by the sample size n. We have not yet determined the probability distribution of the sample mean when, say, the random sample comes from a normal distribution with mean μ and variance σ². We are going to tackle that in the next lesson! Before we do that, though, we are going to want to put a few more tools into our toolbox. We already have learned a few techniques for finding the probability distribution of a function of random variables, namely the distribution function technique and the change-of-variable technique. In this lesson, we'll learn yet another technique called the moment-generating function technique. We'll use the technique in this lesson to learn, among other things, the distribution of sums of chi-square random variables. Then, in the next lesson, we'll use the technique to find (finally) the probability distribution of the sample mean when the random sample comes from a normal distribution with mean μ and variance σ².

Objectives

Upon completion of this lesson, you should be able to:

  • To refresh our memory of the uniqueness property of moment-generating functions.
  • To learn how to calculate the moment-generating function of a linear combination of n independent random variables.
  • To learn how to calculate the moment-generating function of a linear combination of n independent and identically distributed random variables.
  • To learn the additive property of independent chi-square random variables.
  • To use the moment-generating function technique to prove the additive property of independent chi-square random variables.
  • To understand the steps involved in each of the proofs in the lesson.
  • To be able to apply the methods learned in the lesson to new problems.

25.1 - Uniqueness Property of M.G.F.s


Recall that the moment generating function:

M_X(t) = E(e^{tX})

uniquely defines the distribution of a random variable. That is, if you can show that the moment generating function of X¯ is the same as some known moment-generating function, then X¯follows the same distribution. So, one strategy to finding the distribution of a function of random variables is:

  1. To find the moment-generating function of the function of random variables
  2. To compare the calculated moment-generating function to known moment-generating functions
  3. If the calculated moment-generating function is the same as some known moment-generating function of X, then the function of the random variables follows the same probability distribution as X

Example 25-1


In the previous lesson, we looked at an example that involved tossing a penny three times and letting X1 denote the number of heads that we get in the three tosses. In the same example, we suggested tossing a second penny two times and letting X2 denote the number of heads we get in those two tosses. We let:

Y=X1+X2

denote the number of heads in five tosses. What is the probability distribution of Y?

Solution

We know that:

  • X1 is a binomial random variable with n=3 and p=1/2
  • X2 is a binomial random variable with n=2 and p=1/2

Therefore, based on what we know of the moment-generating function of a binomial random variable, the moment-generating function of X1 is:

M_{X_1}(t) = \left(\frac{1}{2} + \frac{1}{2} e^t\right)^3

And, similarly, the moment-generating function of X2 is:

M_{X_2}(t) = \left(\frac{1}{2} + \frac{1}{2} e^t\right)^2

Now, because X1 and X2 are independent random variables, the random variable Y is the sum of independent random variables. Therefore, the moment-generating function of Y is:

M_Y(t) = E(e^{tY}) = E\left(e^{t(X_1 + X_2)}\right) = E\left(e^{tX_1} e^{tX_2}\right) = E\left(e^{tX_1}\right) E\left(e^{tX_2}\right)

The first equality comes from the definition of the moment-generating function of the random variable Y. The second equality comes from the definition of Y. The third equality comes from the properties of exponents. And, the fourth equality comes from the expectation of the product of functions of independent random variables. Now, substituting in the known moment-generating functions of X1 and X2, we get:

M_Y(t) = \left(\frac{1}{2} + \frac{1}{2} e^t\right)^3 \left(\frac{1}{2} + \frac{1}{2} e^t\right)^2 = \left(\frac{1}{2} + \frac{1}{2} e^t\right)^5

That is, Y has the same moment-generating function as a binomial random variable with n=5 and p=1/2. Therefore, by the uniqueness property of moment-generating functions, Y must be a binomial random variable with n=5 and p=1/2. (Of course, we already knew that!)

It seems that we could generalize the way in which we calculated, in the above example, the moment-generating function of Y, the sum of two independent random variables. Indeed, we can! On the next page!


25.2 - M.G.F.s of Linear Combinations

25.2 - M.G.F.s of Linear Combinations

Theorem

If X1,X2,,Xn are n independent random variables with respective moment-generating functions MXi(t)=E(etXi) for i=1,2,,n, then the moment-generating function of the linear combination:

Y=i=1naiXi

is:

MY(t)=i=1nMXi(ait)

Proof

The proof is very similar to the calculation we made in the example on the previous page. That is:

MY(t)=E[etY]=E[et(a1X1+a2X2++anXn)]=E[ea1tX1]E[ea2tX2]E[eantXn]=MX1(a1t)MX2(a2t)MXn(ant)=i=1nMXi(ait)

The first equality comes from the definition of the moment-generating function of the random variable Y. The second equality comes from the given definition of Y. The third equality comes from the properties of exponents, as well as from the expectation of the product of functions of independent random variables. The fourth equality comes from the definition of the moment-generating function of the random variables Xi, for i=1,2,,n. And, the fifth equality comes from using product notation to write the product of the moment-generating functions.

While the theorem is useful in its own right, the following corollary is perhaps even more useful when dealing not just with independent random variables, but also random variables that are identically distributed — two characteristics that we get, of course, when we take a random sample.

Corollary

If X1,X2,,Xn are observations of a random sample from a population (distribution) with moment-generating function M(t), then:

  1. The moment generating function of the linear combination Y = Σ_{i=1}^{n} Xi is MY(t) = M(t)·M(t)⋯M(t) = [M(t)]^n.
  2. The moment generating function of the sample mean X¯ = Σ_{i=1}^{n} (1/n)Xi is MX¯(t) = M(t/n)·M(t/n)⋯M(t/n) = [M(t/n)]^n.

Proof

  1. use the preceding theorem with ai = 1 for i = 1, 2, …, n
  2. use the preceding theorem with ai = 1/n for i = 1, 2, …, n

Example 25-2

Let X1,X2, and X3 denote a random sample of size 3 from a gamma distribution with α=7 and θ=5. Let Y be the sum of the three random variables:

Y=X1+X2+X3

What is the distribution of Y?

Solution

The moment-generating function of a gamma random variable X with α=7 and θ=5 is:

MX(t) = 1/(1 − 5t)^7

for t < 1/5. Therefore, the corollary tells us that the moment-generating function of Y is:

MY(t) = [MX1(t)]^3 = (1/(1 − 5t)^7)^3 = 1/(1 − 5t)^21

for t < 1/5, which is the moment-generating function of a gamma random variable with α=21 and θ=5. Therefore, Y must follow a gamma distribution with α=21 and θ=5.

What is the distribution of the sample mean X¯?

Solution

Again, the moment-generating function of a gamma random variable X with α=7 and θ=5 is:

MX(t) = 1/(1 − 5t)^7

for t < 1/5. Therefore, the corollary tells us that the moment-generating function of X¯ is:

MX¯(t) = [MX1(t/3)]^3 = (1/(1 − 5(t/3))^7)^3 = 1/(1 − (5/3)t)^21

for t < 3/5, which is the moment-generating function of a gamma random variable with α=21 and θ=5/3. Therefore, X¯ must follow a gamma distribution with α=21 and θ=5/3.
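As an optional check (a Python sketch, not part of the original text; the seed is arbitrary), we can simulate many samples of size 3 from the gamma(α = 7, θ = 5) population and compare the first two moments of Y and X¯ with those of the gamma(21, 5) and gamma(21, 5/3) distributions found above:

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(seed=2)
n_sim = 200_000

# n_sim samples of size 3 from a gamma population with shape 7 and scale 5
x = rng.gamma(shape=7, scale=5, size=(n_sim, 3))
y = x.sum(axis=1)        # Y = X1 + X2 + X3
xbar = x.mean(axis=1)    # sample mean of each sample

print(y.mean(), y.var())         # should be near 21(5) = 105 and 21(5^2) = 525
print(xbar.mean(), xbar.var())   # should be near 21(5/3) = 35 and 21(5/3)^2, about 58.3
print(gamma(a=21, scale=5).mean(), gamma(a=21, scale=5).var())
print(gamma(a=21, scale=5/3).mean(), gamma(a=21, scale=5/3).var())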


25.3 - Sums of Chi-Square Random Variables

25.3 - Sums of Chi-Square Random Variables

We'll now turn our attention towards applying the theorem and corollary of the previous page to the case in which we have a function involving a sum of independent chi-square random variables. The following theorem is often referred to as the "additive property of independent chi-squares."

Theorem

Let Xi denote n independent random variables that follow these chi-square distributions:

  • X1χ2(r1)
  • X2χ2(r2)
  • Xnχ2(rn)

Then, the sum of the random variables:

Y=X1+X2++Xn

follows a chi-square distribution with r1+r2++rn degrees of freedom. That is:

Yχ2(r1+r2++rn)

Proof

https://www.youtube.com/watch/Cb3b5gFqLRU [7]
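In case the video is unavailable, here is a sketch of the argument; it uses only the fact that a chi-square(r) random variable has moment-generating function (1 − 2t)^(−r/2) for t < 1/2. Because X1, X2, …, Xn are independent, the theorem on the previous page (with each ai = 1) gives:

MY(t) = MX1(t) MX2(t) ⋯ MXn(t) = (1 − 2t)^(−r1/2) (1 − 2t)^(−r2/2) ⋯ (1 − 2t)^(−rn/2) = (1 − 2t)^(−(r1 + r2 + ⋯ + rn)/2)

for t < 1/2.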

We have shown that MY(t) is the moment-generating function of a chi-square random variable with r1+r2++rn degrees of freedom. That is:

Yχ2(r1+r2++rn)

as was to be shown.

Theorem

Let Z1,Z2,,Zn have standard normal distributions, N(0,1). If these random variables are independent, then:

W=Z12+Z22++Zn2

follows a χ2(n) distribution.

Proof

Recall that if ZiN(0,1), then Zi2χ2(1) for i=1,2,,n. Then, by the additive property of independent chi-squares:

W=Z12+Z22++Zn2χ2(1+1++1)=χ2(n)

That is, Wχ2(n), as was to be proved.

Corollary

If X1,X2,,Xn are independent normal random variables with different means and variances, that is:

XiN(μi,σi2)

for i=1,2,,n. Then:

W=i=1n(Xiμi)2σi2χ2(n)

Proof

Recall that:

Zi=(Xiμi)σiN(0,1)

Therefore:

W=i=1nZi2=i=1n(Xiμi)2σi2χ2(n)

as was to be proved.
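As a quick, optional numerical check of this corollary (a Python sketch with arbitrarily chosen means, variances, and seed; not part of the original text):

import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(seed=3)
n, n_sim = 5, 100_000

# five independent normals with (arbitrarily chosen) different means and variances
mus = np.array([0.0, 1.0, -2.0, 5.0, 10.0])
sigmas = np.array([1.0, 2.0, 0.5, 3.0, 1.5])

x = rng.normal(loc=mus, scale=sigmas, size=(n_sim, n))
w = (((x - mus) / sigmas) ** 2).sum(axis=1)   # W = sum of squared standardized normals

print(w.mean(), w.var())            # chi-square(5) has mean 5 and variance 10
print(kstest(w, chi2(df=n).cdf))    # a large p-value is consistent with chi-square(5)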


Lesson 26: Random Functions Associated with Normal Distributions

Lesson 26: Random Functions Associated with Normal Distributions

Overview

In the previous lessons, we've been working our way up towards fully defining the probability distribution of the sample mean X¯ and the sample variance S2. We have determined the expected value and variance of the sample mean. Now, in this lesson, we (finally) determine the probability distribution of the sample mean and sample variance when a random sample X1,X2,,Xn is taken from a normal population (distribution). We'll also learn about a new probability distribution called the (Student's) t distribution.

Objectives

Upon completion of this lesson, you should be able to:

  • To learn the probability distribution of a linear combination of independent normal random variables X1,X2,,Xn.
  • To learn how to find the probability that a linear combination of independent normal random variables X1,X2,,Xn takes on a certain interval of values.
  • To learn the sampling distribution of the sample mean when X1,X2,,Xn are a random sample from a normal population with mean μ and variance σ2.
  • To use simulation to get a feel for the shape of a probability distribution.
  • To learn the sampling distribution of the sample variance when X1,X2,,Xn are a random sample from a normal population with mean μ and variance σ2.
  • To learn the formal definition of a T random variable.
  • To learn the characteristics of Student's t distribution.
  • To learn how to read a t-table to find t-values and probabilities associated with t-values.
  • To understand each of the steps in the proofs in the lesson.
  • To be able to apply the methods learned in this lesson to new problems.

26.1 - Sums of Independent Normal Random Variables

26.1 - Sums of Independent Normal Random Variables

Well, we know that one of our goals for this lesson is to find the probability distribution of the sample mean when a random sample is taken from a population whose measurements are normally distributed. Then, let's just get right to the punch line! Well, first we'll work on the probability distribution of a linear combination of independent normal random variables X1,X2,,Xn. On the next page, we'll tackle the sample mean!

Theorem

If X1, X2, …, Xn are mutually independent normal random variables with means μ1, μ2, …, μn and variances σ1², σ2², …, σn², then the linear combination:

Y=i=1nciXi

follows the normal distribution:

N(i=1nciμi,i=1nci2σi2)

Proof

We'll use the moment-generating function technique to find the distribution of Y. In the previous lesson, we learned that the moment-generating function of a linear combination of independent random variables X1, X2, …, Xn is:

MY(t)=i=1nMXi(cit)

Now, recall that if XiN(μ,σ2), then the moment-generating function of Xi is:

MXi(t)=exp(μt+σ2t22)

Therefore, the moment-generating function of Y is:

MY(t)=i=1nMXi(cit)=i=1nexp[μi(cit)+σi2(cit)22]

Evaluating the product at each index i from 1 to n, and using what we know about exponents, we get:

MY(t)=exp(μ1c1t)exp(μ2c2t)exp(μncnt)exp(σ12c12t22)exp(σ22c22t22)exp(σn2cn2t22)

Again, using what we know about exponents, and rewriting what we have using summation notation, we get:

MY(t)=exp[t(i=1nciμi)+t22(i=1nci2σi2)]

Ahaaa! We have just shown that the moment-generating function of Y is the same as the moment-generating function of a normal random variable with mean:

i=1nciμi

and variance:

i=1nci2σi2

Therefore, by the uniqueness property of moment-generating functions, Y must be normally distributed with the said mean and said variance. Our proof is complete.

Example 26-1

Let X1 be a normal random variable with mean 2 and variance 3, and let X2 be a normal random variable with mean 1 and variance 4. Assume that X1 and X2 are independent. What is the distribution of the linear combination Y=2X1+3X2?

Solution

The previous theorem tells us that Y is normally distributed with mean 7 and variance 48 as the following calculation illustrates:

2X1 + 3X2 ~ N(2(2) + 3(1), 2²(3) + 3²(4)) = N(7, 48)

What is the distribution of the linear combination Y=X1X2?

Solution

The previous theorem tells us that Y is normally distributed with mean 1 and variance 7 as the following calculation illustrates:

X1 − X2 ~ N(2 − 1, (1)²(3) + (−1)²(4)) = N(1, 7)

Example 26-2


History suggests that scores on the Math portion of the Standard Achievement Test (SAT) are normally distributed with a mean of 529 and a variance of 5732. History also suggests that scores on the Verbal portion of the SAT are normally distributed with a mean of 474 and a variance of 6368. Select two students at random. Let X denote the first student's Math score, and let Y denote the second student's Verbal score. What is P(X>Y)?

Solution

We can find the requested probability by noting that P(X>Y)=P(XY>0), and then taking advantage of what we know about the distribution of XY. That is, XY is normally distributed with a mean of 55 and variance of 12100 as the following calculation illustrates:

X − Y ~ N(529 − 474, (1)²(5732) + (−1)²(6368)) = N(55, 12100)

Then, finding the probability that X is greater than Y reduces to a normal probability calculation:

P(X > Y) = P(X − Y > 0) = P(Z > (0 − 55)/√12100) = P(Z > −1/2) = P(Z < 1/2) = 0.6915

That is, the probability that the first student's Math score is greater than the second student's Verbal score is 0.6915.
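The same answer can be obtained with a couple of lines of software, for example with Python and scipy (a sketch, not part of the original text):

import numpy as np
from scipy.stats import norm

mu_diff = 529 - 474              # mean of X - Y
sd_diff = np.sqrt(5732 + 6368)   # standard deviation of X - Y; sqrt(12100) = 110

# P(X > Y) = P(X - Y > 0) = 1 - Phi((0 - 55)/110)
print(round(1 - norm.cdf((0 - mu_diff) / sd_diff), 4))   # 0.6915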

Example 26-3


Let Xi denote the weight of a randomly selected prepackaged one-pound bag of carrots. Of course, one-pound bags of carrots won't weigh exactly one pound. In fact, history suggests that Xi is normally distributed with a mean of 1.18 pounds and a standard deviation of 0.07 pound.

Now, let W denote the weight of a randomly selected prepackaged three-pound bag of carrots. Three-pound bags of carrots won't weigh exactly three pounds either. In fact, history suggests that W is normally distributed with a mean of 3.22 pounds and a standard deviation of 0.09 pound.

Selecting bags at random, what is the probability that the sum of three one-pound bags exceeds the weight of one three-pound bag?

Solution

Because the bags are selected at random, we can assume that X1,X2,X3 and W are mutually independent. The theorem helps us determine the distribution of Y, the sum of three one-pound bags:

Y = X1 + X2 + X3 ~ N(1.18 + 1.18 + 1.18, 0.07² + 0.07² + 0.07²) = N(3.54, 0.0147)

That is, Y is normally distributed with a mean of 3.54 pounds and a variance of 0.0147. Now, YW, the difference in the weight of three one-pound bags and one three-pound bag is normally distributed with a mean of 0.32 and a variance of 0.0228, as the following calculation suggests:

Y − W ~ N(3.54 − 3.22, (1)²(0.0147) + (−1)²(0.09²)) = N(0.32, 0.0228)

Therefore, finding the probability that Y is greater than W reduces to a normal probability calculation:

P(Y > W) = P(Y − W > 0) = P(Z > (0 − 0.32)/√0.0228) = P(Z > −2.12) = P(Z < 2.12) = 0.9830

That is, the probability that the sum of three one-pound bags exceeds the weight of one three-pound bag is 0.9830. Hey, if you want more bang for your buck, it looks like you should buy multiple one-pound bags of carrots, as opposed to one three-pound bag!


26.2 - Sampling Distribution of Sample Mean

26.2 - Sampling Distribution of Sample Mean

Okay, we finally tackle the probability distribution (also known as the "sampling distribution") of the sample mean when X1,X2,,Xn are a random sample from a normal population with mean μ and variance σ2. The word "tackle" is probably not the right choice of word, because the result follows quite easily from the previous theorem, as stated in the following corollary.

Corollary

If X1,X2,,Xn are observations of a random sample of size n from a N(μ,σ2) population, then the sample mean:

X¯=1ni=1nXi

is normally distributed with mean μ and variance σ2n. That is, the probability distribution of the sample mean is:

N(μ,σ2/n)

Proof

The result follows directly from the previous theorem. All we need to do is recognize that the sample mean:

X¯=X1+X2++Xnn

is a linear combination of independent normal random variables:

X¯=1nX1+1nX2++1nXn

with ci = 1/n, the mean μi = μ and the variance σi² = σ². That is, the moment generating function of the sample mean is then:

MX¯(t)=exp[t(i=1nciμi)+t22(i=1nci2σi2)]=exp[t(i=1n1nμ)+t22(i=1n(1n)2σ2)]

The first equality comes from the theorem on the previous page, about the distribution of a linear combination of independent normal random variables. The second equality comes from simply replacing ci with 1n, the mean μi with μ and the variance σi2 with σ2. Now, working on the summations, the moment generating function of the sample mean reduces to:

MX¯(t)=exp[t(1ni=1nμ)+t22(1n2i=1nσ2)]=exp[t(1n(nμ))+t22(1n2(nσ2))]=exp[μt+t22(σ2n)]

The first equality comes from pulling the constants depending on n through the summation signs. The second equality comes from adding μ up n times to get nμ, and adding σ2 up n times to get nσ2. The last equality comes from simplifying a bit more. In summary, we have shown that the moment generating function of the sample mean of n independent normal random variables with mean μ and variance σ2 is:

MX¯(t)=exp[μt+t22(σ2n)]

That is the same as the moment generating function of a normal random variable with mean μ and variance σ2n. Therefore, the uniqueness property of moment-generating functions tells us that the sample mean must be normally distributed with mean μ and variance σ2n. Our proof is complete.

Example 26-4

Let Xi denote the Stanford-Binet Intelligence Quotient (IQ) of a randomly selected individual, i=1,,4 (one sample). Let Yi denote the IQ of a randomly selected individual, i=1,,8 (a second sample). Recalling that IQs are normally distributed with mean μ=100 and variance σ2=162, what is the distribution of X¯? And, what is the distribution of Y¯?

Answer

In general, the variance of the sample mean is:

Var(X¯) = σ²/n

Therefore, the variance of the sample mean of the first sample is:

Var(X¯4) = 16²/4 = 64

(The subscript 4 is there just to remind us that the sample mean is based on a sample of size 4.) And, the variance of the sample mean of the second sample is:

Var(Y¯8) = 16²/8 = 32

(The subscript 8 is there just to remind us that the sample mean is based on a sample of size 8.) Now, the corollary therefore tells us that the sample mean of the first sample is normally distributed with mean 100 and variance 64. That is:

X¯4N(100,64)

And, the sample mean of the second sample is normally distributed with mean 100 and variance 32. That is:

Y¯8N(100,32)

So, we have two, no actually, three normal random variables with the same mean, but different variances:

  • We have Xi, an IQ of a random individual. It is normally distributed with mean 100 and variance 256.
  • We have X¯4, the average IQ of 4 random individuals. It is normally distributed with mean 100 and variance 64.
  • We have Y¯8, the average IQ of 8 random individuals. It is normally distributed with mean 100 and variance 32.

It is quite informative to graph these three distributions on the same plot. Doing so, we get:

[Plot: normal density curves of an individual IQ (n=1), X¯4 (n=4), and Y¯8 (n=8), all centered at 100]

As the plot suggests, an individual Xi, the mean X¯4, and the mean Y¯8 all provide valid, "unbiased" estimates of the population mean μ. But, our intuition coincides with reality... that is, the sample mean Y¯8 will be the most precise estimate of μ.

All the work that we have done so far concerning this example has been theoretical in nature. That is, what we have learned is based on probability theory. Would we see the same kind of result if we were to take a large number of samples, say 1000, of size 4 and 8, and calculate the sample mean of each sample? That is, would the distribution of the 1000 sample means based on a sample of size 4 look like a normal distribution with mean 100 and variance 64? And would the distribution of the 1000 sample means based on a sample of size 8 look like a normal distribution with mean 100 and variance 32? Well, the only way to answer these questions is to try it out!

I did just that for us. I used Minitab to generate 1000 samples of eight random numbers from a normal distribution with mean 100 and variance 256. Here's a subset of the resulting random numbers:

 

ROW X1 X2 X3 X4 X5 X6 X7 X8 Mean 4 Mean 8
1 87 68 98 114 59 111 114 86 91.75 92.125
2 102 81 74 110 112 106 105 99 91.75 98.625
3 96 87 50 88 69 107 94 83 80.25 84.250
4 83 134 122 80 117 110 115 158 104.75 114.875
5 92 87 120 93 90 111 95 92 98.00 97.500
6 139 102 100 103 111 62 78 73 111.00 96.000
7 134 121 99 118 108 106 103 91 118.00 110.000
8 126 92 148 131 99 106 143 128 124.25 121.625
9 98 109 119 110 124 99 119 82 109.00 107.500
10 85 93 82 106 93 109 100 95 91.50 95.375
11 121 103 108 96 112 117 93 112 107.00 107.750
12 118 91 106 108 128 96 65 85 105.75 99.625
13 92 87 96 81 86 105 91 104 89.00 92.750
14 94 115 59 105 101 122 97 103 93.25 99.500
 
 ...and so on... 
 
975 108 139 130 97 138 88 104 87 118.50 111.375
976 99 122 93 107 98 62 102 115 105.25 99.750
977 99 127 91 101 127 79 81 121 104.50 103.250
978 120 108 101 104 90 90 191 104 108.25 101.000
979 101 93 106 113 115 82 96 97 103.25 100.375
980 118 86 74 95 109 111 90 83 93.25 95.750
981 118 95 121 124 111 90 105 112 114.50 109.500
982 110 121 85 117 91 84 84 108 108.25 100.000
983 95 109 118 112 121 105 84 115 108.50 107.375
984 102 105 127 104 95 101 106 103 109.50 105.375
985 116 93 112 102 67 92 103 114 105.75 99.875
986 106 97 114 82 82 108 113 81 99.75 97.875
987 107 93 78 91 83 81 115 102 92.25 93.750
988 106 115 105 74 86 124 97 116 100.00 102.875
989 117 84 131 102 92 118 90 90 108.50 103.000
990 100 69 108 128 111 110 94 95 101.25 101.875
991 86 85 123 94 104 89 76 97 97.00 94.250
992 94 90 72 121 105 150 72 88 94.25 99.000
993 70 109 104 114 93 103 126 99 99.25 102.250
994 102 110 98 93 64 131 91 95 100.75 98.000
995 80 135 120 92 118 119 66 117 106.75 105.875
996 81 102 88 98 113 81 95 110 92.25 96.000
997 85 146 73 133 111 88 92 74 109.25 100.250
998 94 109 110 115 95 93 90 103 107.00 101.125
999 84 84 97 125 92 89 95 124 97.50 98.750
1000 77 60 113 106 107 109 110 103 89.00 98.125

As you can see, the second last column, titled Mean4, is the average of the first four columns X1 X2, X3, and X4. The last column, titled Mean8, is the average of the first eight columns X1, X2, X3, X4, X5, X6, X7, and X8. Now, all we have to do is create a histogram of the sample means appearing in the Mean4 column:

[Histogram of the 1000 values in the Mean 4 column (frequency vs. sample mean, n=4)]

Ahhhh! The histogram sure looks fairly bell-shaped, making the normal distribution a real possibility. Now, recall that the Empirical Rule tells us that we should expect, if the sample means are normally distributed, that almost all of the sample means would fall within three standard deviations of the population mean. That is, in the case of Mean4, we should expect almost all of the data to fall between 76 (from 100−3(8)) and 124 (from 100+3(8)). It sure looks like that's the case!

Let's do the same thing for the Mean8 column. That is, let's create a histogram of the sample means appearing in the Mean8 column. Doing so, we get:

[Histogram of the 1000 values in the Mean 8 column (frequency vs. sample mean, n=8)]

Again, the histogram sure looks fairly bell-shaped, making the normal distribution a real possibility. In this case, the Empirical Rule tells us that, in the case of Mean8, we should expect almost all of the data to fall between 83 (from 100 − 3√32) and 117 (from 100 + 3√32). It too looks pretty good on both sides, although it seems that there were two really extreme sample means of size 8. (If you look back at the data, you can see one of them in the eighth row.)

In summary, the whole point of this exercise was to use the theory to help us derive the distribution of the sample mean of IQs, and then to use real simulated normal data to see if our theory worked in practice. I think we can conclude that it does!
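The simulation above was done in Minitab. For readers who prefer code, here is a rough Python equivalent (the seed and variable names are arbitrary choices; not part of the original text):

import numpy as np

rng = np.random.default_rng(seed=4)

# 1000 samples of size 8 from N(100, 256); the standard deviation is 16
samples = rng.normal(loc=100, scale=16, size=(1000, 8))
mean4 = samples[:, :4].mean(axis=1)   # the "Mean 4" column: average of the first four values
mean8 = samples.mean(axis=1)          # the "Mean 8" column: average of all eight values

# Theory says Var(mean4) should be near 256/4 = 64 and Var(mean8) near 256/8 = 32
print(mean4.mean(), mean4.var())
print(mean8.mean(), mean8.var())

# Empirical Rule check: nearly all sample means should be within 3 standard deviations of 100
print(np.mean(np.abs(mean4 - 100) <= 3 * 8))
print(np.mean(np.abs(mean8 - 100) <= 3 * np.sqrt(32)))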


26.3 - Sampling Distribution of Sample Variance

26.3 - Sampling Distribution of Sample Variance

Now that we've got the sampling distribution of the sample mean down, let's turn our attention to finding the sampling distribution of the sample variance. The following theorem will do the trick for us!

Theorem

Suppose:

  • X1, X2, …, Xn are observations of a random sample of size n from the normal distribution N(μ, σ²),
  • X¯=1ni=1nXi is the sample mean of the n observations, and
  • S2=1n1i=1n(XiX¯)2 is the sample variance of the n observations.

Then:

  1. X¯and S2 are independent
  2. (n1)S2σ2=i=1n(XiX¯)2σ2χ2(n1)

Proof

The proof of number 1 is quite easy. Errr, actually not! It is quite easy in this course, because it is beyond the scope of the course. So, we'll just have to state it without proof.

Now for proving number 2. This is one of those proofs that you might have to read through twice... perhaps reading it the first time just to see where we're going with it, and then, if necessary, reading it again to capture the details. We're going to start with a function which we'll call W:

W=i=1n(Xiμσ)2

Now, we can take W and do the trick of adding 0 to each term in the summation. Doing so, of course, doesn't change the value of W:

W=i=1n((XiX¯)+(X¯μ)σ)2

As you can see, we added 0 by adding and subtracting the sample mean to the quantity in the numerator. Now, let's square the term. Doing just that, and distributing the summation, we get:

W=i=1n(XiX¯σ)2+i=1n(X¯μσ)2+2(X¯μσ2)i=1n(XiX¯)

But the last term is 0:

W = Σ_{i=1}^{n} ((Xi − X¯)/σ)² + Σ_{i=1}^{n} ((X¯ − μ)/σ)² + 2((X¯ − μ)/σ²) Σ_{i=1}^{n} (Xi − X¯), and the last sum is 0, since Σ(Xi − X¯) = nX¯ − nX¯ = 0

so, W reduces to:

W=i=1n(XiX¯)2σ2+n(X¯μ)2σ2

We can do a bit more with the first term of W. As an aside, if we take the definition of the sample variance:

S2=1n1i=1n(XiX¯)2

and multiply both sides by (n1), we get:

(n1)S2=i=1n(XiX¯)2

So, the numerator in the first term of W can be written as a function of the sample variance. That is:

W=i=1n(Xiμσ)2=(n1)S2σ2+n(X¯μ)2σ2

Okay, let's take a break here to see what we have. We've taken the quantity on the left side of the above equation, added 0 to it, and showed that it equals the quantity on the right side. Now, what can we say about each of the terms. Well, the term on the left side of the equation:

i=1n(Xiμσ)2

is a sum of n independent chi-square(1) random variables. That's because we have assumed that X1,X2,,Xn are observations of a random sample of size n from the normal distribution N(μ,σ2). Therefore:

Xiμσ

follows a standard normal distribution. Now, recall that if we square a standard normal random variable, we get a chi-square random variable with 1 degree of freedom. So, again:

i=1n(Xiμσ)2

is a sum of n independent chi-square(1) random variables. Our work from the previous lesson then tells us that the sum is a chi-square random variable with n degrees of freedom. Therefore, the moment-generating function of W is the same as the moment-generating function of a chi-square(n) random variable, namely:

MW(t) = (1 − 2t)^(−n/2)

for t < 1/2. Now, the second term of W, on the right side of the equals sign, that is:

n(X¯ − μ)²/σ²

is a chi-square(1) random variable. That's because the sample mean is normally distributed with mean μ and variance σ²/n. Therefore:

Z = (X¯ − μ)/(σ/√n) ~ N(0, 1)

is a standard normal random variable. So, if we square Z, we get a chi-square random variable with 1 degree of freedom:

Z² = n(X¯ − μ)²/σ² ~ χ²(1)

And therefore the moment-generating function of Z² is:

MZ²(t) = (1 − 2t)^(−1/2)

for t < 1/2. Let's summarize again what we know so far. W is a chi-square(n) random variable, and the second term on the right is a chi-square(1) random variable:

W = (n−1)S²/σ² + Z², where W ~ χ²(n) and Z² ~ χ²(1)

Now, let's use the uniqueness property of moment-generating functions. By definition, the moment-generating function of W is:

MW(t) = E(e^{tW}) = E[e^{t((n−1)S²/σ² + Z²)}]

Using what we know about exponents, we can rewrite the term in the expectation as a product of two exponent terms:

E(e^{tW}) = E[e^{t(n−1)S²/σ²} · e^{tZ²}] = M_{(n−1)S²/σ²}(t) · M_{Z²}(t)

The last equality in the above equation comes from the independence between X¯ and S². That is, if they are independent, then functions of them are independent. Now, let's substitute in what we know about the moment-generating function of W and of Z². Doing so, we get:

(1 − 2t)^(−n/2) = M_{(n−1)S²/σ²}(t) · (1 − 2t)^(−1/2)

Now, let's solve for the moment-generating function of (n−1)S²/σ², whose distribution we are trying to determine. Doing so, we get:

M_{(n−1)S²/σ²}(t) = (1 − 2t)^(−n/2) · (1 − 2t)^(1/2)

Adding the exponents, we get:

M_{(n−1)S²/σ²}(t) = (1 − 2t)^(−(n−1)/2)

for t < 1/2. But, oh, that's the moment-generating function of a chi-square random variable with n−1 degrees of freedom. Therefore, the uniqueness property of moment-generating functions tells us that (n−1)S²/σ² must be a chi-square random variable with n−1 degrees of freedom. That is:

(n−1)S²/σ² = Σ_{i=1}^{n} (Xi − X¯)²/σ² ~ χ²(n−1)

as was to be proved! And, to just think that this was the easier of the two proofs

Before we take a look at an example involving simulation, it is worth noting that in the last proof, we proved that, when sampling from a normal distribution:

Σ_{i=1}^{n} (Xi − μ)²/σ² ~ χ²(n)

but:

Σ_{i=1}^{n} (Xi − X¯)²/σ² = (n−1)S²/σ² ~ χ²(n−1)

The only difference between these two summations is that in the first case, we are summing the squared differences from the population mean μ, while in the second case, we are summing the squared differences from the sample mean X¯. What happens is that when we estimate the unknown population mean μ with X¯, we "lose" one degree of freedom. This is generally true... a degree of freedom is lost for each parameter estimated in certain chi-square random variables.

Example 26-5

Let's return to our example concerning the IQs of randomly selected individuals. Let Xi denote the Stanford-Binet Intelligence Quotient (IQ) of a randomly selected individual, i=1,,8. Recalling that IQs are normally distributed with mean μ=100 and variance σ2=162, what is the distribution of (n1)S2σ2?

Solution

Because the sample size is n=8, the above theorem tells us that:

(8−1)S²/σ² = 7S²/σ² = Σ_{i=1}^{8} (Xi − X¯)²/σ²

follows a chi-square distribution with 7 degrees of freedom. Here's what the theoretical density function would look like:

[Plot: chi-square(7) density curve]

Again, all the work that we have done so far concerning this example has been theoretical in nature. That is, what we have learned is based on probability theory. Would we see the same kind of result if we were to take a large number of samples, say 1000, of size 8, and calculate:

Σ_{i=1}^{8} (Xi − X¯)²/256

for each sample? That is, would the distribution of the 1000 resulting values of the above function look like a chi-square(7) distribution? Again, the only way to answer this question is to try it out! I did just that for us. I used Minitab to generate 1000 samples of eight random numbers from a normal distribution with mean 100 and variance 256. Here's a subset of the resulting random numbers:

[Minitab worksheet of the simulated random numbers (click to enlarge) [8]]

As you can see, the last column, titled FnofSsq (for function of sums of squares), contains the calculated value of:

Σ_{i=1}^{8} (Xi − X¯)²/256

based on the random numbers generated in columns X1 X2, X3, X4, X5, X6, X7, and X8. For example, given that the average of the eight numbers in the first row is 98.625, the value of FnofSsq in the first row is:

(1/256)[(98 − 98.625)² + (77 − 98.625)² + ⋯ + (91 − 98.625)²] = 5.7651

Now, all we have to do is create a histogram of the values appearing in the FnofSsq column. Doing so, we get:

[Density histogram of the 1000 FnofSsq values]

Hmm! The histogram sure looks eerily similar to that of the density curve of a chi-square random variable with 7 degrees of freedom. It looks like the practice is meshing with the theory!
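Again, the text's simulation used Minitab; a rough Python equivalent (arbitrary seed and names, not part of the original text) is:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(seed=5)

samples = rng.normal(loc=100, scale=16, size=(1000, 8))    # 1000 samples of size 8
xbar = samples.mean(axis=1, keepdims=True)
fn_of_ssq = ((samples - xbar) ** 2).sum(axis=1) / 256      # (n-1)S^2 / sigma^2 for each sample

# The 1000 values should behave like chi-square(7) draws: mean 7, variance 14
print(fn_of_ssq.mean(), fn_of_ssq.var())
print(chi2(df=7).mean(), chi2(df=7).var())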


26.4 - Student's t Distribution

26.4 - Student's t Distribution

We have just one more topic to tackle in this lesson, namely, Student's t distribution. Let's just jump right in and define it!

Definition. If ZN(0,1) and Uχ2(r) are independent, then the random variable:

T = Z/√(U/r)

follows a t-distribution with r degrees of freedom. We write T ~ t(r). The p.d.f. of T is:

f(t) = Γ((r+1)/2) / [√(πr) Γ(r/2)] · 1/(1 + t²/r)^((r+1)/2)

for −∞ < t < ∞.

By the way, the t distribution was first discovered by a man named W.S. Gosset. He discovered the distribution when working for an Irish brewery. Because he published under the pseudonym Student, the t distribution is often called Student's t distribution.

History aside, the above definition is probably not particularly enlightening. Let's try to get a feel for the t distribution by way of simulation. Let's randomly generate 1000 standard normal values (Z) and 1000 chi-square(3) values (U). Then, the above definition tells us that, if we take those randomly generated values, calculate:

T = Z/√(U/3)

and create a histogram of the 1000 resulting T values, we should get a histogram that looks like a t distribution with 3 degrees of freedom. Well, here's a subset of the resulting values from one such simulation:

ROW Z CHISQ (3) T(3)
1 -2.60481 10.2497 -1.4092
2 2.92321 1.6517 3.9396
3 -0.48633 0.1757 -2.0099
4 -0.48212 3.8283 -0.4268
5 -0.04150 0.2422 -0.1461
6 -0.84225 0.0903 -4.8544
7 -0.31205 1.6326 -0.4230
8 1.33068 5.2224 1.0086
9 -0.64104 0.9401 -1.1451
10 -0.05110 2.2632 -0.0588
11 1.61601 4.6566 1.2971
12 0.81522 2.1738 0.9577
13 0.38501 1.8404 0.4916
14 -1.63426 1.1265 -2.6669
 
...and so on...
 
994 -0.18942 3.5202 -0.1749
995 0.43078 3.3585 0.4071
996 -0.14068 0.6236 -0.3085
997 -1.76357 2.6188 -1.8876
998 -1.02310 3.2470 -0.9843
999 -0.93777 1.4991 -1.3266
1000 -0.37665 2.1231 -0.4477

 

Note, for example, in the first row:

T(3) = −2.60481/√(10.2497/3) = −1.4092

 

Here's what the resulting histogram of the 1000 randomly generated T(3) values looks like, with a standard N(0,1) curve superimposed:

[Histogram of the 1000 randomly generated T(3) values, with a N(0,1) curve (mean 0, standard deviation 1) superimposed]

 

Hmmm. The t-distribution seems to be quite similar to the standard normal distribution. Using the formula given above for the p.d.f. of T, we can plot the density curve of various t random variables, say when r=1,r=4, and r=7, to see that that is indeed the case:

[Plot: density curves of t(1), t(4), t(7), and N(0,1) on the same axes]

 

In fact, it looks as if, as the degrees of freedom r increases, the t density curve gets closer and closer to the standard normal curve. Let's summarize what we've learned in our little investigation about the characteristics of the t distribution:

  1. The support appears to be −∞ < t < ∞. (It is!)
  2. The probability distribution appears to be symmetric about t=0. (It is!)
  3. The probability distribution appears to be bell-shaped. (It is!)
  4. The density curve looks like a standard normal curve, but the tails of the t-distribution are "heavier" than the tails of the normal distribution. That is, we are more likely to get extreme t-values than extreme z-values.
  5. As the degrees of freedom r increases, the t-distribution appears to approach the standard normal z-distribution. (It does!)

As you'll soon see, we'll need to look up t-values, as well as probabilities concerning T random variables, quite often in Stat 415. Therefore, we better make sure we know how to read a t table.

The t Table

If you take a look at Table VI in the back of your textbook, you'll find what looks like a typical t table. Here's what the top of Table VI looks like (well, minus the shading that I've added):

[Figure: t density curves illustrating the lower-tail probability P(T ≤ t) and the corresponding upper-tail probability]

Table VI: The t Distribution

P(T ≤ t) = ∫_{−∞}^{t} Γ[(r+1)/2] / [√(πr) Γ(r/2)] · (1 + w²/r)^(−(r+1)/2) dw

P(T ≤ −t) = 1 − P(T ≤ t)

P(T ≤ t)
  0.60 0.75 0.90 0.95 0.975 0.99 0.995
r t0.40(r) t0.25(r) t0.10(r) t0.05(r) t0.025(r) t0.01(r) t0.005(r)
1 0.325 1.000 3.078 6.314 12.706 31.821 63.657
2 0.289 0.816 1.886 2.920 4.303 6.965 9.925
3 0.277 0.765 1.638 2.353 3.182 4.541 5.841
4 0.271 0.741 1.533 2.132 2.776 3.747 4.604
5 0.267 0.727 1.476 2.015 2.571 3.365 4.032
               
6 0.265 0.718 1.440 1.943 2.447 3.143 3.707
7 0.263 0.711 1.415 1.895 2.365 2.998 3.499
8 0.262 0.706 1.397 1.860 2.306 2.896 3.355
9 0.261 0.703 1.383 1.833 2.262 2.821 3.250
10 0.260 0.700 1.372 1.812 2.228 2.764 3.169

 

The t-table is similar to the chi-square table in that the inside of the t-table (shaded in purple) contains the t-values for various cumulative probabilities (shaded in red), such as 0.60, 0.75, 0.90, 0.95, 0.975, 0.99, and 0.995, and for various t distributions with r degrees of freedom (shaded in blue). The row shaded in green indicates the upper α probability that corresponds to the 1α cumulative probability. For example, if you're interested in either a cumulative probability of 0.60, or an upper probability of 0.40, you'll want to look for the t-value in the first column.

Let's use the t-table to read a few probabilities and t-values off of the table, by taking a look at a few examples.

Example 26-6

Let T follow a t-distribution with r=8 df. What is the probability that the absolute value of T is less than 2.306?

Solution

The probability calculation is quite similar to a calculation we'd have to make for a normal random variable. First, rewriting the probability in terms of T instead of the absolute value of T, we get:

P(|T| < 2.306) = P(−2.306 < T < 2.306)

Then, we have to rewrite the probability in terms of cumulative probabilities that we can actually find, that is:

P(|T| < 2.306) = P(T < 2.306) − P(T < −2.306)

 

Pictorially, the probability we are looking for looks something like this:

[Plot: t(8) density curve with the area between −2.306 and 2.306 shaded]

But the t-table doesn't contain negative t-values, so we'll have to take advantage of the symmetry of the T distribution. That is:

P(|T| < 2.306) = P(T < 2.306) − P(T > 2.306)

 
Can you find the necessary t-values on the t-table?
P(Tt)
  0.60 0.75 0.90 0.95 0.975 0.99 0.995
r t0.40(r) t0.25(r) t0.10(r) t0.05(r) t0.025(r) t0.01(r) t0.005(r)
1 0.325 1.000 3.078 6.314 12.706 31.821 63.657
2 0.289 0.816 1.886 2.920 4.303 6.965 9.925
3 0.277 0.765 1.638 2.353 3.182 4.541 5.841
4 0.271 0.741 1.533 2.132 2.776 3.747 4.604
5 0.267 0.727 1.476 2.015 2.571 3.365 4.032
               
6 0.265 0.718 1.440 1.943 2.447 3.143 3.707
7 0.263 0.711 1.415 1.895 2.365 2.998 3.499
8 0.262 0.706 1.397 1.860 2.306 2.896 3.355
9 0.261 0.703 1.383 1.833 2.262 2.821 3.250
10 0.260 0.700 1.372 1.812 2.228 2.764 3.169

The t-table tells us that P(T<2.306)=0.975 and P(T>2.306)=0.025.  Therefore:

P(|T| < 2.306) = 0.975 − 0.025 = 0.95

What is t0.05(8)?

Solution

The value t0.05(8) is the value t0.05 such that the probability that a T random variable with 8 degrees of freedom is greater than the value t0.05 is 0.05. That is:

[Plot: t(8) density curve with the upper-tail area of 0.05 to the right of t0.05 shaded]

Can you find the value t0.05 on the t-table?

P(Tt)
  0.60 0.75 0.90 0.95 0.975 0.99 0.995
r t0.40(r) t0.25(r) t0.10(r) t0.05(r) t0.025(r) t0.01(r) t0.005(r)
1 0.325 1.000 3.078 6.314 12.706 31.821 63.657
2 0.289 0.816 1.886 2.920 4.303 6.965 9.925
3 0.277 0.765 1.638 2.353 3.182 4.541 5.841
4 0.271 0.741 1.533 2.132 2.776 3.747 4.604
5 0.267 0.727 1.476 2.015 2.571 3.365 4.032
               
6 0.265 0.718 1.440 1.943 2.447 3.143 3.707
7 0.263 0.711 1.415 1.895 2.365 2.998 3.499
8 0.262 0.706 1.397 1.860 2.306 2.896 3.355
9 0.261 0.703 1.383 1.833 2.262 2.821 3.250
10 0.260 0.700 1.372 1.812 2.228 2.764 3.169

We have determined that the probability that a T random variable with 8 degrees of freedom is greater than the value 1.860 is 0.05.
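If you have software at hand instead of a printed table, the same t-values and probabilities can be checked with, for example, scipy (a sketch, not part of the original lesson):

from scipy.stats import t

print(t.cdf(2.306, df=8))                          # about 0.975, so P(T > 2.306) is about 0.025
print(t.ppf(0.95, df=8))                           # about 1.860, the value t_0.05(8)
print(t.cdf(2.306, df=8) - t.cdf(-2.306, df=8))    # about 0.95 = P(|T| < 2.306)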

Why will we encounter a T random variable?

Given a random sample X1,X2,,Xn from a normal distribution, we know that:

Z = (X¯ − μ)/(σ/√n) ~ N(0, 1)

Earlier in this lesson, we learned that:

U = (n−1)S²/σ²

follows a chi-square distribution with n−1 degrees of freedom. We also learned that Z and U are independent. Therefore, using the definition of a T random variable, we get:

T = Z/√(U/(n−1)) = [(X¯ − μ)/(σ/√n)] / √{[(n−1)S²/σ²]/(n−1)} = (X¯ − μ)/(S/√n)

which follows a t(n−1) distribution. It is the resulting quantity, that is:

T = (X¯ − μ)/(S/√n)

that will help us, in Stat 415, to use a mean from a random sample, that is X¯, to learn, with confidence, something about the population mean μ.


Lesson 27: The Central Limit Theorem

Lesson 27: The Central Limit Theorem

Introduction

In the previous lesson, we investigated the probability distribution ("sampling distribution") of the sample mean when the random sample X1,X2,,Xn comes from a normal population with mean μ and variance σ2, that is, when XiN(μ,σ2),i=1,2,,n. Specifically, we learned that if Xi, i=1,2,,n, is a random sample of size n from a N(μ,σ2) population, then:

X¯ ~ N(μ, σ²/n)

But what happens if the Xi follow some other non-normal distribution? For example, what distribution does the sample mean follow if the Xi come from the Uniform(0, 1) distribution? Or, what distribution does the sample mean follow if the Xi come from a chi-square distribution with three degrees of freedom? Those are the kinds of questions we'll investigate in this lesson. As the title of this lesson suggests, it is the Central Limit Theorem that will give us the answer.

Objectives

Upon completion of this lesson, you should be able to:

  • To learn the Central Limit Theorem.
  • To get an intuitive feeling for the Central Limit Theorem.
  • To use the Central Limit Theorem to find probabilities concerning the sample mean.
  • To be able to apply the methods learned in this lesson to new problems.

27.1 - The Theorem

27.1 - The Theorem

Central Limit Theorem

We don't have the tools yet to prove the Central Limit Theorem, so we'll just go ahead and state it without proof.

Let X1,X2,,Xn be a random sample from a distribution (any distribution!) with (finite) mean μ and (finite) variance σ2. If the sample size n is "sufficiently large," then:

  1. the sample mean X¯ follows an approximate normal distribution

  2. with mean E(X¯)=μX¯=μ

  3. and variance Var(X¯) = σX¯² = σ²/n

We write:

X¯ →d N(μ, σ²/n) as n → ∞

or:

Z = (X¯ − μ)/(σ/√n) = (Σ_{i=1}^{n} Xi − nμ)/(√n σ) →d N(0, 1) as n → ∞.

So, in a nutshell, the Central Limit Theorem (CLT) tells us that the sampling distribution of the sample mean is, at least approximately, normally distributed, regardless of the distribution of the underlying random sample. In fact, the CLT applies regardless of whether the distribution of the Xi is discrete (for example, Poisson or binomial) or continuous (for example, exponential or chi-square). Our focus in this lesson will be on continuous random variables. In the next lesson, we'll apply the CLT to discrete random variables, such as the binomial and Poisson random variables.

You might be wondering why "sufficiently large" appears in quotes in the theorem. Well, that's because the necessary sample size n depends on the skewness of the distribution from which the random sample Xi comes:

  1. If the distribution of the Xi is symmetric, unimodal or continuous, then a sample size n as small as 4 or 5 yields an adequate approximation.
  2. If the distribution of the Xi is skewed, then a sample size n of at least 25 or 30 yields an adequate approximation.
  3. If the distribution of the Xi is extremely skewed, then you may need an even larger n.

We'll spend the rest of the lesson trying to get an intuitive feel for the theorem, as well as applying the theorem so that we can calculate probabilities concerning the sample mean.


27.2 - Implications in Practice

27.2 - Implications in Practice

As stated on the previous page, we don't yet have the tools to prove the Central Limit Theorem. And, we won't actually get to proving it until late in Stat 415. It would be good though to get an intuitive feel now for how the CLT works in practice. On this page, we'll explore two examples to get a feel for how: [11]

  1. the skewness (or symmetry!) of the underlying distribution of Xi, and
  2. the sample size n

affect how well the normal distribution approximates the actual ("exact") distribution of the sample mean X¯. Well, that's not quite true. We won't actually find the exact distribution of the sample mean in the two examples. We'll instead use simulation to do the work for us. In the first example, we'll take a look at sample means drawn from a symmetric distribution, specifically, the Uniform(0,1) distribution. In the second example, we'll take a look at sample means drawn from a highly skewed distribution, specifically, the chi-square(3) distribution. In each case, we'll see how large the sample size n has to get before the normal distribution does a decent job of approximating the simulated distribution.

Example 27-1

Consider taking random samples of various sizes n from the (symmetric) Uniform (0, 1) distribution. At what sample size n does the normal distribution make a good approximation to the actual distribution of the sample mean?

Solution

Our previous work on the continuous Uniform(0, 1) random variable tells us that the mean of a U(0,1) random variable is:

μ = E(Xi) = (0 + 1)/2 = 1/2

while the variance of a U(0,1) random variable is:

σ² = Var(Xi) = (1 − 0)²/12 = 1/12

The Central Limit Theorem, therefore, tells us that the sample mean X¯ is approximately normally distributed with mean:

μX¯ = μ = 1/2

and variance:

σX¯² = σ²/n = (1/12)/n = 1/(12n)

Now, our end goal is to compare the normal distribution, as defined by the CLT, to the actual distribution of the sample mean. Now, we could do a lot of theoretical work to find the exact distribution of X¯ for various sample sizes n. Instead, we'll use simulation to give us a ballpark idea of the shape of the distribution of X¯. Here's an outline of the general strategy that we'll follow:

  1. Specify the sample size n.
  2. Randomly generate 1000 samples of size n from the Uniform (0,1) distribution.
  3. Use the 1000 generated samples to calculate 1000 sample means from the Uniform (0,1) distribution.
  4. Create a histogram of the 1000 sample means.
  5. Compare the histogram to the normal distribution, as defined by the Central Limit Theorem, in order to see how well the Central Limit Theorem works for the given sample size n.
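The five-step strategy above can be sketched in a few lines of Python (the original text uses Minitab; the function name, seed, and printed summaries below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(seed=6)

def simulate_uniform_means(n, n_samples=1000):
    # Steps 2 and 3: generate n_samples samples of size n from Uniform(0, 1)
    # and return the n_samples sample means.
    samples = rng.uniform(0.0, 1.0, size=(n_samples, n))
    return samples.mean(axis=1)

for n in (1, 2, 4, 9, 16):
    means = simulate_uniform_means(n)
    # The CLT comparison values are mean 1/2 and variance 1/(12n)
    print(n, means.mean(), means.var(), 1 / (12 * n))

A histogram of each set of means (step 4) can then be drawn with any plotting tool; the printed variances already show the 1/(12n) pattern used in the comparisons below.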

Let's start with a sample size of n=1. That is, randomly sample 1000 numbers from a Uniform (0,1) distribution, and create a histogram of the 1000 generated numbers. Of course, the histogram should look roughly flat like a Uniform(0,1) distribution. If you're willing to ignore the artifacts of sampling, you can see that our histogram is roughly flat:

[Histogram of 1000 random numbers from the Uniform(0,1) distribution (n=1); roughly flat]

Okay, now let's tackle the more interesting sample sizes. Let n=2. Generating 1000 samples of size n=2, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=2), with the CLT normal curve overlaid]

It can actually be shown that the exact distribution of the sample mean of 2 numbers drawn from the Uniform(0, 1) distribution is the triangular distribution. The histogram does look a bit triangular, doesn't it? The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯ = μ = 1/2

and variance:

σX¯² = 1/(12n) = 1/(12(2)) = 1/24

As you can see, already at n=2, the normal curve wouldn't do too bad of a job of approximating the exact probabilities. Let's increase the sample size to n=4. Generating 1000 samples of size n=4, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=4), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯ = μ = 1/2

and variance:

σX¯² = 1/(12n) = 1/(12(4)) = 1/48

Again, at n=4, the normal curve does a very good job of approximating the exact probabilities. In fact, it does such a good job, that we could probably stop this exercise already. But let's increase the sample size to n=9. Generating 1000 samples of size n=9, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=9), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯ = μ = 1/2

and variance:

σX¯² = 1/(12n) = 1/(12(9)) = 1/108

And not surprisingly, at n=9, the normal curve does a very good job of approximating the exact probabilities. There is another interesting thing worth noting though, too. As you can see, as the sample size increases, the variance of the sample mean decreases. That's a good thing, as it doesn't seem that it should be any other way. If you think about it, if it were possible to increase the sample size n to something close to the size of the population, you would expect that the resulting sample means would not vary much, and would be close to the population mean. Of course, the trade-off here is that large sample sizes typically cost lots more money than small sample sizes.

Well, just for the heck of it, let's increase our sample size one more time to n=16. Generating 1000 samples of size n=16, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=16), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution with mean:

μX¯ = μ = 1/2

and variance:

σX¯² = 1/(12n) = 1/(12(16)) = 1/192

Again, at n=16, the normal curve does a very good job of approximating the exact probabilities. Okay, uncle! That's enough of this example! Let's summarize the two take-away messages from this example:

  1. If the underlying distribution is symmetric, then you don't need a very large sample size for the normal distribution, as defined by the Central Limit Theorem, to do a decent job of approximating the probability distribution of the sample mean.
  2. The larger the sample size n, the smaller the variance of the sample mean.

Example 27-2

Now consider taking random samples of various sizes n from the (skewed) chi-square distribution with 3 degrees of freedom. At what sample size n does the normal distribution make a good approximation to the actual distribution of the sample mean?

Solution

We are going to do exactly what we did in the previous example. The only difference is that our underlying distribution here, that is, the chi-square(3) distribution, is highly-skewed. Now, our previous work on the chi-square distribution tells us that the mean of a chi-square random variable with three degrees of freedom is:

μ=E(Xi)=r=3

while the variance of a chi-square random variable with three degrees of freedom is:

σ2=Var(Xi)=2r=2(3)=6

The Central Limit Theorem, therefore, tells us that the sample mean X¯ is approximately normally distributed with mean:

μX¯=μ=3

and variance:

σX¯² = σ²/n = 6/n

Again, we'll follow a strategy similar to that in the above example, namely:

  1. Specify the sample size n.
  2. Randomly generate 1000 samples of size n from the chi-square(3) distribution.
  3. Use the 1000 generated samples to calculate 1000 sample means from the chi-square(3) distribution.
  4. Create a histogram of the 1000 sample means.
  5. Compare the histogram to the normal distribution, as defined by the Central Limit Theorem, in order to see how well the Central Limit Theorem works for the given sample size n.

Again, starting with a sample size of n=1, we randomly sample 1000 numbers from a chi-square(3) distribution, and create a histogram of the 1000 generated numbers. Of course, the histogram should look like a (skewed) chi-square(3) distribution, as the blue curve suggests it does:

[Histogram of 1000 random numbers from the chi-square(3) distribution, with its skewed density curve overlaid]

Now, let's consider samples of size n=2. Generating 1000 samples of size n=2, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=2), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯=μ=3

and variance:

σX¯² = σ²/n = 6/2 = 3

As you can see, at n=2, the normal curve wouldn't do a very good job of approximating the exact probabilities. The probability distribution of the sample mean still appears to be quite skewed. Let's increase the sample size to n=4. Generating 1000 samples of size n=4, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=4), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯=μ=3

and variance:

σX¯² = σ²/n = 6/4 = 1.5

Although, at n=4, the normal curve is doing a better job of approximating the probability distribution of the sample mean, there is still much room for improvement. Let's try n=9. Generating 1000 samples of size n=9, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=9), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯=μ=3

and variance:

σX¯² = σ²/n = 6/9 = 0.667

We're getting closer, but let's really jump up the sample size to, say, n=25. Generating 1000 samples of size n=25, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=25), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯=μ=3

and variance:

σX¯² = σ²/n = 6/25 = 0.24

Okay, now we're talking! There's still just a teeny tiny bit of skewness in the sampling distribution. Let's increase the sample size just one more time to, say, n=36. Generating 1000 samples of size n=36, calculating the 1000 sample means, and creating a histogram of the 1000 sample means, we get:

[Histogram of the 1000 sample means (n=36), with the CLT normal curve overlaid]

The blue curve overlaid on the histogram is the normal distribution, as defined by the Central Limit Theorem. That is, the blue curve is the normal distribution with mean:

μX¯=μ=3

and variance:

σX¯² = σ²/n = 6/36 = 0.167

Okay, now, I'm perfectly happy! It appears that, at n=36, the normal curve does a very good job of approximating the exact probabilities. Let's summarize the two take-away messages from this example:

  1. Again, the larger the sample size n, the smaller the variance of the sample mean. Nothing new there.
  2. If the underlying distribution is skewed, then you need a larger sample size, typically n>30, for the normal distribution, as defined by the Central Limit Theorem, to do a decent job of approximating the probability distribution of the sample mean.

27.3 - Applications in Practice

27.3 - Applications in Practice

Now that we have an intuitive feel for the Central Limit Theorem, let's use it in two different examples. In the first example, we use the Central Limit Theorem to describe how the sample mean behaves, and then use that behavior to calculate a probability. In the second example, we take a look at the most common use of the CLT, namely to use the theorem to test a claim.

Example 27-3


Take a random sample of size n=15 from a distribution whose probability density function is:

f(x) = (3/2)x²

for −1 < x < 1. What is the probability that the sample mean falls between −2/5 and 1/5?

Solution

The expected value of the random variable X is 0, as the following calculation illustrates:

μ = E(X) = ∫_{−1}^{1} x · (3/2)x² dx = (3/2) ∫_{−1}^{1} x³ dx = (3/2) [x⁴/4]_{x=−1}^{x=1} = (3/2)(1/4 − 1/4) = 0

The variance of the random variable X is 3/5, as the following calculation illustrates:

σ² = E(X − μ)² = ∫_{−1}^{1} (x − 0)² (3/2)x² dx = (3/2) ∫_{−1}^{1} x⁴ dx = (3/2) [x⁵/5]_{x=−1}^{x=1} = (3/2)(1/5 + 1/5) = 3/5

Therefore, the CLT tells us that the sample mean X¯ is approximately normal with mean:

E(X¯)=μX¯=μ=0

and variance:

Var(X¯) = σX¯² = σ²/n = (3/5)/15 = 3/75 = 1/25

Therefore the standard deviation of X¯ is 1/5. Drawing a picture of the desired probability:

[Plot: standard normal curve; on the X¯ scale, −2/5 and 1/5 correspond to z = −2 and z = 1]

we see that:

P(−2/5 < X¯ < 1/5) = P(−2 < Z < 1)

Therefore, using the standard normal table, we get:

P(−2/5 < X¯ < 1/5) = P(Z < 1) − P(Z < −2) = 0.8413 − 0.0228 = 0.8185

That is, there is an 81.85% chance that a random sample of size 15 from the given distribution will yield a sample mean between −2/5 and 1/5.
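A quick software check of this calculation (a scipy sketch, not part of the original text):

import numpy as np
from scipy.stats import norm

sd = np.sqrt((3/5) / 15)   # standard deviation of the sample mean, sqrt(1/25) = 1/5

p = norm.cdf(1/5, loc=0, scale=sd) - norm.cdf(-2/5, loc=0, scale=sd)
print(round(p, 4))         # about 0.8186; the table-based answer 0.8185 differs only by rounding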

Example 27-4


Let Xi denote the waiting time (in minutes) for the ith customer. An assistant manager claims that μ, the average waiting time of the entire population of customers, is 2 minutes. The manager doesn't believe his assistant's claim, so he observes a random sample of 36 customers. The average waiting time for the 36 customers is 3.2 minutes. Should the manager reject his assistant's claim (... and fire him)?

Solution

It is reasonable to assume that Xi is an exponential random variable. And, based on the assistant manager's claim, the mean of Xi is:

μ=θ=2.

Therefore, knowing what we know about exponential random variables, the variance of Xi is:

σ2=θ2=22=4.

Now, we need to know, if the mean μ really is 2, as the assistant manager claims, what is the probability that the manager would obtain a sample mean as large as (or larger than) 3.2 minutes? Well, the Central Limit Theorem tells us that the sample mean X¯ is approximately normally distributed with mean:

μX¯=2

and variance:

σX¯² = σ²/n = 4/36 = 1/9

Here's a picture, then, of the normal probability that we need to determine:

[Plot: normal curve for X¯ centered at 2, with the area to the right of 3.2 shaded]

z = (3.2 − 2)/√(1/9) = 3.6

That is:

P(X¯>3.2)=P(Z>3.6)

The Z value in this case is so extreme that the table in the back of our text book can't help us find the desired probability. But, using statistical software, such as Minitab, we can determine that:

P(X¯>3.2)=P(Z>3.6)=0.00016

That is, if the population mean μ really is 2, then there is only a 16/100,000 chance (0.016%) of getting such a large sample mean. It would be quite reasonable, therefore, for the manager to reject his assistant's claim that the mean μ is 2. The manager should feel comfortable concluding that the population mean μ really is greater than 2. We will leave it up to him to decide whether or not he should fire his assistant!
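For reference, the same normal probability can be computed directly (a scipy sketch, not part of the original text):

import numpy as np
from scipy.stats import norm

z = (3.2 - 2) / np.sqrt(4 / 36)       # = 3.6
print(z, round(1 - norm.cdf(z), 5))   # 3.6 and about 0.00016 = P(X-bar > 3.2) if mu really is 2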

By the way, this is the kind of example that we'll see when we study hypothesis testing in Stat 415. In general, in the process of performing a hypothesis test, someone makes a claim (the assistant, in this case), and someone collects and uses the data (the manager, in this case) to make a decision about the validity of the claim. It just so happens to be that we used the CLT in this example to help us make a decision about the assistant's claim. [12]


Lesson 28: Approximations for Discrete Distributions

Lesson 28: Approximations for Discrete Distributions

Overview

In the previous lesson, we explored the Central Limit Theorem, which states that if X1,X2,,Xn is a random sample of "sufficient" size n from a population whose mean is μ and standard deviation is σ, then:

Z = (X¯ − μ)/(σ/√n) = (Σ_{i=1}^{n} Xi − nμ)/(√n σ) →d N(0, 1)

In that lesson, all of the examples concerned continuous random variables. In this lesson, our focus will be on applying the Central Limit Theorem to discrete random variables. In particular, we will investigate how to use the normal distribution to approximate binomial probabilities and Poisson probabilities.

Objectives

Upon completion of this lesson, you should be able to:

  • To learn how to use the normal distribution to approximate binomial probabilities.
  • To learn how to use the normal distribution to approximate Poisson probabilities.
  • To be able to apply the methods learned in this lesson to new problems.

28.1 - Normal Approximation to Binomial

28.1 - Normal Approximation to Binomial

As the title of this page suggests, we will now focus on using the normal distribution to approximate binomial probabilities. The Central Limit Theorem is the tool that allows us to do so. As usual, we'll use an example to motivate the material.

Example 28-1

[Image: the White House]

Let Xi denote whether or not a randomly selected individual approves of the job the President is doing. More specifically:

  • Let Xi=1, if the person approves of the job the President is doing, with probability p
  • Let Xi = 0, if the person does not approve of the job the President is doing, with probability 1 − p

Then, recall that Xi is a Bernoulli random variable with mean:

\[\mu = E(X) = (0)(1-p) + (1)(p) = p\]

and variance:

\[\sigma^2 = \text{Var}(X) = E\left[(X-p)^2\right] = (0-p)^2(1-p) + (1-p)^2(p) = p(1-p)\left[p + 1 - p\right] = p(1-p)\]

Now, take a random sample of n people, and let:

\[Y = X_1 + X_2 + \cdots + X_n\]

Then Y is a binomial(n, p) random variable, y = 0, 1, 2, ..., n, with mean:

μ=np

and variance:

\[\sigma^2 = np(1-p)\]

Now, let n = 10 and p = 1/2, so that Y is binomial(10, 1/2). What is the probability that exactly five people approve of the job the President is doing?

Solution

There is really nothing new here. We can calculate the exact probability using the binomial table in the back of the book with n = 10 and p = 1/2. Doing so, we get:

\[P(Y=5) = P(Y \le 5) - P(Y \le 4) = 0.6230 - 0.3770 = 0.2460\]

That is, there is a 24.6% chance that exactly five of the ten people selected approve of the job the President is doing.

Note, however, that Y in the above example is defined as a sum of independent, identically distributed random variables. Therefore, as long as n is sufficiently large, we can use the Central Limit Theorem to calculate probabilities for Y. Specifically, the Central Limit Theorem tells us that:

\[Z = \frac{Y - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0,1)\]

Let's use the normal distribution then to approximate some probabilities for Y. Again, what is the probability that exactly five people approve of the job the President is doing?

Solution

First, recognize in our case that the mean is:

\[\mu = np = 10\left(\frac{1}{2}\right) = 5\]

and the variance is:

\[\sigma^2 = np(1-p) = 10\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = 2.5\]

Now, if we look at a graph of the binomial distribution with the rectangle corresponding to Y=5 shaded in red:

[Figure: histogram of the binomial(10, 1/2) distribution with a normal curve (mean 5, standard deviation 1.581) superimposed; the bar at Y = 5 is shaded in red]

we should see that we would benefit from making some kind of correction for the fact that we are using a continuous distribution to approximate a discrete distribution. Specifically, it seems that the rectangle Y=5 really includes any Y greater than 4.5 but less than 5.5. That is:

P(Y=5)=P(4.5<Y<5.5)

Such an adjustment is called a "continuity correction." Once we've made the continuity correction, the calculation reduces to a normal probability calculation:

\[P(Y=5) = P(4.5 < Y < 5.5) = P\left(\frac{4.5-5}{\sqrt{2.5}} < Z < \frac{5.5-5}{\sqrt{2.5}}\right) = P(-0.32 < Z < 0.32) = 0.6255 - 0.3745 = 0.2510\]

Now, recall that we previously used the binomial distribution to determine that the probability that Y = 5 is exactly 0.246. Here, we used the normal distribution to determine that the probability that Y = 5 is approximately 0.251. That's not too shabby of an approximation, in light of the fact that we are dealing with a relatively small sample size of n = 10!
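A brief software check of this comparison (an illustrative sketch using Python's scipy, which the text does not mention; the small difference from 0.251 comes from rounding z to two decimal places when using the table):

```python
from scipy.stats import binom, norm

n, p = 10, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5      # 5 and sqrt(2.5), about 1.581

# Exact binomial probability that Y = 5
exact = binom.pmf(5, n, p)                                      # 0.2461

# Normal approximation with the continuity correction: P(4.5 < Y < 5.5)
approx = norm.cdf(5.5, mu, sigma) - norm.cdf(4.5, mu, sigma)    # about 0.248

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```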

Let's try a few more approximations. What is the probability that more than 7, but at most 9, of the ten people sampled approve of the job the President is doing?

Solution

If we look at a graph of the binomial distribution with the area corresponding to 7 < Y ≤ 9 shaded in red:

[Figure: histogram of the binomial(10, 1/2) distribution with a normal curve (mean 5, standard deviation 1.581) superimposed; the bars for 7 < Y ≤ 9 are shaded in red]

we should see that we'll want to make the following continuity correction:

\[P(7 < Y \le 9) = P(7.5 < Y < 9.5)\]

Now again, once we've made the continuity correction, the calculation reduces to a normal probability calculation:

\[P(7 < Y \le 9) = P(7.5 < Y < 9.5) = P\left(\frac{7.5-5}{\sqrt{2.5}} < Z < \frac{9.5-5}{\sqrt{2.5}}\right) = P(1.58 < Z < 2.85) = 0.9978 - 0.9429 = 0.0549\]

By the way, you might find it interesting to note that the approximate normal probability is quite close to the exact binomial probability. We showed that the approximate probability is 0.0549, whereas the following calculation shows that the exact probability (using the binomial table with n = 10 and p = 1/2) is 0.0537:

\[P(7 < Y \le 9) = P(Y \le 9) - P(Y \le 7) = 0.9990 - 0.9453 = 0.0537\]

Let's try one more approximation. What is the probability that at least 2, but less than 4, of the ten people sampled approve of the job the President is doing?

Solution

If we look at a graph of the binomial distribution with the area corresponding to 2 ≤ Y < 4 shaded in red:

[Figure: histogram of the binomial(10, 1/2) distribution with a normal curve (mean 5, standard deviation 1.581) superimposed; the bars for 2 ≤ Y < 4 are shaded in red]

we should see that we'll want to make the following continuity correction:

\[P(2 \le Y < 4) = P(1.5 < Y < 3.5)\]

Again, once we've made the continuity correction, the calculation reduces to a normal probability calculation:

\[P(2 \le Y < 4) = P(1.5 < Y < 3.5) = P\left(\frac{1.5-5}{\sqrt{2.5}} < Z < \frac{3.5-5}{\sqrt{2.5}}\right) = P(-2.21 < Z < -0.95) = P(Z > 0.95) - P(Z > 2.21) = 0.1711 - 0.0136 = 0.1575\]

By the way, the exact binomial probability is 0.1612, as the following calculation illustrates:

\[P(2 \le Y < 4) = P(Y \le 3) - P(Y \le 1) = 0.1719 - 0.0107 = 0.1612\]
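Because every one of these calculations follows the same pattern, a tiny helper function makes the comparisons easy to reproduce. This is just an illustrative Python sketch (the function name and structure are ours, not the text's):

```python
from scipy.stats import binom, norm

def normal_approx(lo, hi, n, p):
    """Continuity-corrected normal approximation to P(lo <= Y <= hi)
    for Y ~ binomial(n, p), where lo and hi are integer counts."""
    mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
    return norm.cdf(hi + 0.5, mu, sigma) - norm.cdf(lo - 0.5, mu, sigma)

n, p = 10, 0.5

# P(7 < Y <= 9) is the same event as P(8 <= Y <= 9)
print(normal_approx(8, 9, n, p), binom.cdf(9, n, p) - binom.cdf(7, n, p))   # ~0.055 vs ~0.0537

# P(2 <= Y < 4) is the same event as P(2 <= Y <= 3)
print(normal_approx(2, 3, n, p), binom.cdf(3, n, p) - binom.cdf(1, n, p))   # ~0.158 vs ~0.161
```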

Just a couple of comments before we close our discussion of the normal approximation to the binomial.

(1) First, we have not yet discussed what "sufficiently large" means in terms of when it is appropriate to use the normal approximation to the binomial. The general rule of thumb is that the sample size n is "sufficiently large" if:

\[np \ge 5 \quad \text{and} \quad n(1-p) \ge 5\]

For example, in the above example, in which p=0.5, the two conditions are met if:

\[np = n(0.5) \ge 5 \quad \text{and} \quad n(1-p) = n(0.5) \ge 5\]

Now, both conditions are true if:

\[n \ge 5\left(\frac{10}{5}\right) = 10\]

Because our sample size was at least 10 (well, barely!), we now see why our approximations were quite close to the exact probabilities. In general, the farther p is away from 0.5, the larger the sample size n that is needed. For example, suppose p = 0.1. Then, the two conditions are met if:

\[np = n(0.1) \ge 5 \quad \text{and} \quad n(1-p) = n(0.9) \ge 5\]

Now, the first condition is met if:

\[n \ge 5(10) = 50\]

And, the second condition is met if:

\[n \ge 5\left(\frac{10}{9}\right) \approx 5.6\]

That is, the only way both conditions are met is if n ≥ 50. So, in summary, when p = 0.5, a sample size of n = 10 is sufficient. But, if p = 0.1, then we need a much larger sample size, namely n = 50.
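If you'd like to experiment with this rule of thumb, here is a small helper (purely illustrative, not from the text) that returns the smallest sample size satisfying both conditions for a given p:

```python
import math

def min_sample_size(p):
    """Smallest n satisfying both np >= 5 and n(1 - p) >= 5."""
    return math.ceil(max(5 / p, 5 / (1 - p)))

print(min_sample_size(0.5))   # 10
print(min_sample_size(0.1))   # 50
```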

(2) In truth, if you have the available tools, such as a binomial table or a statistical package, you'll probably want to calculate exact probabilities instead of approximate probabilities. Does that mean all of our discussion here is for naught? No, not at all! In reality, we'll most often use the Central Limit Theorem as applied to the sum of independent Bernoulli random variables to help us draw conclusions about a true population proportion p. If we take the Z random variable that we've been dealing with above, and divide the numerator by n and the denominator by n (and thereby not changing the overall quantity), we get the following result:

\[Z = \frac{\sum_{i=1}^{n} X_i - np}{\sqrt{np(1-p)}} = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}} \xrightarrow{d} N(0,1)\]

The quantity:

\[\hat{p} = \frac{\sum_{i=1}^{n} X_i}{n}\]

that appears in the numerator is the "sample proportion," that is, the proportion in the sample meeting the condition of interest (approving of the President's job, for example). In Stat 415, we'll use the sample proportion in conjunction with the above result to draw conclusions about the unknown population proportion p. You'll definitely be seeing much more of this in Stat 415!
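To make the sample-proportion form of the statistic concrete, here is a small sketch; the survey numbers (56 approvals out of a sample of 100, with a hypothesized p of 0.5) are made up purely for illustration:

```python
from scipy.stats import norm

n = 100          # hypothetical sample size (illustrative)
approvals = 56   # hypothetical number of "approve" responses (illustrative)
p0 = 0.5         # hypothesized population proportion (illustrative)

p_hat = approvals / n
z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # the standardized sample proportion

# Approximate probability of a sample proportion at least this large if p really is p0
print(f"p-hat = {p_hat:.2f}, z = {z:.2f}, P(Z > z) = {norm.sf(z):.3f}")   # z = 1.20, about 0.115
```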


28.2 - Normal Approximation to Poisson

28.2 - Normal Approximation to Poisson

Just as the Central Limit Theorem can be applied to the sum of independent Bernoulli random variables, it can be applied to the sum of independent Poisson random variables. Suppose Y denotes the number of events occurring in an interval with mean λ and variance λ. Now, if X1, X2, ..., Xλ are independent Poisson random variables with mean 1, then:

\[Y = \sum_{i=1}^{\lambda} X_i\]

is a Poisson random variable with mean λ. (If you're not convinced of that claim, you might want to go back and review the homework for the lesson on The Moment Generating Function Technique, in which we showed that the sum of independent Poisson random variables is a Poisson random variable.) So, now that we've written Y as a sum of independent, identically distributed random variables, we can apply the Central Limit Theorem. Specifically, when λ is sufficiently large:

\[Z = \frac{Y - \lambda}{\sqrt{\lambda}} \xrightarrow{d} N(0,1)\]

We'll use this result to approximate Poisson probabilities using the normal distribution.

Example 28-2

[Image: building collapsed from an earthquake]

The annual number of earthquakes registering at least 2.5 on the Richter Scale and having an epicenter within 40 miles of downtown Memphis follows a Poisson distribution with mean 6.5. What is the probability that at least 9 such earthquakes will strike next year? (Adapted from An Introduction to Mathematical Statistics, by Richard J. Larsen and Morris L. Marx.)

Solution

We can, of course, use the Poisson distribution to calculate the exact probability. Using the Poisson table with λ = 6.5, we get:

\[P(Y \ge 9) = 1 - P(Y \le 8) = 1 - 0.792 = 0.208\]

Now, let's use the normal approximation to the Poisson to calculate an approximate probability. First, we have to make a continuity correction. Doing so, we get:

\[P(Y \ge 9) = P(Y > 8.5)\]

Once we've made the continuity correction, the calculation again reduces to a normal probability calculation:

\[P(Y \ge 9) = P(Y > 8.5) = P\left(Z > \frac{8.5 - 6.5}{\sqrt{6.5}}\right) = P(Z > 0.78) = 0.218\]

So, in summary, we used the Poisson distribution to determine that the probability that Y is at least 9 is exactly 0.208, and we used the normal distribution to determine that the probability that Y is at least 9 is approximately 0.218. Not too shabby of an approximation!
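The same comparison can be reproduced in a couple of lines of Python (an illustrative sketch; the text itself uses the Poisson and normal tables, and the small difference from 0.218 comes from rounding z to 0.78 when using the table):

```python
from scipy.stats import poisson, norm

lam = 6.5

exact = poisson.sf(8, lam)                # P(Y >= 9) = 1 - P(Y <= 8), about 0.208
approx = norm.sf(8.5, lam, lam ** 0.5)    # continuity-corrected normal, about 0.216

print(f"exact = {exact:.3f}, normal approximation = {approx:.3f}")
```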

