Lesson 19: Conditional Distributions
Overview
In the last two lessons, we've concerned ourselves with how two random variables \(X\) and \(Y\) behave jointly. We'll now turn to investigating how one of the random variables, say \(Y\), behaves given that another random variable, say \(X\), has already behaved in a certain way. In the discrete case, for example, we might want to know the probability that \(Y\), the number of car accidents in July on a particular curve in the road, equals 2 given that \(X\), the number of cars caught speeding on the curve in June, is more than 50. Of course, our previous work investigating conditional probability will help us here. We will now extend the idea of conditional probability that we learned previously to the idea of finding a conditional probability distribution of a random variable \(Y\) given another random variable \(X\).
Objectives
 To learn the distinction between a joint probability distribution and a conditional probability distribution.
 To recognize that a conditional probability distribution is simply a probability distribution for a subpopulation.
 To learn the formal definition of a conditional probability mass function of a discrete r.v. \(Y\) given a discrete r.v. \(X\).
 To learn how to calculate the conditional mean and conditional variance of a discrete r.v. \(Y\) given a discrete r.v. \(X\).
 To be able to apply the methods learned in the lesson to new problems.
19.1  What is a Conditional Distribution?
Let's start our investigation of conditional distributions by using an example to help enlighten us about the distinction between a joint (bivariate) probability distribution and a conditional probability distribution.
Example 19-1
A Safety Officer for an auto insurance company in Connecticut was interested in learning how the extent of an individual's injury in an automobile accident relates to the type of safety restraint the individual was wearing at the time of the accident. As a result, the Safety Officer used statewide ambulance and police records to compile the following two-way table of joint probabilities:
Joint probabilities \(f(x,y)\), with extent of injury \(X\) in the rows and type of restraint \(Y\) in the columns:
Extent of Injury (X)  None (0)  Belt Only (1)  Belt and Harness (2)  \(f_X(x)\)
None (0)  0.065  0.075  0.06  0.20
Minor (1)  0.175  0.16  0.115  0.45
Major (2)  0.135  0.10  0.065  0.30
Death (3)  0.025  0.015  0.01  0.05
\(f_Y(y)\)  0.40  0.35  0.25  1.00
For the sake of understanding the Safety Officer's terminology, let's assume that "Belt only" means that the person was only using the lap belt, whereas "Belt and Harness" should be taken to mean that the person was using a lap belt and shoulder strap. (These data must have been collected a loooonnnggg time ago when such an option was legal!) Also, note that the Safety Officer created the random variable \(X\), the extent of injury, by arbitrarily assigning values 0, 1, 2, and 3 to each of the possible outcomes None, Minor, Major, and Death. Similarly, the Safety Officer created the random variable \(Y\), the type of restraint, by arbitrarily assigning values 0, 1, and 2 to each of the possible outcomes None, Belt Only, and Belt and Harness.
Among other things, the Safety Officer was interested in answering the following questions:
 What is the probability that a randomly selected person in an automobile accident was wearing a seat belt and had only a minor injury?
 If a randomly selected person wears no restraint, what is the probability of death?
 If a randomly selected person sustains no injury, what is the probability the person was wearing a belt and harness?
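Before we can help the Safety Officer answer his questions, it will help to have a couple of (informal) definitions under our belt.
 Joint (Bivariate) Probability Distribution
 A joint (bivariate) probability distribution describes the probability that a randomly selected person from the population has the two characteristics of interest.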
There is actually nothing really new here. We should know by now not only informally, but also formally, the definition of a bivariate probability distribution.
Example (continued)
What is the probability a randomly selected person in an accident was wearing a seat belt and had only a minor injury?
Solution
Let \(A\) = the event that a randomly selected person in a car accident has a minor injury. Let \(B\) = the event that the randomly selected person was wearing only a seat belt. Then, just reading the value right off of the Safety Officer's table, we get:
\(P(A\text{ and }B)=P(X=1, Y=1)=f(1, 1)=0.16\)
That is, there is a 16% chance that a randomly selected person in an accident is wearing a seat belt and has only a minor injury.
Now, of course, in order to define the joint probability distribution of \(X\) and \(Y\) fully, we'd need to find the probability that \(X=x\) and \(Y=y\) for each element in the joint support \(S\), not just for one element \(X=1\) and \(Y=1\). But, that's not our point here. Here, we are revisiting the meaning of the joint probability distribution of \(X\) and \(Y\) just so we can distinguish between it and a conditional probability distribution.
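As a rough illustration of what "defining the joint distribution fully" involves, the sketch below stores every cell of the Safety Officer's table and recovers the marginal distributions by summing rows and columns. The variable names (`rows`, `f`, `f_X`, `f_Y`) are illustrative choices, not part of the lesson, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

# Joint probabilities f(x, y) copied from the Safety Officer's table.
# Row index x = extent of injury (0 None, 1 Minor, 2 Major, 3 Death).
# Column index y = type of restraint (0 None, 1 Belt Only, 2 Belt and Harness).
rows = [
    ["0.065", "0.075", "0.06"],
    ["0.175", "0.16", "0.115"],
    ["0.135", "0.10", "0.065"],
    ["0.025", "0.015", "0.01"],
]
f = {(x, y): Fraction(p) for x, row in enumerate(rows) for y, p in enumerate(row)}

# Marginal distributions: sum the joint probabilities over the other variable.
f_X = {x: sum(f[x, y] for y in range(3)) for x in range(4)}
f_Y = {y: sum(f[x, y] for x in range(4)) for y in range(3)}

print("f(1, 1) =", float(f[1, 1]))             # minor injury and belt only
print("f_X:", {x: float(p) for x, p in f_X.items()})
print("f_Y:", {y: float(p) for y, p in f_Y.items()})
print("total probability is 1:", sum(f.values()) == 1)
```

The marginals computed this way should match the table's margins, which is a quick sanity check that the twelve cells were entered correctly.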
 Conditional Probability Distribution
 A conditional probability distribution is a probability distribution for a subpopulation. That is, a conditional probability distribution describes the probability that a randomly selected person from a subpopulation has the one characteristic of interest.
Example (continued)
If a randomly selected person wears no restraint, what is the probability of death?
Solution
As you can see, the Safety Officer wants to know a conditional probability. So, we need to use the definition of conditional probability to calculate the desired probability. But, let's first dissect the Safety Officer's question into two parts by identifying the subpopulation and the characteristic of interest. Here, the subpopulation is the population of people wearing no restraints (\(NR\)), and the characteristic of interest is death (\(D\)). Then, using the definition of conditional probability, we determine that the desired probability is:
\(P(D|NR)=\dfrac{P(D \cap NR)}{P(NR)}=\dfrac{P(X=3,Y=0)}{P(Y=0)}=\dfrac{f(3,0)}{f_Y(0)}=\dfrac{0.025}{0.40}=0.0625\)
That is, there is a 6.25% chance of death of a randomly selected person in an automobile accident, if the person wears no restraint.
In order to define the conditional probability distribution of \(X\) given \(Y\) fully, we'd need to find the probability that \(X=x\) given \(Y=y\) for each element in the joint support \(S\), not just for one element \(X=3\) and \(Y=0\). But, again, that's not our point here. Here, we are simply trying to get the feel of how a conditional probability distribution describes the probability that a randomly selected person from a subpopulation has the one characteristic of interest.
Example (continued)
If a randomly selected person sustains no injury, what is the probability the person was wearing a seatbelt and harness?
Solution
Again, the Safety Officer wants to know a conditional probability. Let's again first dissect the Safety Officer's question into two parts by identifying the subpopulation and the characteristic of interest. Here, the subpopulation is the population of people sustaining no injury (\(NI\)), and the characteristic of interest is wearing a seatbelt and harness (\(SH\)). Then, again using the definition of conditional probability, we determine that the desired probability is:
\(P(SH|NI)=\dfrac{P(SH \cap NI)}{P(NI)}=\dfrac{P(X=0,Y=2)}{P(X=0)}=\dfrac{f(0,2)}{f_X(0)}=\dfrac{0.06}{0.20}=0.30\)
That is, there is a 30% chance that a randomly selected person in an automobile accident is wearing a seatbelt and harness, if the person sustains no injury.
Again, in order to define the conditional probability distribution of \(Y\) given \(X\) fully, we'd need to find the probability that \(Y=y\) given \(X=x\) for each element in the joint support of \(S\), not just for one element \(X=0\) and \(Y=2\). But, again, that's not our point here. Here, we are again simply trying to get the feel of how a conditional probability distribution describes the probability that a randomly selected person from a subpopulation has the one characteristic of interest.
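The two conditional probabilities worked out for the Safety Officer can be reproduced with a few lines of Python. This is only a sketch: the joint table is re-entered from the lesson, and the names used here (`p_death_given_no_restraint` and so on) are made up for the illustration. It also shows what a full conditional distribution looks like for one subpopulation, the people wearing no restraint.

```python
from fractions import Fraction

# Joint probabilities f(x, y): x = extent of injury (0..3), y = restraint (0..2).
rows = [
    ["0.065", "0.075", "0.06"],
    ["0.175", "0.16", "0.115"],
    ["0.135", "0.10", "0.065"],
    ["0.025", "0.015", "0.01"],
]
f = {(x, y): Fraction(p) for x, row in enumerate(rows) for y, p in enumerate(row)}
f_X = {x: sum(f[x, y] for y in range(3)) for x in range(4)}
f_Y = {y: sum(f[x, y] for x in range(4)) for y in range(3)}

# P(death | no restraint) = f(3, 0) / f_Y(0)
p_death_given_no_restraint = f[3, 0] / f_Y[0]
# P(belt and harness | no injury) = f(0, 2) / f_X(0)
p_harness_given_no_injury = f[0, 2] / f_X[0]

print(float(p_death_given_no_restraint))  # 0.0625
print(float(p_harness_given_no_injury))   # 0.3

# Full conditional distribution of X for the Y = 0 (no restraint) subpopulation.
injury_given_no_restraint = [f[x, 0] / f_Y[0] for x in range(4)]
print([float(p) for p in injury_given_no_restraint])
print("sums to 1:", sum(injury_given_no_restraint) == 1)
```

Dividing every cell in the \(Y=0\) column by the same marginal \(f_Y(0)\) is what rescales that column into a distribution for the subpopulation.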
19.2  Definitions
Now that we've digested the concept of a conditional probability distribution informally, let's define it formally for discrete random variables \(X\) and \(Y\). Later, we'll extend the definition to continuous random variables \(X\) and \(Y\).
 Conditional probability mass function of \(X\)

The conditional probability mass function of \(X\), given that \(Y=y\), is defined by:
\(g(x|y)=\dfrac{f(x,y)}{f_Y(y)}\qquad \text{provided } f_Y(y)>0\)
Similarly,
 Conditional probability mass function of \(Y\)

The conditional probability mass function of \(Y\), given that \(X=x\), is defined by:
\(h(y|x)=\dfrac{f(x,y)}{f_X(x)}\qquad \text{provided } f_X(x)>0\)
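The proviso that the marginal probability be positive matters in practice: dividing by a zero marginal is undefined. A minimal sketch of the definition as a function follows. The function name `conditional_pmf_x_given_y` and the toy joint pmf are hypothetical, invented purely to show the guard; they do not come from the lesson.

```python
from fractions import Fraction

def conditional_pmf_x_given_y(f, y):
    """Return {x: g(x|y)} for a joint pmf f stored as {(x, y): probability}."""
    # Marginal f_Y(y): add up the joint probabilities in the chosen y "column".
    f_Y_at_y = sum(p for (_, y2), p in f.items() if y2 == y)
    if f_Y_at_y <= 0:
        raise ValueError(f"g(x|{y}) is undefined because f_Y({y}) = 0")
    return {x: p / f_Y_at_y for (x, y2), p in f.items() if y2 == y}

# Hypothetical toy joint pmf in which Y = 1 never occurs.
toy = {
    (0, 0): Fraction(1, 4), (1, 0): Fraction(3, 4),
    (0, 1): Fraction(0), (1, 1): Fraction(0),
}

print(conditional_pmf_x_given_y(toy, 0))
try:
    conditional_pmf_x_given_y(toy, 1)
except ValueError as err:
    print(err)
```

The conditional pmf of \(Y\) given \(X=x\) would be handled the same way, with the roles of the two coordinates swapped.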
Let's get some practice using the definition to find the conditional probability distribution first of \(X\) given \(Y\), and then of \(Y\) given \(X\).
Example 19-2
Let \(X\) be a discrete random variable with support \(S_1=\{0,1\}\), and let \(Y\) be a discrete random variable with support \(S_2=\{0, 1, 2\}\). Suppose, in tabular form, that \(X\) and \(Y\) have the following joint probability distribution \(f(x,y)\):
\(f(x,y)\)  \(y=0\)  \(y=1\)  \(y=2\)  \(f_X(x)\)
\(x=0\)  1/8  2/8  1/8  4/8
\(x=1\)  2/8  1/8  1/8  4/8
\(f_Y(y)\)  3/8  3/8  2/8  1
What is the conditional distribution of \(X\) given \(Y\)? That is, what is \(g(x|y)\)?
Solution
Using the formula \(g(x|y)=\dfrac{f(x,y)}{f_Y(y)}\), with \(x=0\) and 1, and \(y=0, 1\), and 2, the conditional distribution of \(X\) given \(Y\) is, in tabular form:
\(g(x|y)\)  \(y=0\)  \(y=1\)  \(y=2\)
\(x=0\)  1/3  2/3  1/2
\(x=1\)  2/3  1/3  1/2
For example, the 1/3 in the \(x=0\) and \(y=0\) cell comes from:
\(g(0|0)=\dfrac{f(0,0)}{f_Y(0)}=\dfrac{1/8}{3/8}=\dfrac{1}{3}\)
And, the 2/3 in the \(x=1\) and \(y=0\) cell comes from:
\(g(1|0)=\dfrac{f(1,0)}{f_Y(0)}=\dfrac{2/8}{3/8}=\dfrac{2}{3}\)
The remaining conditional probabilities are calculated in a similar way. Note that the conditional probabilities in the \(g(x|y)\) table are color-coded as blue when \(y=0\), red when \(y=1\), and green when \(y=2\). That isn't necessary, of course, but rather just a device used to emphasize the concept that the probabilities that \(X\) takes on a particular value are given for the three different subpopulations defined by the value of \(Y\).
Note also that it shouldn't be surprising that for each of the three subpopulations defined by \(Y\), if you add up the probabilities that \(X=0\) and \(X=1\), you always get 1. This is just as we would expect if we were adding up the (marginal) probabilities over the support of \(X\). It's just that here we have to do it for each subpopulation rather than the entire population!
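The whole \(g(x|y)\) table can be generated at once rather than cell by cell. In the sketch below the joint probabilities are entered in eighths, as inferred from the worked cells in this lesson (for instance \(f(0,0)=1/8\), \(f(1,0)=2/8\), and \(f_Y(0)=3/8\)); treat those entries as an assumption if your copy of the joint table differs. The names `eighths`, `f_Y`, and `g` are illustrative.

```python
from fractions import Fraction

# Joint pmf f(x, y) for Example 19-2, numerators over 8 (assumed from the
# probabilities quoted in the worked solutions).
eighths = {(0, 0): 1, (0, 1): 2, (0, 2): 1,
           (1, 0): 2, (1, 1): 1, (1, 2): 1}
f = {xy: Fraction(n, 8) for xy, n in eighths.items()}
xs, ys = (0, 1), (0, 1, 2)

# Marginal pmf of Y: sum the joint pmf over x.
f_Y = {y: sum(f[x, y] for x in xs) for y in ys}

# Conditional pmf g(x|y) = f(x, y) / f_Y(y), stored as g[x, y].
g = {(x, y): f[x, y] / f_Y[y] for x in xs for y in ys}

# One line per subpopulation defined by Y, with its total probability.
for y in ys:
    column = [g[x, y] for x in xs]
    print(f"y={y}:", [str(p) for p in column], "sum =", sum(column))
```

Each printed line corresponds to one subpopulation, and each should total 1, which is the point made in the paragraph above.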
Let \(X\) be a discrete random variable with support \(S_1=\{0,1\}\), and let \(Y\) be a discrete random variable with support \(S_2=\{0, 1, 2\}\). Suppose, in tabular form, that \(X\) and \(Y\) have the following joint probability distribution \(f(x,y)\):
What is the conditional distribution of \(Y\) given \(X\)? That is, what is \(h(y|x)\)?
Solution
Using the formula \(h(y|x)=\dfrac{f(x,y)}{f_X(x)}\), with \(x=0\) and 1, and \(y=0, 1\), and 2, the conditional distribution of \(Y\) given \(X\) is, in tabular form:
\(h(y|x)\)  \(y=0\)  \(y=1\)  \(y=2\)
\(x=0\)  1/4  2/4  1/4
\(x=1\)  2/4  1/4  1/4
For example, the 1/4 in the \(x=0\) and \(y=0\) cell comes from:
\(h(0|0)=\dfrac{f(0,0)}{f_X(0)}=\dfrac{1/8}{4/8}=\dfrac{1}{4}\)
And, the 2/4 in the \(x=0\) and \(y=1\) cell comes from:
\(h(1|0)=\dfrac{f(0,1)}{f_X(0)}=\dfrac{2/8}{4/8}=\dfrac{2}{4}\)
And, the 1/4 in the \(x=0\) and \(y=2\) cell comes from:
\(h(2|0)=\dfrac{f(0,2)}{f_X(0)}=\dfrac{1/8}{4/8}=\dfrac{1}{4}\)
Again, the remaining conditional probabilities are calculated in a similar way. Note that the conditional probabilities in the \(h(y|x)\) table are color-coded as blue when \(x=0\) and red when \(x=1\). Again, that isn't necessary, but rather just a device used to emphasize the concept that the probabilities that \(Y\) takes on a particular value are given for the two different subpopulations defined by the value of \(X\).
Note also that it shouldn't be surprising that for each of the two subpopulations defined by \(X\), if you add up the probabilities that \(Y=0\), \(Y=1\), and \(Y=2\), you get a total of 1. This is just as we would expect if we were adding up the (marginal) probabilities over the support of \(Y\). It's just that here, again, we have to do it for each subpopulation rather than the entire population!
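The same bookkeeping, conditioning on \(X\) instead of \(Y\), can be sketched as follows. As before, the joint probabilities in eighths are an assumption inferred from the worked cells (\(f(0,0)=1/8\), \(f(0,1)=2/8\), \(f(0,2)=1/8\), \(f_X(0)=4/8\), and so on), and the names are illustrative.

```python
from fractions import Fraction

# Joint pmf f(x, y) for Example 19-2, numerators over 8 (assumed from the
# probabilities quoted in the worked solutions).
eighths = {(0, 0): 1, (0, 1): 2, (0, 2): 1,
           (1, 0): 2, (1, 1): 1, (1, 2): 1}
f = {xy: Fraction(n, 8) for xy, n in eighths.items()}
xs, ys = (0, 1), (0, 1, 2)

# Marginal pmf of X: sum the joint pmf over y.
f_X = {x: sum(f[x, y] for y in ys) for x in xs}

# Conditional pmf h(y|x) = f(x, y) / f_X(x), stored as h[y, x].
h = {(y, x): f[x, y] / f_X[x] for x in xs for y in ys}

# One line per subpopulation defined by X, with its total probability.
for x in xs:
    row = [h[y, x] for y in ys]
    print(f"x={x}:", [str(p) for p in row], "sum =", sum(row))
```

Note that `Fraction` reduces automatically, so the 2/4 entries of the lesson's table print as 1/2.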
Okay, now that we've determined \(h(y|x)\), the conditional distribution of \(Y\) given \(X\), and \(g(x|y)\), the conditional distribution of \(X\) given \(Y\), you might also want to note that \(g(x|y)\) does not equal \(h(y|x)\). In general, that is almost always the case.
So, we've used the definition to find the conditional distribution of \(X\) given \(Y\), as well as the conditional distribution of \(Y\) given \(X\). We should now have enough experience with conditional distributions to believe that the following two statements are true:

Conditional distributions are valid probability mass functions in their own right. That is, the conditional probabilities are between 0 and 1, inclusive:
\(0 \leq g(x|y) \leq 1 \qquad \text{and}\qquad 0 \leq h(y|x) \leq 1 \)
and, for each subpopulation, the conditional probabilities sum to 1:
\(\sum\limits_x g(x|y)=1 \qquad \text{and}\qquad \sum\limits_y h(y|x)=1 \)

In general, the conditional distribution of \(X\) given \(Y\) does not equal the conditional distribution of \(Y\) given \(X\). That is:
\(g(x|y)\ne h(y|x)\)
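Both statements can be checked numerically for the example of this section. The sketch below is one way to do so; the joint probabilities in eighths are assumed from the values quoted in the worked solutions, and `is_valid_pmf` is a helper name invented for the illustration.

```python
from fractions import Fraction

# Joint pmf for Example 19-2, numerators over 8 (assumed from the worked cells).
eighths = {(0, 0): 1, (0, 1): 2, (0, 2): 1,
           (1, 0): 2, (1, 1): 1, (1, 2): 1}
f = {xy: Fraction(n, 8) for xy, n in eighths.items()}
xs, ys = (0, 1), (0, 1, 2)
f_X = {x: sum(f[x, y] for y in ys) for x in xs}
f_Y = {y: sum(f[x, y] for x in xs) for y in ys}
g = {(x, y): f[x, y] / f_Y[y] for x in xs for y in ys}  # g(x|y)
h = {(y, x): f[x, y] / f_X[x] for x in xs for y in ys}  # h(y|x)

def is_valid_pmf(probs):
    """True if every probability lies in [0, 1] and they sum to exactly 1."""
    probs = list(probs)
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1

# Statement 1: each subpopulation's conditional probabilities form a valid pmf.
g_ok = all(is_valid_pmf(g[x, y] for x in xs) for y in ys)
h_ok = all(is_valid_pmf(h[y, x] for y in ys) for x in xs)
print(g_ok, h_ok)  # True True

# Statement 2: list the (x, y) pairs at which g(x|y) and h(y|x) disagree.
differs = [(x, y) for x in xs for y in ys if g[x, y] != h[y, x]]
print(differs)
```

For this particular joint distribution the two conditional pmfs disagree at every \((x,y)\) pair, because each is obtained by dividing the same joint probability by a different marginal.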
19.3  Conditional Means and Variances
Now that we've mastered the concept of a conditional probability mass function, we'll turn our attention to finding conditional means and variances. We'll start by giving formal definitions of the conditional mean and conditional variance when \(X\) and \(Y\) are discrete random variables. And then we'll end by actually calculating a few!
Definition. Suppose \(X\) and \(Y\) are discrete random variables. Then, the conditional mean of \(Y\) given \(X=x\) is defined as:
\(\mu_{Y|x}=E[Y|x]=\sum\limits_y y\,h(y|x)\)
And, the conditional mean of \(X\) given \(Y=y\) is defined as:
\(\mu_{X|y}=E[X|y]=\sum\limits_x x\,g(x|y)\)
The conditional variance of \(Y\) given \(X=x\) is:
\(\sigma^2_{Y|x}=E\{[Y-\mu_{Y|x}]^2|x\}=\sum\limits_y [y-\mu_{Y|x}]^2 h(y|x)\)
or, alternatively, using the usual shortcut:
\(\sigma^2_{Y|x}=E[Y^2|x]-\mu^2_{Y|x}=\left[\sum\limits_y y^2 h(y|x)\right]-\mu^2_{Y|x}\)
And, the conditional variance of \(X\) given \(Y=y\) is:
\(\sigma^2_{X|y}=E\{[X-\mu_{X|y}]^2|y\}=\sum\limits_x [x-\mu_{X|y}]^2 g(x|y)\)
or, alternatively, using the usual shortcut:
\(\sigma^2_{X|y}=E[X^2|y]-\mu^2_{X|y}=\left[\sum\limits_x x^2 g(x|y)\right]-\mu^2_{X|y}\)
As you can see by the formulas, a conditional mean is calculated much like a mean is, except you replace the probability mass function with a conditional probability mass function. And, a conditional variance is calculated much like a variance is, except you replace the probability mass function with a conditional probability mass function. Let's return to one of our examples to get practice calculating a few of these guys.
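For readers who want to see why the shortcut agrees with the definition, here is a short derivation for the conditional variance of \(Y\) given \(X=x\). It uses only the definition of the conditional mean and the fact that \(h(y|x)\) sums to 1 over \(y\); the argument for \(X\) given \(Y=y\) is identical with the roles swapped.

```latex
\begin{align}
\sigma^2_{Y|x} &= \sum\limits_y [y-\mu_{Y|x}]^2 h(y|x)\\
&= \sum\limits_y y^2 h(y|x) - 2\mu_{Y|x}\sum\limits_y y\,h(y|x) + \mu^2_{Y|x}\sum\limits_y h(y|x)\\
&= E[Y^2|x] - 2\mu^2_{Y|x} + \mu^2_{Y|x}\\
&= E[Y^2|x] - \mu^2_{Y|x}
\end{align}
```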
Example 19-3
Let \(X\) be a discrete random variable with support \(S_1=\{0,1\}\), and let \(Y\) be a discrete random variable with support \(S_2=\{0, 1, 2\}\). Suppose, in tabular form, that \(X\) and \(Y\) have the following joint probability distribution \(f(x,y)\):
What is the conditional mean of \(Y\) given \(X=x\)?
Solution
We previously determined that the conditional distribution of \(Y\) given \(X\) is:
Therefore, we can use it, that is, \(h(y|x)\), and the formula for the conditional mean of \(Y\) given \(X=x\) to calculate the conditional mean of \(Y\) given \(X=0\). It is:
\(\mu_{Y|0}=E[Y|0]=\sum\limits_y y\,h(y|0)=0\left(\dfrac{1}{4}\right)+1\left(\dfrac{2}{4}\right)+2\left(\dfrac{1}{4}\right)=1\)
And, we can use \(h(y|x)\) and the formula for the conditional mean of \(Y\) given \(X=x\) to calculate the conditional mean of \(Y\) given \(X=1\). It is:
\(\mu_{Y|1}=E[Y|1]=\sum\limits_y y\,h(y|1)=0\left(\dfrac{2}{4}\right)+1\left(\dfrac{1}{4}\right)+2\left(\dfrac{1}{4}\right)=\dfrac{3}{4}\)
Note that the conditional mean of \(Y\) given \(X=x\) depends on \(x\), and depends on \(x\) alone. You might want to think about these conditional means in terms of subpopulations again. The mean of \(Y\) is likely to depend on the subpopulation, as it does here. The mean of \(Y\) is 1 for the \(X=0\) subpopulation, and the mean of \(Y\) is \(\frac{3}{4}\) for the \(X=1\) subpopulation. Intuitively, this dependence should make sense. Rather than calculating the average weight of an adult, for example, you would probably want to calculate the average weight for the subpopulation of females and the average weight for the subpopulation of males, because the average weight no doubt depends on the subpopulation!
What is the conditional mean of \(X\) given \(Y=y\)?
Solution
We previously determined that the conditional distribution of \(X\) given \(Y\) is:
As the conditional distribution of \(X\) given \(Y\) suggests, there are three subpopulations here, namely the \(Y=0\) subpopulation, the \(Y=1\) subpopulation and the \(Y=2\) subpopulation. Therefore, we have three conditional means to calculate, one for each subpopulation. Now, we can use \(g(x|y)\) and the formula for the conditional mean of \(X\) given \(Y=y\) to calculate the conditional mean of \(X\) given \(Y=0\). It is:
\(\mu_{X|0}=E[X|0]=\sum\limits_x x\,g(x|0)=0\left(\dfrac{1}{3}\right)+1\left(\dfrac{2}{3}\right)=\dfrac{2}{3}\)
And, we can use \(g(x|y)\) and the formula for the conditional mean of \(X\) given \(Y=y\) to calculate the conditional mean of \(X\) given \(Y=1\). It is:
\(\mu_{X|1}=E[X|1]=\sum\limits_x x\,g(x|1)=0\left(\dfrac{2}{3}\right)+1\left(\dfrac{1}{3}\right)=\dfrac{1}{3}\)
And, we can use \(g(x|y)\) and the formula for the conditional mean of \(X\) given \(Y=y\) to calculate the conditional mean of \(X\) given \(Y=2\). It is:
\(\mu_{X|2}=E[X|2]=\sum\limits_x x\,g(x|2)=0\left(\dfrac{1}{2}\right)+1\left(\dfrac{1}{2}\right)=\dfrac{1}{2}\)
Note that the conditional mean of \(X\) given \(Y=y\) depends on \(y\), and depends on \(y\) alone. The mean of \(X\) is \(\frac{2}{3}\) for the \(Y=0\) subpopulation, the mean of \(X\) is \(\frac{1}{3}\) for the \(Y=1\) subpopulation, and the mean of \(X\) is \(\frac{1}{2}\) for the \(Y=2\) subpopulation.
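All five conditional means in this example can be computed in one pass. The following sketch assumes the joint probabilities in eighths inferred from the worked solutions, and the function names `mean_Y_given_X` and `mean_X_given_Y` are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

# Joint pmf for Example 19-3, numerators over 8 (assumed from the worked cells).
eighths = {(0, 0): 1, (0, 1): 2, (0, 2): 1,
           (1, 0): 2, (1, 1): 1, (1, 2): 1}
f = {xy: Fraction(n, 8) for xy, n in eighths.items()}
xs, ys = (0, 1), (0, 1, 2)
f_X = {x: sum(f[x, y] for y in ys) for x in xs}
f_Y = {y: sum(f[x, y] for x in xs) for y in ys}

def mean_Y_given_X(x):
    """E[Y | X = x]: sum over y of y * h(y|x), with h(y|x) = f(x, y) / f_X(x)."""
    return sum(y * f[x, y] / f_X[x] for y in ys)

def mean_X_given_Y(y):
    """E[X | Y = y]: sum over x of x * g(x|y), with g(x|y) = f(x, y) / f_Y(y)."""
    return sum(x * f[x, y] / f_Y[y] for x in xs)

for x in xs:
    print(f"E[Y | X={x}] =", mean_Y_given_X(x))
for y in ys:
    print(f"E[X | Y={y}] =", mean_X_given_Y(y))
```

Each function returns a number that depends only on the conditioning value passed in, which mirrors the remark that a conditional mean depends on the subpopulation and on nothing else.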
What is the conditional variance of \(Y\) given \(X=0\)?
Solution
We previously determined that the conditional distribution of \(Y\) given \(X\) is:
Therefore, we can use it, that is, \(h(y|x)\), and the formula for the conditional variance of \(Y\) given \(X=x\) to calculate the conditional variance of \(Y\) given \(X=0\). It is:
\begin{align} \sigma^2_{Y|0} &= E\{[Y-\mu_{Y|0}]^2|0\}=E\{[Y-1]^2|0\}=\sum\limits_y (y-1)^2 h(y|0)\\ &= (0-1)^2 \left(\dfrac{1}{4}\right)+(1-1)^2 \left(\dfrac{2}{4}\right)+(2-1)^2 \left(\dfrac{1}{4}\right)=\dfrac{1}{4}+0+\dfrac{1}{4}=\dfrac{2}{4} \end{align}
We could have alternatively used the shortcut formula. Doing so, we had better get the same answer:
\begin{align} \sigma^2_{Y|0} &= E[Y^2|0]-[\mu_{Y|0}]^2=\left[\sum\limits_y y^2 h(y|0)\right]-1^2\\ &= \left[(0)^2\left(\dfrac{1}{4}\right)+(1)^2\left(\dfrac{2}{4}\right)+(2)^2\left(\dfrac{1}{4}\right)\right]-1\\ &= \left[0+\dfrac{2}{4}+\dfrac{4}{4}\right]-1=\dfrac{2}{4} \end{align}
And we do! That is, no matter how we choose to calculate it, we get that the variance of \(Y\) is \(\frac{1}{2}\) for the \(X=0\) subpopulation.
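The agreement between the two formulas can also be confirmed by machine. The sketch below computes the conditional variance of \(Y\) given \(X=x\) both from the definition and from the shortcut. The joint probabilities in eighths are assumed from the values quoted in the worked solutions, the function name `var_Y_given_X` is made up, and the \(X=1\) value is an extra calculation that the lesson itself does not work out.

```python
from fractions import Fraction

# Joint pmf for Example 19-3, numerators over 8 (assumed from the worked cells).
eighths = {(0, 0): 1, (0, 1): 2, (0, 2): 1,
           (1, 0): 2, (1, 1): 1, (1, 2): 1}
f = {xy: Fraction(n, 8) for xy, n in eighths.items()}
xs, ys = (0, 1), (0, 1, 2)
f_X = {x: sum(f[x, y] for y in ys) for x in xs}

def var_Y_given_X(x):
    """Return (definition, shortcut) values of Var(Y | X = x)."""
    h = {y: f[x, y] / f_X[x] for y in ys}              # h(y|x)
    mu = sum(y * p for y, p in h.items())               # E[Y | x]
    by_definition = sum((y - mu) ** 2 * p for y, p in h.items())
    by_shortcut = sum(y ** 2 * p for y, p in h.items()) - mu ** 2
    return by_definition, by_shortcut

for x in xs:
    by_definition, by_shortcut = var_Y_given_X(x)
    print(f"Var(Y | X={x}): definition = {by_definition}, shortcut = {by_shortcut}")
```

Because the arithmetic is done with exact fractions, the two values agree exactly rather than only up to rounding.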