1.3 - Selecting Individuals at Random

Having assessed the normality of our population of sale prices by looking at the histogram and QQ-plot of sample sale prices, we now return to the task of making probability statements about the population. The crucial question at this point is whether the sample data are representative of the population for which we wish to make statistical inferences. If they are, we can make reliable statistical inferences about the population by considering properties of a model fit to the sample data, provided the model fits reasonably well.

We saw in Section 1.2 that a normal distribution model fits the home prices example reasonably well. However, a standard normal distribution is inappropriate here, because a standard normal distribution has a mean of 0 and a standard deviation of 1, whereas our sample data have a mean of 278.6033 and a standard deviation of 53.8656. We therefore need to consider more general normal distributions with a mean that can take any value and a standard deviation that can take any positive value (standard deviations cannot be negative).

Let Y represent the population values (sale prices in our example) and suppose that Y is normally distributed with mean (or expected value), E(Y), and standard deviation, SD(Y). [More traditional notation uses Greek letters, μ and σ, for these quantities.] We can abbreviate this normal distribution as \(\text{Normal}(E(Y), SD(Y)^2)\), where the first number is the mean and the second number is the square of the standard deviation (also known as the variance). Then the population standardized Z-value, Z = (Y − E(Y)) / SD(Y), has a standard normal distribution with mean 0 and standard deviation 1. In symbols, \(Y \sim \text{Normal}(E(Y), SD(Y)^2)\) \(\iff\) \(Z = (Y - E(Y)) / SD(Y) \sim \text{Normal}(0, 1^2)\).
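The standardization result can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the mean of 280 and standard deviation of 50 are illustrative values assumed here, and the helper name `standardize` is hypothetical. It compares Pr(Y ≤ y) computed directly from the general normal distribution with Pr(Z ≤ z) computed from the standard normal distribution after converting y to a Z-value.

```python
from statistics import NormalDist

# Illustrative values assumed for this sketch (not estimated from data).
mean_y = 280.0
sd_y = 50.0

population = NormalDist(mu=mean_y, sigma=sd_y)    # Normal(E(Y), SD(Y)^2)
standard_normal = NormalDist(mu=0.0, sigma=1.0)   # Normal(0, 1^2)


def standardize(y, mean, sd):
    """Return the Z-value (y - mean) / sd."""
    return (y - mean) / sd


for y in (180.0, 280.0, 330.0, 380.0):
    z = standardize(y, mean_y, sd_y)
    p_original = population.cdf(y)        # Pr(Y <= y) on the original scale
    p_standard = standard_normal.cdf(z)   # Pr(Z <= z) on the standardized scale
    print(f"y = {y:6.1f}  z = {z:5.2f}  "
          f"Pr(Y <= y) = {p_original:.4f}  Pr(Z <= z) = {p_standard:.4f}")
```

The two probability columns should agree for every value of y, which is what the equivalence between Y and Z asserts.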

We are now ready to make a probability statement for the home prices example. Suppose that we would consider a home as being too expensive to buy if its sale price is higher than \(\$\)380,000. What is the probability of finding such an expensive home in our housing market? In other words, if we were to randomly select one home from the population of all homes, what is the probability that it has a sale price higher than \(\$\)380,000? To answer this question we need to make a number of assumptions. We’ve already decided that it is probably safe to assume that the population of sale prices (Price, measured in thousands of dollars) is approximately normal, but we don’t know the mean, E(Price), or the standard deviation, SD(Price), of the population of home prices. For now, let's assume that E(Price) = 280 and SD(Price) = 50 (fairly close to the sample mean of 278.6033 and sample standard deviation of 53.8656). (We’ll be able to relax these assumptions later in this lesson.) From the theoretical result above, Z = (Price − 280) / 50 has a standard normal distribution with mean 0 and standard deviation 1.

Next, to find the probability that a randomly selected Price is greater than 380, we perform some standard algebra on probability statements. In particular, if we write "the probability that a is bigger than b" as "Pr(a > b)," then we can make changes to a (such as adding or subtracting other quantities, or multiplying or dividing by positive quantities) as long as we do the same thing to b. It is perhaps easier to see how this works by example:

Pr(Price > 380) = Pr((Price − 280) / 50 > (380 − 280) / 50) = Pr(Z > 2.00).

The second equality follows since (Price − 280) / 50 is defined to be Z, which is a standard normal random variable with mean 0 and standard deviation 1. From the table in Section 1.2, the probability that a standard normal random variable is greater than 1.96 is 0.025. Thus, Pr(Z > 2.00) is slightly less than 0.025 (draw a picture of a normal density curve with 1.96 and 2.00 marked on the horizontal axis to convince yourself of this fact). In other words, there is slightly less than a 2.5% chance of finding an expensive home (> \(\$\)380,000) in our housing market, under the assumption that \(\text{Price} \sim \text{Normal}(280, 50^2)\).
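For readers who prefer software to a printed table, the same upper-tail probability can be computed directly. This is a minimal sketch using Python's standard-library `statistics.NormalDist`, under the same assumed values E(Price) = 280 and SD(Price) = 50 (in thousands of dollars); the variable names are chosen here for illustration only.

```python
from statistics import NormalDist

# Assumed population values from the text (thousands of dollars).
assumed_mean = 280.0
assumed_sd = 50.0
threshold = 380.0  # a home is "too expensive" above this price

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Standardize the threshold: (380 - 280) / 50 = 2.00
z = (threshold - assumed_mean) / assumed_sd

# Upper-tail probabilities: Pr(Z > z) = 1 - Pr(Z <= z)
p_expensive = 1.0 - standard_normal.cdf(z)
p_beyond_196 = 1.0 - standard_normal.cdf(1.96)

print(f"z = {z:.2f}")
print(f"Pr(Price > 380) = Pr(Z > {z:.2f}) = {p_expensive:.4f}")
print(f"Pr(Z > 1.96) = {p_beyond_196:.4f}")
```

The computed value of Pr(Z > 2.00) is roughly 0.023, which agrees with the conclusion that it is slightly less than 0.025.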

We can also turn these calculations around. For example, which value of Price has a probability of 0.025 to the right of it? To answer this, consider the following calculation (based on the fact that the probability a standard normal random variable is greater than 1.96 is 0.025):

Pr(Z > 1.96) = Pr((Price − 280) / 50 > 1.96) = Pr(Price > 1.96(50) + 280) = Pr(Price > 378).

So, the value 378 has a probability of 0.025 to the right of it. Another way of expressing this is that "the 97.5th percentile of the variable Price is \(\$\)378,000."
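The reverse calculation can also be sketched in code. The example below again assumes Price ∼ Normal(280, 50²) in thousands of dollars and uses the `inv_cdf` method of Python's standard-library `statistics.NormalDist` to find the Z-value with 0.025 to its right, then converts it back to the Price scale. The variable names are illustrative.

```python
from statistics import NormalDist

# Assumed population values from the text (thousands of dollars).
assumed_mean = 280.0
assumed_sd = 50.0
upper_tail = 0.025  # probability to the right of the value we want

# Z-value with 0.025 to its right, i.e. the 97.5th percentile of Z.
z_975 = NormalDist().inv_cdf(1.0 - upper_tail)

# Undo the standardization: Price = Z * SD(Price) + E(Price)
price_975 = z_975 * assumed_sd + assumed_mean

# Cross-check: ask for the percentile on the Price scale directly.
price_direct = NormalDist(mu=assumed_mean, sigma=assumed_sd).inv_cdf(1.0 - upper_tail)

print(f"z for the 97.5th percentile: {z_975:.4f}")
print(f"97.5th percentile of Price:  {price_975:.2f}")
print(f"direct calculation:          {price_direct:.2f}")
```

Both routes should give a value very close to 378, matching the hand calculation that uses the rounded Z-value of 1.96.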