# Lesson 10: Hypothesis Testing

## Lesson Overview

In Lesson 2 we saw the value of random assignment in designed experiments. Random assignment alleviates the bias that might cause a systematic difference between groups unrelated to the treatment itself. Precautions like blinding that ensure that the subjects are treated the same during the experiment then leave us with just two possibilities for the cause of differences seen between groups. Either:

• the treatment was effective in producing the changes (the research hypothesis), or
• differences were just the result of the luck of the draw (the null hypothesis).

This shows the importance of addressing the concept of statistical significance. If it is very unlikely that the results of a randomized experiment are just the result of random chance, then we are left with the treatment itself as the probable cause of any relationship seen. Even in an observational study, being able to show that random chance is a poor explanation of the data is still good evidence for a true association in the population (even though it is poor evidence of causality).

This lesson focuses on statistical hypothesis testing. In a significance test, you carry out a probability calculation assuming the null hypothesis is true to see if random chance is a plausible explanation for the data. Let's illustrate the process with an example.

## Example 10.1

Physical theory suggests that when a coin is spun on a table (rather than flipped in the air) the probability it lands heads up is less than 0.5. We are hesitant to believe this without proof.

To test the theory we carry out an experiment and independently spin a penny 100 times, getting 37 heads and 63 tails. Thus, the observed proportion of heads is 37 / 100 = 0.37.

We have two possible explanations for the data:

Null Hypothesis: The data is merely a reflection of chance variation. The probability of heads when a penny is spun is really p = 0.5.

vs.

Alternative Hypothesis: The probability of heads when a penny is spun is really < 0.5.

A statistical hypothesis test is designed to answer the question: "Does the Null Hypothesis provide a reasonable explanation of the data?"

To answer this question we carry out a probability calculation. First, we can calculate a

Test Statistic = a measure of the difference between the data and what is expected when the null hypothesis is true.

In our example, the null hypothesis says the proportion of heads in 100 spins would closely follow a normal distribution centered at p = 0.5. So, if the null hypothesis is true, we expect the proportion of heads to be 0.5, give or take a standard deviation of

$\sqrt{\frac{0.5(1-0.5)}{100}}=0.05$

Further, we can see how unusual our data is if the null hypothesis is true by finding the standard score z for the test statistic and using the normal curve:

$z = (0.37-0.5)/0.05 = -2.6$

How unusual is the value we got, assuming the null hypothesis (i.e., the real proportion is 0.5) is true? We know that standard scores of -2.6 or lower only happen about 0.5% of the time. So the null hypothesis provides a poor explanation for our data. This would seem to provide strong evidence that spinning a coin has less than a 50% chance of landing heads.
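The calculation in Example 10.1 can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative, and the exact binomial check at the end is an addition that is not part of the lesson's normal-curve calculation.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p0 = 100, 37, 0.5       # data and null value from Example 10.1

p_hat = heads / n                 # observed proportion of heads
se = sqrt(p0 * (1 - p0) / n)      # standard deviation of p_hat if the null is true
z = (p_hat - p0) / se             # standard score of the observed proportion

# Area of the standard normal curve at or below z
p_normal = NormalDist().cdf(z)

# Extra check (an assumption of this sketch, not in the lesson):
# exact chance of 37 or fewer heads when X ~ Binomial(100, 0.5)
p_exact = sum(comb(n, k) for k in range(heads + 1)) * p0 ** n

print(f"z = {z:.2f}")
print(f"normal-curve p-value = {p_normal:.4f}")
print(f"exact binomial p-value = {p_exact:.4f}")
```

Both calculations give a probability of roughly half a percent, so the normal approximation and the exact count agree on the conclusion here.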

## Objectives

After successfully completing this lesson, you should be able to:

• Formulate appropriate null and alternative hypotheses.
• Identify the type 1 and the type 2 error in the context of the problem.
• Use the four basic steps to carry out a significance test in some basic situations.
• Interpret a p-value in terms of the problem.
• State an appropriate conclusion for a hypothesis test.

# 10.1 - Setting the Hypotheses: Examples

A significance test examines whether the null hypothesis provides a plausible explanation of the data. The null hypothesis itself does not involve the data. It is a statement about a parameter (a numerical characteristic of the population). These population values might be proportions or means or differences between means or proportions or correlations or odds ratios or any other numerical summary of the population. The alternative hypothesis is typically the research hypothesis of interest. Here are some examples.

## Example 10.2: Hypotheses with One Sample of One Categorical Variable

About 10% of the human population is left-handed. Suppose a researcher at Penn State speculates that students in the College of Arts and Architecture are more likely to be left-handed than people found in the general population. We have only one sample, since we will be comparing the population proportion for these students (estimated from the sample) to a known value for the general population.

• Research Question: Are artists more likely to be left-handed than people found in the general population?
• Response Variable: Classification of the student as either right-handed or left-handed
##### State Null and Alternative Hypotheses
• Null Hypothesis: Students in the College of Arts and Architecture are no more likely to be left-handed than people in the general population (population percent of left-handed students in the College of Arts and Architecture = 10% or p = .10).
• Alternative Hypothesis: Students in the College of Arts and Architecture are more likely to be left-handed than people in the general population (population percent of left-handed students in the College of Arts and Architecture > 10% or p > .10). This is a one-sided alternative hypothesis.

## Example 10.3: Hypotheses with One Sample of One Measurement Variable

A generic brand of the anti-histamine Diphenhydramine markets a capsule with a 50 milligram dose. The manufacturer is worried that the machine that fills the capsules has come out of calibration and is no longer creating capsules with the appropriate dosage.

• Research Question: Does the data suggest that the population mean dosage of this brand is different than 50 mg?
• Response Variable: dosage of the active ingredient found by a chemical assay.
##### State Null and Alternative Hypotheses
• Null Hypothesis: On the average, the dosage sold under this brand is 50 mg (population mean dosage = 50 mg).
• Alternative Hypothesis: On the average, the dosage sold under this brand is not 50 mg (population mean dosage ≠ 50 mg). This is a two-sided alternative hypothesis.

## Example 10.4: Hypotheses with Two Samples of One Categorical Variable

Many people are starting to prefer vegetarian meals on a regular basis. Specifically, a researcher believes that females are more likely than males to eat vegetarian meals on a regular basis.

• Research Question: Does the data suggest that females are more likely than males to eat vegetarian meals on a regular basis?
• Response Variable: Classification of whether or not a person eats vegetarian meals on a regular basis
• Explanatory (Grouping) Variable: Sex
##### State Null and Alternative Hypotheses
• Null Hypothesis: There is no sex effect regarding those who eat vegetarian meals on a regular basis (population percent of females who eat vegetarian meals on a regular basis = population percent of males who eat vegetarian meals on a regular basis or $$p_{\text{females}} = p_{\text{males}}$$).
• Alternative Hypothesis: Females are more likely than males to eat vegetarian meals on a regular basis (population percent of females who eat vegetarian meals on a regular basis > population percent of males who eat vegetarian meals on a regular basis or $$p_{\text{females}} > p_{\text{males}}$$). This is a one-sided alternative hypothesis.

## Example 10.5: Hypotheses with Two Samples of One Measurement Variable

Obesity is a major health problem today. Research is starting to show that people may be able to lose more weight on a low carbohydrate diet than on a low fat diet.

• Research Question: Does the data suggest that, on the average, people are able to lose more weight on a low carbohydrate diet than on a low fat diet?
• Response Variable: Weight loss (pounds)
• Explanatory (Grouping) Variable: Type of diet
##### State Null and Alternative Hypotheses
• Null Hypothesis: There is no difference in the mean amount of weight loss when comparing a low carbohydrate diet with a low fat diet (population mean weight loss on a low carbohydrate diet = population mean weight loss on a low fat diet).
• Alternative Hypothesis: The mean weight loss should be greater for those on a low carbohydrate diet when compared with those on a low fat diet (population mean weight loss on a low carbohydrate diet > population mean weight loss on a low fat diet). This is a one-sided alternative hypothesis.

## Example 10.6: Hypotheses about the relationship between Two Categorical Variables

• Research Question: Do the odds of having a stroke increase if you inhale second hand smoke? In a case-control study, non-smoking stroke patients and controls of the same age and occupation are asked if someone in their household smokes.
• Variables: There are two different categorical variables (Stroke patient vs control and whether the subject lives in the same household as a smoker). Living with a smoker (or not) is the natural explanatory variable and having a stroke (or not) is the natural response variable in this situation.
##### State Null and Alternative Hypotheses
• Null Hypothesis: There is no relationship between whether or not a person has a stroke and whether or not a person lives with a smoker (odds ratio between stroke and second-hand smoke situation is = 1).
• Alternative Hypothesis: There is a positive relationship between living with a smoker and having a stroke (odds ratio between stroke and second-hand smoke situation is > 1). This is a one-sided alternative hypothesis.

Note!

This research question might also be addressed like Example 10.4 by making the hypotheses about comparing the proportion of stroke patients that live with smokers to the proportion of controls that live with smokers.

## Example 10.7: Hypotheses about the relationship between Two Measurement Variables

• Research Question: A financial analyst believes there might be a positive association between the change in a stock's price and the amount of the stock purchased by non-management employees the previous day (stock trading by management being under "insider-trading" regulatory restrictions).
• Variables: Daily price change information (the response variable) and previous day stock purchases by non-management employees (explanatory variable). These are two different measurement variables.
##### State Null and Alternative Hypotheses
• Null Hypothesis: The correlation between the daily stock price change (\$) and the daily stock purchases by non-management employees (\$) = 0.
• Alternative Hypothesis: The correlation between the daily stock price change (\$) and the daily stock purchases by non-management employees (\$) > 0. This is a one-sided alternative hypothesis.

## Example 10.8: Hypotheses about comparing the relationship between Two Measurement Variables in Two Samples

• Research Question: Is there a linear relationship between the amount of the bill (\$) at a restaurant and the tip (\$) that was left? Is the strength of this association different for family restaurants than for fine dining restaurants?
• Variables: There are two different measurement variables. The size of the tip would depend on the size of the bill so the amount of the bill would be the explanatory variable and the size of the tip would be the response variable.
##### State Null and Alternative Hypotheses
• Null Hypothesis: The correlation between the amount of the bill (\$) at a restaurant and the tip (\$) that was left is the same at family restaurants as it is at fine dining restaurants.
• Alternative Hypothesis: The correlation between the amount of the bill (\$) at a restaurant and the tip (\$) that was left is different at family restaurants than it is at fine dining restaurants. This is a two-sided alternative hypothesis.

# 10.2 - Steps Used in a Hypothesis Test

Regardless of the type of hypothesis being considered, the process of carrying out a significance test is the same and relies on four basic steps:

1. Step 1: State the null and alternative hypotheses

State the null and alternative hypotheses (see section 10.1). Also think about the type 1 error (rejecting a true null) and type 2 error (failing to reject a false null) possibilities at this time and how serious each mistake would be in terms of the problem.

2. Step 2: Collect and summarize the data

Collect and summarize the data so that a test statistic can be calculated. A test statistic is a summary of the data that measures the difference between what is seen in the data and what would be expected if the null hypothesis were true. It is typically standardized so that a p-value can be obtained from a reference distribution like the normal curve.

3. Step 3: Use the test statistic to find the p-value

Use the test statistic to find the p-value. The p-value represents the likelihood of getting our test statistic or any test statistic more extreme if, in fact, the null hypothesis is true.

• For a one-sided "greater than" alternative hypothesis, the "more extreme" part of the interpretation refers to test statistic values larger than the test statistic given.
• For a one-sided "less than" alternative hypothesis, the "more extreme" part of the interpretation refers to test statistic values smaller than the test statistic given.
• For a two-sided "not equal to" alternative hypothesis, the "more extreme" part of the interpretation refers to test statistic values that are farther away from the null hypothesis than the test statistic given at either the upper end or lower end of the reference distribution (both "tails").

4. Step 4: Interpret the p-value

Interpret what the p-value is telling you and make a decision using the p-value. Does the null hypothesis provide a reasonable explanation of the data or not? If not, the result is statistically significant and we have evidence favoring the alternative. State a conclusion in terms of the problem.

#### Common Decision Rules seen in the literature

• If the p-value ≤ .05, we often see scientists declare their data to be "significant".
• If the p-value ≤ .01, we often see scientists declare their data to be "highly significant".
• If the p-value > .05, we often see scientists declare their data to be "not significant".

However, such cut-offs are arbitrary and we should not view data any differently when we see a p-value of 0.049 versus when we see a p-value of 0.051. There is no magic in the 0.05 value.
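The three meanings of "more extreme" described in Step 3 can be sketched as a small helper. This is only an illustration that assumes the reference distribution is the standard normal curve; the function name `p_value` and the labels `"greater"`, `"less"`, and `"two-sided"` are hypothetical choices made for this sketch.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """Area of the standard normal curve at least as extreme as z.

    `alternative` is a hypothetical label: "greater", "less", or "two-sided".
    """
    lower = NormalDist().cdf(z)       # area at or below z
    if alternative == "greater":
        return 1 - lower              # upper tail only
    if alternative == "less":
        return lower                  # lower tail only
    if alternative == "two-sided":
        return 2 * min(lower, 1 - lower)   # both tails
    raise ValueError(f"unknown alternative: {alternative!r}")

# The same test statistic gives different p-values under different alternatives.
for alt in ("greater", "less", "two-sided"):
    print(alt, round(p_value(2.0, alt), 4))
```

Note that the helper returns the p-value itself rather than a "significant / not significant" label, in keeping with the caution above about treating 0.05 as a magic cut-off.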

## Example 10.9: Left Handed Artists: (continuation of example 10.2)

About 10% of the human population is left-handed. A researcher at Penn State speculates that students in the College of Arts and Architecture are more likely to be left-handed than people in the general population. A random sample of 100 students in the College of Arts and Architecture is obtained and 18 of these students were found to be left-handed.

Research Question: Are artists more likely to be left-handed than people in the general population?

1. Step 1: State Null and Alternative Hypotheses
• Null Hypothesis: Population proportion of left-handed students in the College of Arts and Architecture = 0.10 (p = 0.10).
• Alternative Hypothesis: Population proportion of left-handed students in the College of Arts and Architecture > 0.10 (p > 0.10).

Now that you know the null and alternative hypotheses, did you think about what the type 1 and type 2 errors are? It is important to note that Step 1 is before we even collect data. Identifying these errors helps to improve the design of your research study. Let's write them out:

• Type 1 error: Claim artists are more likely to be left-handed than people in the general population when in truth they are not more likely.
• Type 2 error: Fail to claim artists are more likely to be left-handed than people in the general population when they are in fact more likely.

In this case, the consequences of these two errors are fairly similar (e.g. installing more or fewer left-handed desks in classrooms than are needed).

2. Step 2: Collect and summarize the data so that a test statistic can be calculated.

In the sample of 100 students listed above, the sample proportion is 18 / 100 = 0.18. The hypothesis test will determine whether or not the null hypothesis that p = 0.1 provides a plausible explanation for the data. If not, we will see this as evidence that the proportion of left-handed Arts and Architecture students is greater than 0.10.

If the null hypothesis is true then the standard error of the sample proportion would be $$\sqrt{\frac{0.1(1-0.1)}{100}} = 0.03$$ and the sample proportion would follow the normal curve. Thus, we can use the standard score z = (0.18-0.10) / 0.03 = 2.67 as our test statistic.

3. Step 3: Use the test statistic to find the p-value.

Using the normal curve table for the Z-value of 2.67 we find the p-value to be about 0.004. Notice that the one-sided alternative hypothesis says to watch out for large values so we look at the percentage of the normal curve above 2.67 to get the p-value. Interpretation of the p-value. The likelihood of getting our test statistic of 2.67 or any higher value, if in fact, the null hypothesis is true, is 0.004.

4. Step 4: Make a decision using the p-value.

Since the p-value of 0.004 is so small, the null hypothesis provides a very poor explanation of the data. We find good evidence that the population proportion of left-handed students in the College of Arts and Architecture exceeds 0.10.

Now that we have made our decision, we are only at risk of making a type 1 error. It is not possible at this point to make a type 2 error because we rejected the null hypothesis.
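The arithmetic in Steps 2 and 3 of Example 10.9 can be checked with a few lines of code. This is a minimal sketch using the normal curve from the Python standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, left_handed, p0 = 100, 18, 0.10    # sample size, count, and null value

p_hat = left_handed / n               # sample proportion
se = sqrt(p0 * (1 - p0) / n)          # standard error if the null is true
z = (p_hat - p0) / se                 # standardized test statistic

# One-sided "greater than" alternative: area of the normal curve above z
p_val = 1 - NormalDist().cdf(z)

print(f"sample proportion = {p_hat:.2f}")
print(f"standard error = {se:.3f}")
print(f"z = {z:.2f}, p-value = {p_val:.4f}")
```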

## Example 10.10: The Weight of McDonald's French Fries in Japan

After receiving complaints from McDonald's customers in Japan about the amount of french fries being served, the online news magazine "Rocket News" decided to test the actual weight of the fries served at a particular Japanese McDonald's restaurant. According to the Rocket News article, the official weight standard set by McDonald's of Japan is for a medium-sized fries to weigh 135 grams. The publication weighed the fries from ten different medium fries they purchased and found the average weight of the fries in their sample to be 130 grams with a standard deviation of 9 grams.

Research Question: Does the data suggest that the medium fries from this McDonald's in Japan are underpacked?

1. Step 1: State Null and Alternative Hypotheses.
• Null Hypothesis: Population mean weight of medium fries = 135 grams
• Alternative Hypothesis: Population mean weight of medium fries < 135 grams
2. Step 2: Collect and summarize the data so that a test statistic can be calculated.

The sample mean weight was 130 grams. Also, the sample standard deviation was 9 grams so the standard error of the mean is found to be $$\frac{9}{\sqrt{10}} = 2.85$$ grams. The test statistic would be the standardized value (130-135) / 2.85 = -1.76.

3. Step 3: Use the test statistic to find the p-value.

Since the sample size is only 10, the sample standard deviation would be an unreliable estimate of the population standard deviation so the normal curve would not be appropriate to use as the reference distribution to find the p-value. In this case, the t curve would be used instead and it turns out that the percentage of a t-curve below -1.76 when you have a sample size of 10 is about 6%. Interpretation of the p-value. The likelihood of getting our test statistic of -1.76 or any smaller value, if in fact, the null hypothesis is true, is about 6%.

4. Step 4: Make a decision using the p-value.

Since the p-value is around 6% we are near the border of what people often use as a cutoff for declaring a significant result. Given the amount of variability from one package of fries to the next, there is a reasonable chance that we would see a sample average like this even if the restaurant met the official standard weight on average.
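The t-curve calculation in Example 10.10 can be sketched in code. The Python standard library has no built-in t distribution, so this sketch approximates the tail area by numerically integrating the t density with Simpson's rule; the helper names `t_pdf` and `t_lower_tail` are hypothetical, and the use of 9 degrees of freedom (sample size minus one) is the usual convention for a one-sample t test rather than something stated in the text.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_lower_tail(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about zero."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # area between 0 and |x|
    return 0.5 - central if x < 0 else 0.5 + central

n, sample_mean, sample_sd, mu0 = 10, 130, 9, 135

se = sample_sd / sqrt(n)               # standard error of the mean
t_stat = (sample_mean - mu0) / se      # standardized test statistic
p_val = t_lower_tail(t_stat, df=n - 1) # one-sided "less than" alternative

print(f"standard error = {se:.2f}")
print(f"t = {t_stat:.2f}, p-value = {p_val:.3f}")
```

If a statistics package is available, its t distribution functions would normally replace the hand-written integration used here.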

It is important to remember in carrying out the mechanics of a significance test that you are only doing a probability calculation assuming the null hypothesis is true. Because the calculation is done under that assumption, it cannot say anything about the chances that the null hypothesis or the alternative hypothesis are true.

# 10.3 - Tests for Differences

Significance tests are often used to examine the difference between groups in comparative experiments and observational studies. We still use the same four basic steps to carry out the test. Here are two examples.

## Example 10.11: Biases in Academic Hiring

In a September 2014 paper in the Proceedings of the National Academy of Sciences, researchers from Cornell University examined how the lifestyles of job candidates might affect how they are evaluated by hiring committees for academic jobs in the sciences. In this experiment 144 professors on hiring committees (80 men and 64 women) were shown two applicant files with equivalent qualifications in terms of published research, teaching abilities, and professional service. However, one file was for a divorced female candidate with two children while the other file was for a married male candidate with two children and a non-working spouse. The results of the hiring preferences exhibited are given in Table 10.1 (note - the actual research report examined a wide variety of gender and lifestyle cases - here we are only showing one comparison studied).

| | Divorced Female Candidate Preferred | Married Male Candidate Preferred | Totals |
|---|---|---|---|
| Female Evaluators | 45 (70.3%) | 19 (29.7%) | 64 |
| Male Evaluators | 34 (42.5%) | 46 (57.5%) | 80 |
| Totals | 79 (54.9%) | 65 (45.1%) | 144 |

Research Question: Does the gender of the evaluator affect the way they would view the fictional divorced female versus the fictional married male candidates?

1. Step 1: State Null and Alternative Hypotheses.
• Null Hypothesis: The gender of the evaluator does not affect the population proportion of evaluators who prefer the female candidate $$( p_{\text{females}} = p_{\text{males}})$$ or $$( p_{\text{females}} - p_{\text{males}}=0)$$
• Alternative Hypothesis: The gender of the evaluator does affect the population proportion of evaluators who prefer the female candidate $$( p_{\text{females}} \neq p_{\text{males}})$$ or $$( p_{\text{females}} - p_{\text{males}}\neq 0)$$ This is a two sided alternative.

2. Step 2: Collect and summarize the data so that a test statistic can be calculated.

The sample proportion for the female evaluators was 0.703 while the sample proportion for the male evaluators was 0.425; a difference of 0.703 - 0.425 = 0.278. If the null hypothesis is true then $$p_{\text{females}} = p_{\text{males}}$$ and the best estimate of this common overall probability of preferring the divorced female candidate would be 0.549. Thus, the standard error for the difference in proportions under the null hypothesis would be

$$\sqrt{\frac{0.549(0.451)}{64} + \frac{0.549(0.451)}{80}}= 0.0834$$

Thus the standardized test statistic would be (0.278 - 0) / 0.0834 = 3.33.

3. Step 3: Use the test statistic to find the p-value.

Since the sample size is fairly large for each group the difference between the two sample proportions would follow the normal curve. Since this is a two-sided alternative, we calculate the p-value by considering both the area above 3.33 and the area below -3.33 on the normal curve (this comes out to about 0.00045 + 0.00045 = 0.0009). Interpretation of the p-value. The likelihood of getting our test statistic of 3.33 or any more extreme value (like those above it or below -3.33), if in fact, the null hypothesis is true, is about 0.0009; a bit less than one-tenth of one percent.

4. Step 4: Make a decision using the p-value.

Since the p-value is so small the results are highly significant; the null hypothesis provides a poor explanation of the data. We have good evidence that there is an association between the gender of the evaluator and the preferences they would hold between candidates with these gender and lifestyle combinations.
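The two-proportion calculation in Example 10.11 can be sketched from the counts in Table 10.1. This is a minimal illustration using the normal curve from the Python standard library; the variable names are illustrative. Because it works from the unrounded counts, its intermediate values may differ from the rounded figures in the text in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist

# Evaluators who preferred the divorced female candidate, by evaluator gender
x_f, n_f = 45, 64      # female evaluators
x_m, n_m = 34, 80      # male evaluators

p_f = x_f / n_f                        # sample proportion, female evaluators
p_m = x_m / n_m                        # sample proportion, male evaluators
pooled = (x_f + x_m) / (n_f + n_m)     # common proportion if the null is true

# Standard error of the difference in proportions under the null hypothesis
se = sqrt(pooled * (1 - pooled) / n_f + pooled * (1 - pooled) / n_m)
z = (p_f - p_m) / se                   # standardized test statistic

# Two-sided alternative: area in both tails of the normal curve
p_val = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"difference = {p_f - p_m:.3f}, pooled proportion = {pooled:.3f}")
print(f"standard error = {se:.4f}")
print(f"z = {z:.2f}, two-sided p-value = {p_val:.4f}")
```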

## Example 10.12: Don't Drink from the Blue Mug

In a November 2014 article in the journal Flavour, researchers from the University of Oxford in England and the Federation University of Australia investigated whether the perceived aroma of a cup of coffee might be affected by the color of the mug it is drunk from. In the experiment, 12 people were randomly selected to drink their coffee from a mug with a blue sleeve and 12 were randomly selected to drink from a mug with a white sleeve (Figure 10.1 shows the mugs used). The subjects were asked to subjectively rate the coffee's aroma on a hundred point scale. The coffee in the white sleeve mugs received an average rating of 57.33 with a standard deviation of 16.27 while the coffee in the blue sleeved mugs received an average rating of 35.57 with a standard deviation of 25.34.

Figure 10.1 Mugs used in Coffee Aroma Experiment

Research Question: Does the color of a mug affect the perceived aroma of the coffee inside the mug?

Explanatory variable: the color of the mug (blue or white)

Response variable: subjective rating of aroma on 100 point scale

1. Step 1: State Null and Alternative Hypotheses.
• Null Hypothesis: The color of the mug does not affect the population average aroma rating ($$\text{mean}_{\text{blue}} = \text{mean}_{\text{white}}$$ or $$\text{mean}_{\text{blue}} - \text{mean}_{\text{white}} = 0$$).
• Alternative Hypothesis: The color of the mug does affect the population average aroma rating ($$\text{mean}_{\text{blue}} \neq \text{mean}_{\text{white}}$$ or $$\text{mean}_{\text{blue}} - \text{mean}_{\text{white}} \neq 0$$). This is a two-sided alternative.
2. Step 2: Collect and summarize the data so that a test statistic can be calculated.

The sample mean for the white cups was 57.33 while the sample mean for the blue cups was 35.57; a difference of 57.33 - 35.57 = 21.76. The standard error of the mean for the white cups was $$\frac{16.27}{\sqrt{12}}= 4.70$$ while the standard error for the blue cups was $$\frac{25.34}{\sqrt{12}}= 7.32$$. Thus, the standard error for the difference in means would be $$\sqrt{(4.70)^{2} + (7.32)^{2}}= 8.70$$.

Finally, the standardized test statistic would be (21.76 - 0) / 8.70 = 2.5.

3. Step 3: Use the test statistic to find the p-value.

While the difference in the sample means might nearly follow the normal curve, the estimate of the standard error of the differences might be a bit off from the actual standard error so the use of the t-curve, rather than the normal curve, would be the appropriate reference distribution to calculate the p-value. Since this is a two-sided alternative, we calculate the p-value by considering both the area above 2.5 and the area below -2.5 on the t curve with sample sizes of 12 in each group (this comes out to about 0.01 + 0.01 = 0.02). Interpretation of the p-value. The likelihood of getting our test statistic of 2.5 or any more extreme value (like those above it or below -2.5), if, in fact, the null hypothesis is true, is about 2%.

4. Step 4: Make a decision using the p-value.

Since the p-value is less than 5%, the results might be considered significant; the null hypothesis provides a fairly poor explanation of the data. We have some evidence that there is an association between the color of the mug and the perceived aroma of the coffee.
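The calculation in Example 10.12 can be sketched as follows. The text does not say how many degrees of freedom its t curve uses, so this sketch assumes 22 (the two sample sizes added together, minus two), which is one common convention; other conventions give a somewhat different p-value. The helper names `t_pdf` and `t_two_sided` are hypothetical, and the tail area is approximated with Simpson's rule because the Python standard library has no built-in t distribution.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3            # area between 0 and |t|
    return 1 - 2 * central             # area in both tails

n = 12                                 # subjects per mug color
mean_white, sd_white = 57.33, 16.27
mean_blue, sd_blue = 35.57, 25.34

se_white = sd_white / sqrt(n)          # standard error, white mugs
se_blue = sd_blue / sqrt(n)            # standard error, blue mugs
se_diff = sqrt(se_white ** 2 + se_blue ** 2)   # standard error of the difference
t_stat = (mean_white - mean_blue) / se_diff

df = 2 * n - 2                         # ASSUMPTION: 22 degrees of freedom
p_val = t_two_sided(t_stat, df)

print(f"standard error of difference = {se_diff:.2f}")
print(f"t = {t_stat:.2f}, two-sided p-value = {p_val:.3f}")
```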

# 10.5 - Have Fun With It!

#### Have Fun With It! J.B. Landers ©

What P-Value Means

lyric ©2005 Lawrence Mark Lesser;
sing to the tune of "Row, Row, Row Your Boat"

It is key to know
What p-value means --
It's the chance
(with the null)
you obtain
data that's
At least that extreme!