# 10.3 - Tests for Differences


Significance tests are often used to examine the difference between groups in comparative experiments and observational studies. We still use the same four basic steps to carry out the test. Here are two examples.

## Example 10.11: Biases in Academic Hiring

In a September 2014 paper in the Proceedings of the National Academy of Sciences, researchers from Cornell University examined how the lifestyles of job candidates might affect how they are evaluated by hiring committees for academic jobs in the sciences. In this experiment, 144 professors on hiring committees (80 men and 64 women) were shown two applicant files with equivalent qualifications in terms of published research, teaching abilities, and professional service. However, one file was for a divorced female candidate with two children while the other file was for a married male candidate with two children and a non-working spouse. The hiring preferences of the evaluators are given in Table 10.1 (note: the actual research report examined a wide variety of gender and lifestyle cases; only one of the comparisons studied is shown here).

| | Divorced Female Candidate Preferred | Married Male Candidate Preferred | Totals |
|---|---|---|---|
| Female Evaluators | 45 (70.3%) | 19 (29.7%) | 64 |
| Male Evaluators | 34 (42.5%) | 46 (57.5%) | 80 |
| Totals | 79 (54.9%) | 65 (45.1%) | 144 |

Research Question: Does the gender of the evaluator affect the way they would view the fictional divorced female versus the fictional married male candidates?

1. Step 1: State Null and Alternative Hypotheses.
• Null Hypothesis: The gender of the evaluator does not affect the population proportion of evaluators who prefer the female candidate $$( p_{\text{females}} = p_{\text{males}})$$ or $$( p_{\text{females}} - p_{\text{males}}=0)$$
• Alternative Hypothesis: The gender of the evaluator does affect the population proportion of evaluators who prefer the female candidate $$( p_{\text{females}} \neq p_{\text{males}})$$ or $$( p_{\text{females}} - p_{\text{males}}\neq 0)$$. This is a two-sided alternative.

2. Step 2: Collect and summarize the data so that a test statistic can be calculated.

The sample proportion for the female evaluators was 0.703 while the sample proportion for the male evaluators was 0.425; a difference of 0.703 - 0.425 = 0.278. If the null hypothesis is true then $$p_{\text{females}} = p_{\text{males}}$$ and the best estimate of this common overall probability of preferring the divorced female candidate would be 79/144 = 0.549. Thus, the standard error for the difference in proportions under the null hypothesis would be

$$\sqrt{\frac{0.549(0.451)}{64} + \frac{0.549(0.451)}{80}}= 0.0834$$

Thus the standardized test statistic would be (0.278 - 0) / 0.0834 = 3.33.
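The arithmetic in Step 2 can be checked with a short script. This is a minimal sketch using only the counts from Table 10.1; the variable names are illustrative and not part of the original study.

```python
from math import sqrt

# Counts from Table 10.1: evaluators who preferred the divorced female candidate
x_female, n_female = 45, 64   # female evaluators
x_male, n_male = 34, 80       # male evaluators

p_female = x_female / n_female   # sample proportion, about 0.703
p_male = x_male / n_male         # sample proportion, 0.425

# Under the null hypothesis both groups share one proportion,
# estimated by pooling the two groups: 79 / 144, about 0.549
p_pooled = (x_female + x_male) / (n_female + n_male)

# Standard error of the difference, using the pooled proportion for both groups
se_diff = sqrt(p_pooled * (1 - p_pooled) / n_female
               + p_pooled * (1 - p_pooled) / n_male)

# Standardized test statistic: (observed difference - null value) / standard error
z = (p_female - p_male - 0) / se_diff

print(f"difference     = {p_female - p_male:.3f}")
print(f"standard error = {se_diff:.4f}")
print(f"z              = {z:.2f}")
```

Because the script carries unrounded values through the calculation, the standard error may differ from the hand calculation in the last decimal place, but the test statistic still rounds to 3.33.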

3. Step 3: Use the test statistic to find the p-value.

Since the sample size is fairly large for each group, the difference between the two sample proportions would approximately follow the normal curve. Since this is a two-sided alternative, we calculate the p-value by considering both the area above 3.33 and the area below -3.33 on the normal curve (this comes out to about 0.00045 + 0.00045 = 0.0009).

Interpretation of the p-value. The likelihood of getting our test statistic of 3.33 or any more extreme value (like those above it or below -3.33), if, in fact, the null hypothesis is true, is about 0.0009; a bit less than one-tenth of one percent.
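The two tail areas can also be computed directly instead of read from a normal table. The sketch below uses the complementary error function from Python's standard library; the helper name `two_sided_normal_p` is made up for this illustration.

```python
from math import erfc, sqrt

def two_sided_normal_p(z):
    """Area under the standard normal curve beyond |z| in both tails."""
    # One tail is P(Z > |z|) = 0.5 * erfc(|z| / sqrt(2)); doubling gives both tails
    return erfc(abs(z) / sqrt(2))

z = 3.33  # standardized test statistic from Step 2
p_value = two_sided_normal_p(z)
print(f"two-sided p-value = {p_value:.4f}")
```

The result should agree with the value of about 0.0009 quoted above.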

4. Step 4: Make a decision using the p-value.

Since the p-value is so small the results are highly significant; the null hypothesis provides a poor explanation of the data. We have good evidence that there is an association between the gender of the evaluator and the preferences they would hold between candidates with these gender and lifestyle combinations.

## Example 10.12: Don't Drink from the Blue Mug

In a November 2014 article in the journal Flavour, researchers from the University of Oxford in England and the Federation University of Australia investigated whether the perceived aroma of a cup of coffee might be affected by the color of the mug it is drunk from. In the experiment, 12 people were randomly selected to drink their coffee from a mug with a blue sleeve and 12 were randomly selected to drink from a mug with a white sleeve (Figure 10.1 shows the mugs used). The subjects were asked to subjectively rate the coffee's aroma on a 100-point scale. The coffee in the white-sleeved mugs received an average rating of 57.33 with a standard deviation of 16.27 while the coffee in the blue-sleeved mugs received an average rating of 35.57 with a standard deviation of 25.34.

Figure 10.1 Mugs used in Coffee Aroma Experiment

Research Question: Does the color of a mug affect the perceived aroma of the coffee inside the mug?

Explanatory variable: the color of the mug (blue or white)

Response variable: subjective rating of aroma on 100 point scale

1. Step 1: State Null and Alternative Hypotheses.
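• Null Hypothesis: The color of the mug does not affect the population average aroma rating $$( \text{mean}_{\text{blue}} = \text{mean}_{\text{white}})$$ or $$( \text{mean}_{\text{blue}} - \text{mean}_{\text{white}} = 0)$$.
• Alternative Hypothesis: The color of the mug does affect the population average aroma rating $$( \text{mean}_{\text{blue}} \neq \text{mean}_{\text{white}})$$ or $$( \text{mean}_{\text{blue}} - \text{mean}_{\text{white}} \neq 0)$$. This is a two-sided alternative.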
2. Step 2: Collect and summarize the data so that a test statistic can be calculated.

The sample mean for the white cups was 57.33 while the sample mean for the blue cups was 35.57; a difference of 57.33 - 35.57 = 21.76. The standard error of the mean for the white cups was $$\frac{16.27}{\sqrt{12}}= 4.70$$ while the standard error for the blue cups was $$\frac{25.34}{\sqrt{12}}= 7.32$$. Thus, the standard error for the difference in means would be $$\sqrt{(4.70)^{2} + (7.32)^{2}}= 8.70$$.

Finally, the standardized test statistic would be (21.76 - 0) / 8.70 = 2.5.
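The same calculation can be reproduced from the summary statistics reported in the article. This is a minimal sketch; the variable names are illustrative.

```python
from math import sqrt

# Summary statistics reported for each group of 12 subjects
n_white, mean_white, sd_white = 12, 57.33, 16.27
n_blue, mean_blue, sd_blue = 12, 35.57, 25.34

# Standard error of each sample mean: sd / sqrt(n)
se_white = sd_white / sqrt(n_white)
se_blue = sd_blue / sqrt(n_blue)

# Standard error of the difference between two independent sample means
se_diff = sqrt(se_white**2 + se_blue**2)

# Standardized test statistic: (observed difference - null value) / standard error
t_stat = (mean_white - mean_blue - 0) / se_diff

print(f"SE white = {se_white:.2f}, SE blue = {se_blue:.2f}")
print(f"SE of difference = {se_diff:.2f}")
print(f"test statistic = {t_stat:.2f}")
```

Small differences in the last decimal place relative to the hand calculation come from rounding the intermediate standard errors to two decimals.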

3. Step 3: Use the test statistic to find the p-value.

While the difference in the sample means might nearly follow the normal curve, the estimate of the standard error of the differences might be a bit off from the actual standard error so the use of the t-curve, rather than the normal curve, would be the appropriate reference distribution to calculate the p-value. Since this is a two-sided alternative, we calculate the p-value by considering both the area above 2.5 and the area below -2.5 on the t curve with sample sizes of 12 in each group (this comes out to about 0.01 + 0.01 = 0.02).

Interpretation of the p-value. The likelihood of getting our test statistic of 2.5 or any more extreme value (like those above it or below -2.5), if, in fact, the null hypothesis is true, is about 2%.
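The tail areas on the t curve can be approximated numerically. The text does not state which degrees of freedom it uses, so the sketch below assumes the Welch approximation, which is one common choice when the two standard deviations differ. The helper names `t_density` and `two_sided_t_p` are made up for this illustration, and the area is found by Simpson's rule rather than a statistics library.

```python
from math import gamma, pi, sqrt

def t_density(x, df):
    """Height of the t curve with df degrees of freedom at x."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_t_p(t, df, steps=2000):
    """Area beyond |t| in both tails, via Simpson's rule (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_zero_to_t = total * h / 3
    # The curve is symmetric, so the central area is twice the area from 0 to |t|
    return 1 - 2 * area_zero_to_t

# Assumption: Welch degrees of freedom computed from the summary statistics
n_white, sd_white = 12, 16.27
n_blue, sd_blue = 12, 25.34
v_white = sd_white**2 / n_white
v_blue = sd_blue**2 / n_blue
welch_df = (v_white + v_blue) ** 2 / (
    v_white**2 / (n_white - 1) + v_blue**2 / (n_blue - 1)
)

p_value = two_sided_t_p(2.5, welch_df)
print(f"Welch df = {welch_df:.1f}")
print(f"two-sided p-value = {p_value:.3f}")
```

Under this assumption the p-value comes out close to the 0.02 quoted above. Other conventions for the degrees of freedom, such as the smaller sample size minus one, would give a somewhat different value.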

4. Step 4: Make a decision using the p-value.

Since the p-value is less than 5%, the results might be considered significant; the null hypothesis provides a fairly poor explanation of the data. We have some evidence that there is an association between the color of the mug and the perceived aroma of the coffee.
