6: Hypothesis Testing, Part 2

Objectives

Upon successful completion of this lesson, you should be able to:

  • Identify Type I and Type II errors
  • Select an appropriate significance level (i.e., \(\alpha\) level) for a given scenario
  • Explain the problems associated with conducting multiple tests
  • Interpret the results of a hypothesis test in terms of practical significance
  • Distinguish between practical significance and statistical significance
  • Explain how changing different aspects of a research study would change the statistical power of the tests conducted
  • Compare and contrast confidence intervals and hypothesis tests

Last week you learned how to conduct a hypothesis test using randomization procedures in StatKey. This week we are going to delve a bit deeper into hypothesis testing. Concepts such as errors, significance (\(\alpha\)) levels, issues with multiple testing, practical significance, and statistical power apply to hypothesis tests for all of the parameters we have studied so far, and they will also apply to those we learn later in this course.


6.1 - Type I and Type II Errors

When conducting a hypothesis test there are two possible decisions: reject the null hypothesis or fail to reject the null hypothesis. Remember, though, that hypothesis testing uses data from a sample to make an inference about a population. When conducting a hypothesis test we do not know the population parameters, so in most cases we don't know whether our inference is correct or incorrect.

When we reject the null hypothesis there are two possibilities. There could really be a difference in the population, in which case we made a correct decision. Or, it is possible that there is not a difference in the population (i.e., \(H_0\) is true) but our sample was different from the hypothesized value due to random sampling variation. In that case we made an error. This is known as a Type I error.

When we fail to reject the null hypothesis there are also two possibilities. If the null hypothesis is really true, and there is not a difference in the population, then we made the correct decision. If there really is a difference in the population and we failed to reject the null hypothesis, then we made a Type II error.

Type I Error

Rejecting \(H_0\) when \(H_0\) is really true, denoted by \(\alpha\) ("alpha") and commonly set at 0.05

     \(\alpha=P(Type\;I\;error)\)

Type II Error

Failing to reject \(H_0\) when \(H_0\) is really false, denoted by \(\beta\) ("beta")

     \(\beta=P(Type\;II\;error)\)

Decision | Reality: \(H_0\) is true | Reality: \(H_0\) is false
Reject \(H_0\) (conclude \(H_a\)) | Type I error | Correct decision
Fail to reject \(H_0\) | Correct decision | Type II error

Example: Trial

A man goes to trial where he is being tried for the murder of his wife.

We can put it in a hypothesis testing framework. The hypotheses being tested are:

  • \(H_0\) : Not Guilty
  • \(H_a\) : Guilty

A Type I error is committed if we reject \(H_0\) when it is true. In other words, the man did not kill his wife, but he was found guilty and is punished for a crime he did not really commit.

A Type II error is committed if we fail to reject \(H_0\) when it is false. In other words, the man did kill his wife, but he was found not guilty and was not punished.

Example: Culinary Arts Study

A group of culinary arts students is comparing two methods for preparing asparagus: traditional steaming and a new frying method. They want to know if patrons of their school restaurant prefer their new frying method over the traditional steaming method. A sample of patrons are given asparagus prepared using each method and asked to select their preference. A statistical analysis is performed to determine if more than 50% of participants prefer the new frying method:

  • \(H_{0}\colon p = 0.50\)
  • \(H_{a}\colon p > 0.50\)

A Type I error occurs if they reject the null hypothesis and conclude that their new frying method is preferred when in reality it is not. This may occur if, by random sampling error, they happen to get a sample that prefers the new frying method more than the overall population does. If this does occur, the consequence is that the students will incorrectly believe that their new method of frying asparagus is superior to the traditional method of steaming.

A Type II error occurs if they fail to reject the null hypothesis and conclude that their new method is not superior when in reality it is. If this does occur, the consequence is that the students will incorrectly believe that their new method is no better than the traditional method when in reality it is preferred.


6.2 - Significance Levels

As we saw in the examples on the previous page, the consequences of Type I and Type II errors vary depending on the situation. Researchers take into account the consequences of each when they are setting their \(\alpha\) level before data are even collected.

An \(\alpha\) level of 0.05 is standard in many disciplines, such as the social sciences. There are some situations when a higher or lower \(\alpha\) level may be desirable. Pilot studies (smaller studies performed before a larger study) often use a higher \(\alpha\) level because their purpose is to gain information about the data that may be collected in a larger study; pilot studies are not typically used to make important decisions.

Studies in which making a Type I error would be more dangerous than making a Type II error may use smaller \(\alpha\) levels. For example, in medical research studies where making a Type I error could mean giving patients ineffective treatments, a smaller \(\alpha\) level may be set in order to reduce the likelihood of such a negative consequence. Lower \(\alpha\) levels mean that smaller p-values are needed to reject the null hypothesis; this makes it more difficult to reject the null hypothesis, but this also reduces the probability of committing a Type I error.


6.3 - Issues with Multiple Testing

If we are conducting a hypothesis test with an \(\alpha\) level of 0.05, then we are accepting a 5% chance of making a Type I error (i.e., rejecting the null hypothesis when the null hypothesis is really true). If we conducted 100 hypothesis tests at the 0.05 \(\alpha\) level in situations where the null hypotheses were really true, we would expect to reject the null, and thus make a Type I error, in about 5 of those tests.
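
To see how quickly this problem grows, note that the chance of making at least one Type I error across \(m\) independent tests is \(1-(1-\alpha)^m\). Below is a minimal Python sketch of this computation (the code and variable names are ours, for illustration):

  # Chance of at least one Type I error across m independent tests,
  # each conducted at level alpha, when every null hypothesis is true.
  alpha = 0.05
  for m in [1, 5, 10, 100]:
      family_wise_error = 1 - (1 - alpha) ** m
      print(f"{m:>3} tests: P(at least one Type I error) = {family_wise_error:.3f}")

With 100 such tests, the chance of at least one Type I error is about 0.994, which is why multiple testing must be handled carefully.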

Later in this course you will learn about some statistical procedures that may be used instead of performing multiple tests. For example, to compare the means of more than two groups you can use an analysis of variance ("ANOVA"). To compare the proportions of more than two groups you can conduct a chi-square goodness-of-fit test. 

Publication Bias

A related issue is publication bias. Research studies with statistically significant results are published much more often than studies without statistically significant results. This means that if 100 studies are performed in which there is really no difference in the population, the 5 or so studies that found statistically significant results may be published while the 95 that did not will not be. Thus, when you perform a review of the published literature you will only read about the studies that found statistically significant results; you will not find the studies that did not.

Quick Correction for Multiple Tests

One quick method for correcting for multiple tests is to divide the alpha level by the number of tests being conducted. For instance, if you are comparing three groups using a series of three pairwise tests, you could divide your overall alpha level ("family-wise alpha level") by three. If we were using a standard alpha level of 0.05, then our pairwise alpha level would be \(\frac{0.05}{3}=0.016667\). We would then compare each of our three p-values to 0.016667 to determine statistical significance. This is known as the Bonferroni method. It is one of the most conservative approaches to controlling for multiple tests (i.e., it is more likely to lead to a Type II error). Later in the course you will learn to use the Tukey method when comparing the means of three or more groups; that approach is often preferred because it is less conservative (i.e., more liberal).
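
As a sketch of the Bonferroni method just described, using three hypothetical p-values (the values are made up for illustration):

  # Bonferroni method: compare each p-value to alpha / (number of tests).
  alpha = 0.05
  p_values = [0.012, 0.030, 0.260]          # hypothetical pairwise p-values
  bonferroni_alpha = alpha / len(p_values)  # 0.05 / 3 = 0.016667
  for p in p_values:
      decision = "reject H0" if p <= bonferroni_alpha else "fail to reject H0"
      print(f"p = {p:.3f} vs {bonferroni_alpha:.6f}: {decision}")

Note that the test with \(p = 0.030\) would have been significant at the unadjusted 0.05 level but is not significant after the Bonferroni correction.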


6.4 - Practical Significance

In the last lesson, you learned how to identify statistically significant differences using hypothesis testing methods. If the p value is less than the \(\alpha\) level (typically 0.05), then the results are statistically significant. Results are said to be statistically significant when the difference between the hypothesized population parameter and observed sample statistic is large enough to conclude that it is unlikely to have occurred by chance. 

Practical significance refers to the magnitude of the difference, which is known as the effect size. Results are practically significant when the difference is large enough to be meaningful in real life. What is meaningful may be subjective and may depend on the context.

Note that statistical significance is directly impacted by sample size. Recall that there is an inverse relationship between sample size and the standard error (i.e., standard deviation of the sampling distribution). Very small differences will be statistically significant with a very large sample size. Thus, when results are statistically significant it is important to also examine practical significance. Practical significance is not directly influenced by sample size.
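
The simulation sketch below illustrates this sample-size effect for a single mean: the true difference is a trivial half point on a scale with a standard deviation of 100, yet it becomes statistically significant once the sample is large enough, while the effect size stays tiny throughout. This assumes NumPy and SciPy are available; all numbers are invented for illustration.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(7)
  true_mean, null_mean, sd = 500.5, 500, 100   # a trivially small true difference

  for n in [100, 10_000, 1_000_000]:
      sample = rng.normal(true_mean, sd, size=n)
      t_stat, p_value = stats.ttest_1samp(sample, null_mean)
      d = (sample.mean() - null_mean) / sample.std(ddof=1)   # effect size stays tiny
      print(f"n = {n:>9,}: p = {p_value:.4f}, Cohen's d = {d:.4f}")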

Example: Weight-Loss Program

Researchers are studying a new weight-loss program. Using a large sample they construct a 95% confidence interval for the mean amount of weight loss after six months on the program to be [0.12, 0.20]. All measurements were taken in pounds. Note that this confidence interval does not contain 0, so we know that their results were statistically significant at a 0.05 alpha level. However, most people would say that the results are not practically significant because a six-month weight-loss program should yield a mean weight loss much greater than the one observed in this study.

Effect Size

For some tests there are commonly used measures of effect size. For example, when comparing the difference in two means we often compute Cohen's \(d\) which is the difference between the two observed sample means in standard deviation units:

\[d=\frac{\overline x_1 - \overline x_2}{s_p}\]

Where \(s_p\) is the pooled standard deviation

\[s_p= \sqrt{\frac{(n_1-1)s_1^2 + (n_2 -1)s_2^2}{n_1+n_2-2}}\]

Below are commonly used standards when interpreting Cohen's \(d\):

Cohen's \(d\) | Interpretation
0 - 0.2 | Little or no effect
0.2 - 0.5 | Small effect size
0.5 - 0.8 | Medium effect size
0.8 or more | Large effect size

For a single mean, you can compute the difference between the observed mean and hypothesized mean in standard deviation units: \[d=\frac{\overline x - \mu_0}{s}\]
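
A minimal Python sketch of these effect-size formulas (the function and variable names are our own):

  import math

  def pooled_sd(n1, s1, n2, s2):
      """Pooled standard deviation of two independent samples."""
      return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

  def cohens_d_two_sample(xbar1, xbar2, n1, s1, n2, s2):
      """Difference between two sample means in pooled-SD units."""
      return (xbar1 - xbar2) / pooled_sd(n1, s1, n2, s2)

  def cohens_d_one_sample(xbar, mu0, s):
      """Difference between an observed and a hypothesized mean in SD units."""
      return (xbar - mu0) / s

  # Example: for the SAT study below, (506 - 500) / 100 = 0.06
  print(cohens_d_one_sample(506, 500, 100))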

For correlation and regression we can compute \(r^2\) which is known as the coefficient of determination. This is the proportion of shared variation. We will learn more about \(r^2\) when we study simple linear regression and correlation at the end of this course.

Example: SAT-Math Scores

Test Taking

Research question: Is the mean SAT-Math score at one college greater than the known population mean of 500?

\(H_0\colon \mu = 500\)

\(H_a\colon \mu >500\)

Data are collected from a random sample of 1,200 students at that college. In that sample, \(\overline{x}=506\) and the sample standard deviation was 100. A one-sample mean test was performed and the resulting p-value was 0.0188. Because \(p \leq \alpha\), the null hypothesis should be rejected. These results are statistically significant. There is evidence that the population mean is greater than 500.

But let's also consider practical significance. The difference between an SAT-Math score of 500 and an SAT-Math score of 506 is very small. With a standard deviation of 100, this difference is only \(\frac{506-500}{100}=0.06\) standard deviations. In most cases, this would not be considered practically significant.
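
For readers who want to reproduce these numbers from the summary statistics, here is a sketch assuming a standard one-sample t procedure; it reproduces the reported p-value approximately:

  import math
  from scipy import stats

  n, xbar, s, mu0 = 1200, 506, 100, 500
  se = s / math.sqrt(n)            # standard error of the mean
  t = (xbar - mu0) / se            # t is approximately 2.08
  p = stats.t.sf(t, df=n - 1)      # right-tailed p-value, approximately 0.019
  d = (xbar - mu0) / s             # effect size d = 0.06
  print(f"t = {t:.3f}, p = {p:.4f}, d = {d:.2f}")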

Example: Commute Times

Research question: Are the mean commute times different in Atlanta and St. Louis?

Descriptive Statistics: Commute Time

City | N | Mean | StDev
Atlanta | 500 | 29.110 | 20.718
St. Louis | 500 | 21.970 | 14.232

Using the dataset built in to StatKey, a two-tailed randomization test was conducted, resulting in a p-value < 0.001. Because the p-value is less than the 0.05 \(\alpha\) level, the null hypothesis is rejected and the results are said to be statistically significant.

Practical significance can be examined by computing Cohen's d. We'll use the equations from above:

\[d=\frac{\overline x_1 - \overline x_2}{s_p}\]

Where \(s_p\) is the pooled standard deviation

\[s_p= \sqrt{\frac{(n_1-1)s_1^2 + (n_2 -1)s_2^2}{n_1+n_2-2}}\]

First, we compute the pooled standard deviation:

\[s_p= \sqrt{\frac{(500-1)20.718^2 + (500-1)14.232^2}{500+500-2}}\]

\[s_p= \sqrt{\frac{(499)(429.236)+ (499)(202.550)}{998}}\]

\[s_p= \sqrt{\frac{214188.527+ 101072.362}{998}}\]

\[s_p= \sqrt{\frac{315260.853}{998}}\]

\[s_p= \sqrt{315.893}\]

\[s_p= 17.773\]

Note: The pooled standard deviation should always be between the two sample standard deviations.

Next, we can compute Cohen's d:

\[d=\frac{29.110-21.970}{17.773}\]

\[d=\frac{7.14}{17.773}\]

\[d= 0.402\]

The mean commute time in Atlanta was 0.402 standard deviations greater than the mean commute time in St. Louis. Using the guidelines for interpreting Cohen's d in the table above, this is a small effect size. 
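
The hand computation can be checked with the effect-size functions sketched in the previous section:

  # Reusing pooled_sd and cohens_d_two_sample from the sketch above.
  sp = pooled_sd(n1=500, s1=20.718, n2=500, s2=14.232)
  d = cohens_d_two_sample(29.110, 21.970, 500, 20.718, 500, 14.232)
  print(f"pooled SD = {sp:.3f}, Cohen's d = {d:.3f}")   # approximately 17.773 and 0.402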


6.5 - Power

The probability of rejecting the null hypothesis, given that the null hypothesis is false, is known as power. In other words, power is the probability of correctly rejecting \(H_0\).

Power
\(Power = 1-\beta\)
\(\beta\) = probability of committing a Type II Error.

The power of a test can be increased in a number of ways, for example by increasing the sample size, decreasing the standard error, increasing the difference between the sample statistic and the hypothesized parameter, or increasing the \(\alpha\) level. Using a directional test (i.e., a left- or right-tailed test) as opposed to a two-tailed test also increases power.

When we increase the sample size, decrease the standard error, or increase the difference between the sample statistic and the hypothesized parameter, the p-value decreases, making it more likely that we reject the null hypothesis. When we increase the alpha level, there is a larger range of p-values for which we would reject the null hypothesis. Going from a two-tailed to a one-tailed test cuts the p-value in half. In all of these cases, we say that statistical power is increased.

There is a relationship between \(\alpha\) and \(\beta\). If the sample size is fixed, then decreasing \(\alpha\) will increase \(\beta\). If we want both \(\alpha\) and \(\beta\) to decrease (i.e., decreasing the likelihood of both Type I and Type II errors), then we should increase the sample size.
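
These relationships can be explored by simulation. The sketch below estimates the power of a right-tailed one-sample t-test at several sample sizes, assuming a true mean of 505 against a null of 500 with a standard deviation of 100 (all values are illustrative):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(42)
  mu_true, mu0, sd, alpha = 505, 500, 100, 0.05

  def estimated_power(n, reps=2000):
      """Fraction of simulated samples that reject H0: mu = mu0 (right-tailed)."""
      rejections = 0
      for _ in range(reps):
          sample = rng.normal(mu_true, sd, size=n)
          t = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
          if stats.t.sf(t, df=n - 1) <= alpha:
              rejections += 1
      return rejections / reps

  for n in [50, 200, 800]:
      print(f"n = {n:>3}: estimated power is about {estimated_power(n):.2f}")

As expected, the estimated power grows with the sample size; raising \(\alpha\) or the true difference in the simulation would raise it as well.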

Try it!

Question 1
If the power of a statistical test is increased, for example by increasing the sample size, how does the probability of a Type II error change?

The probability of committing a Type II error is known as \(\beta\).

\(Power+\beta=1\)

\(Power=1-\beta\)

If power increases then \(\beta\) must decrease. So, if the power of a statistical test is increased, for example by increasing the sample size, the probability of committing a Type II error decreases.

Question 2
When we fail to reject the null hypothesis, can we accept the null hypothesis? For example, with a p-value of 0.12 we fail to reject the null hypothesis at the 0.05 alpha level. Can we say that the data support the null hypothesis?

No. When we perform a hypothesis test, we only set the Type I error rate (i.e., alpha level) and guard against it. Thus, we can only present the strength of evidence against the null hypothesis. We can sidestep the concern about Type II error if the conclusion never mentions that the null hypothesis is accepted. When the null hypothesis cannot be rejected, there are two possible cases:

1) The null hypothesis is really true.

2) The sample size is not large enough to reject the null hypothesis (i.e., statistical power is too low).

Question 3
A study was conducted by a retail store to determine if the majority of their customers were teenagers. With \(\widehat{p}=0.48\), the null hypothesis was not rejected and the company concluded that they did not have enough evidence that the majority of their customers were teenagers. But, in reality, the proportion of all of their customers (i.e., the population) who are teenagers is actually \(p=0.53\). Did this research study result in a Type I error, Type II error, or correct decision?

The result of the study was to fail to reject the null hypothesis. In reality, the null hypothesis was false. This is a Type II error.

Question 4
A university conducted a hypothesis test to determine if their students' average SAT-Math score was greater than the national average of 500. They collected a sample of \(n=800\) students and found \(\overline{x}=506\). The t-test statistic was 1.70 and \(p=0.045\) therefore they rejected the null hypothesis and concluded that the mean SAT-Math score at their university was higher than the national average. However, in reality, in the population of all students at the university, the mean SAT-Math score is 503. Did this research study result in a Type I error, Type II error, or correct decision?

This study came to a correct conclusion. They rejected the null hypothesis and concluded that \(\mu>500\) when in reality \(\mu=503\), which is greater than 500.

6.6 - Confidence Intervals & Hypothesis Testing

Confidence intervals and hypothesis tests are similar in that they are both inferential methods that rely on an approximated sampling distribution. Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to test a specified hypothesis. Hypothesis testing requires that we have a hypothesized parameter. 

The simulation methods used to construct bootstrap distributions and randomization distributions are similar. One primary difference is that a bootstrap distribution is centered on the observed sample statistic, while a randomization distribution is centered on the value stated in the null hypothesis.

In Lesson 4, we learned that confidence intervals contain a range of reasonable estimates of the population parameter. All of the confidence intervals we constructed in this course were two-tailed, and these go hand-in-hand with the two-tailed hypothesis tests we learned in Lesson 5. The conclusion drawn from a two-tailed confidence interval is usually the same as the conclusion drawn from a two-tailed hypothesis test. In other words, if the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always reject the null hypothesis.
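
This relationship can be expressed directly in code. The sketch below checks whether a hypothesized value falls inside a 95% bootstrap percentile interval; the decision will almost always match a two-tailed test at \(\alpha = 0.05\). The data are invented stand-ins, not a real dataset.

  import numpy as np

  rng = np.random.default_rng(0)
  sample = rng.normal(98.25, 0.73, size=130)   # invented sample data
  mu0 = 98.6                                   # hypothesized parameter

  # 95% bootstrap percentile confidence interval for the mean
  boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
                for _ in range(10_000)]
  lower, upper = np.percentile(boot_means, [2.5, 97.5])

  print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
  if lower <= mu0 <= upper:
      print("Hypothesized value inside the CI: expect to fail to reject H0.")
  else:
      print("Hypothesized value outside the CI: expect to reject H0 at alpha = 0.05.")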

Example: Mean

This example uses the Body Temperature dataset built in to StatKey to construct a bootstrap confidence interval and conduct a randomization test.

Let's start by constructing a 95% confidence interval using the percentile method in StatKey:

[StatKey bootstrap dotplot of the sample mean: 6,000 bootstrap samples, mean = 98.261, standard error = 0.108. The middle 95% of bootstrap means falls between 98.044 and 98.474.]

The 95% confidence interval for the mean body temperature in the population is [98.044, 98.474].

Now, what if we want to know if there is enough evidence that the mean body temperature is different from 98.6 degrees? We can conduct a hypothesis test. Because 98.6 is not contained within the 95% confidence interval, it is not a reasonable estimate of the population mean. We should expect to have a p value less than 0.05 and to reject the null hypothesis.

\(H_0: \mu=98.6\)

\(H_a: \mu \ne 98.6\)

[StatKey randomization dotplot for the null hypothesis \(\mu = 98.6\): 5,000 randomization samples, mean = 98.601, standard error = 0.106. The proportion of randomization means at or below the observed sample mean of 98.26 is 0.00080.]

\(p = 2 \times 0.00080 = 0.0016\)

\(p \leq 0.05\), reject the null hypothesis

There is evidence that the population mean is different from 98.6 degrees. 
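
The same conclusion can be reached with a do-it-yourself randomization test. The sketch below uses a shift-and-resample approach in the spirit of StatKey's randomization test for a single mean, with invented stand-in data in place of the built-in body temperature dataset:

  import numpy as np

  rng = np.random.default_rng(1)
  sample = rng.normal(98.26, 0.73, size=130)   # stand-in for the real data
  mu0 = 98.6

  # Shift the sample so it is centered on the null value, then resample
  # to build the randomization distribution of the mean.
  shifted = sample - sample.mean() + mu0
  rand_means = np.array([rng.choice(shifted, size=shifted.size, replace=True).mean()
                         for _ in range(5000)])

  # Two-tailed p-value: proportion of randomization means at least as far
  # from the null value as the observed sample mean.
  observed = abs(sample.mean() - mu0)
  p = np.mean(np.abs(rand_means - mu0) >= observed)
  print(f"observed mean = {sample.mean():.3f}, two-tailed p = {p:.4f}")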

Selecting the Appropriate Procedure

The decision of whether to use a confidence interval or a hypothesis test depends on the research question. If we want to estimate a population parameter, we use a confidence interval. If we are given a specific population parameter (i.e., hypothesized value), and want to determine the likelihood that a population with that parameter would produce a sample as different as our sample, we use a hypothesis test. Below are a few examples of selecting the appropriate procedure. 

Example: Cheese Consumption

Research question: How much cheese (in pounds) does an average American adult consume annually? 

What is the appropriate inferential procedure? 

Cheese consumption, in pounds, is a quantitative variable. We have one group: American adults. We are not given a specific value to test, so the appropriate procedure here is a confidence interval for a single mean.

Example: Age

Research question: Is the average age in the population of all STAT 200 students greater than 30 years?

What is the appropriate inferential procedure? 

There is one group: STAT 200 students. The variable of interest is age in years, which is quantitative. The research question includes a specific population parameter to test: 30 years. The appropriate procedure is a hypothesis test for a single mean.

Try it!

For each research question, identify the variables and the parameter of interest, and decide on the appropriate inferential procedure.

  1. Research question: How strong is the correlation between height (in inches) and weight (in pounds) in American teenagers?

    There are two variables of interest: (1) height in inches and (2) weight in pounds. Both are quantitative variables. The parameter of interest is the correlation between these two variables.

    We are not given a specific correlation to test. We are being asked to estimate the strength of the correlation. The appropriate procedure here is a confidence interval for a correlation.

  2. Research question: Are the majority of registered voters planning to vote in the next presidential election?

    The parameter that is being tested here is a single proportion. We have one group: registered voters. "The majority" would be more than 50%, i.e., \(p > 0.50\). This is a specific parameter that we are testing. The appropriate procedure here is a hypothesis test for a single proportion.

  3. Research question: On average, are STAT 200 students younger than STAT 500 students?

    We have two independent groups: STAT 200 students and STAT 500 students. We are comparing them in terms of average (i.e., mean) age.

    If STAT 200 students are younger than STAT 500 students, that translates to \(\mu_{200}<\mu_{500}\) which is an alternative hypothesis. This could also be written as \(\mu_{200}-\mu_{500}<0\), where 0 is a specific population parameter that we are testing. 

    The appropriate procedure here is a hypothesis test for the difference in two means.

  4. Research question: On average, how much taller are adult male giraffes compared to adult female giraffes?

    There are two groups: males and females. The response variable is height, which is quantitative. We are not given a specific parameter to test; instead, we are asked to estimate "how much" taller males are than females. The appropriate procedure is a confidence interval for the difference in two means.

  5. Research question: Are STAT 500 students more likely than STAT 200 students to be employed full-time?

    There are two independent groups: STAT 500 students and STAT 200 students. The response variable is full-time employment status which is categorical with two levels: yes/no.

    If STAT 500 students are more likely than STAT 200 students to be employed full-time, that translates to \(p_{500}>p_{200}\) which is an alternative hypothesis. This could also be written as \(p_{500}-p_{200}>0\), where 0 is a specific parameter that we are testing. The appropriate procedure is a hypothesis test for the difference in two proportions.

  6. Research question: Is there is a relationship between outdoor temperature (in Fahrenheit) and coffee sales (in cups per day)?

    There are two variables here: (1) temperature in Fahrenheit and (2) cups of coffee sold in a day. Both variables are quantitative. The parameter of interest is the correlation between these two variables.

    If there is a relationship between the variables, that means that the correlation is different from zero. This is a specific parameter that we are testing. The appropriate procedure is a hypothesis test for a correlation.


6.7 - Lesson 6 Summary

Objectives

Upon successful completion of this lesson, you should be able to:

  • Identify Type I and Type II errors
  • Select an appropriate significance level (i.e., \(\alpha\) level) for a given scenario
  • Explain the problems associated with conducting multiple tests
  • Interpret the results of a hypothesis test in terms of practical significance
  • Distinguish between practical significance and statistical significance
  • Explain how changing different aspects of a research study would change the statistical power of the tests conducted
  • Compare and contrast confidence intervals and hypothesis tests
