Hypothesis Testing

Key Topics:

  • Basic approach
  • Null and alternative hypotheses
  • Decision making and the p-value
  • Z-test & Nonparametric alternative

Basic approach to hypothesis testing

  1. State a model describing the relationship between the explanatory variables and the outcome variable(s) in the population and the nature of the variability. State all of your assumptions.
  2. Specify the null and alternative hypotheses in terms of the parameters of the model.
  3. Invent a test statistic that will tend to be different under the null and alternative hypotheses.
  4. Using the assumptions of step 1, find the theoretical sampling distribution of the statistic under the null hypothesis of step 2. Ideally the form of the sampling distribution should be one of the “standard distributions” (e.g., normal, t, binomial, ...).
  5. Calculate a p-value as the area under the sampling distribution that is more extreme than your observed statistic. Which direction counts as “more extreme” depends on the form of the alternative hypothesis.
  6. Choose your acceptable type 1 error rate (alpha) and apply the decision rule: reject the null hypothesis if the p-value is less than alpha, otherwise do not reject.
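As a concrete sketch of the six steps, the snippet below (with hypothetical numbers: μ0 = 65, σ = 3, n = 54, and an observed sample mean of 66) approximates the sampling distribution of step 4 by simulation and computes a two-sided p-value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: model -- data ~ Normal(mu, sigma = 3); Step 2: H0: mu = 65 vs. HA: mu != 65.
mu0, sigma, n = 65.0, 3.0, 54           # hypothetical numbers for illustration
sample_mean = 66.0                       # hypothetical observed sample mean

# Step 3: the test statistic is the standardized sample mean (a z-statistic).
z_obs = (sample_mean - mu0) / (sigma / np.sqrt(n))

# Step 4: sampling distribution of the statistic under H0, approximated by simulation.
sims = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)
z_sim = (sims - mu0) / (sigma / np.sqrt(n))

# Step 5: two-sided p-value = proportion of simulated statistics at least as extreme.
p_value = np.mean(np.abs(z_sim) >= abs(z_obs))

# Step 6: decision at a chosen Type I error rate.
alpha = 0.05
print(round(z_obs, 3), p_value < alpha)
```

In this fabricated example the simulated p-value is close to the theoretical one, since the z-statistic follows N(0, 1) exactly under the assumed model.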

One sample z-test

  1. Assume data are independently sampled from a normal distribution with unknown mean μ and known variance σ².
  2. Specify one of the following pairs:
    H0: μ = μ0 vs. HA: μ ≠ μ0
    H0: μ ≤ μ0 vs. HA: μ > μ0
    H0: μ ≥ μ0 vs. HA: μ < μ0
  3. Use a z-statistic:
    • \(z = \frac{\bar{X}-\mu_0}{\sigma / \sqrt{n}}\)
    • general form: (estimate − value we are testing)/(standard deviation of the estimate)
    • under H0, the z-statistic follows the N(0,1) distribution
  4. Calculate the p-value:
    • 2 × the area above |z|, the area above z, or the area below z (matching the alternative), or
    • equivalently, compare the statistic to a critical value: reject if |z| ≥ zα/2, z ≥ zα, or z ≤ −zα
  5. Choose the acceptable Type I error rate (e.g., α = 0.05) and state the conclusion.
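A minimal implementation of these steps, using only the standard library (the function name and the illustrative numbers are my own, not from the notes):

```python
import math

def norm_sf(z):
    """Upper-tail area of the standard normal, P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def one_sample_z(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z-test: returns (z, p-value) for the chosen alternative."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "two-sided":
        p = 2 * norm_sf(abs(z))      # 2 x the area above |z|
    elif alternative == "greater":
        p = norm_sf(z)               # area above z
    else:  # "less"
        p = 1 - norm_sf(z)           # area below z
    return z, p

# Hypothetical numbers (matching the height example later in the notes):
z, p = one_sample_z(xbar=66.463, mu0=65, sigma=3, n=54)
print(round(z, 2), p < 0.05)
```

Note how the p-value depends on the alternative: for the same data, the one-sided p-value is half the two-sided one.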

Making the Decision

It is either likely or unlikely that we would collect the evidence we did given the initial assumption. (Note: “likely” or “unlikely” is measured by calculating a probability!)

If it is likely, then we “do not reject” our initial assumption. There is not enough evidence to do otherwise.

If it is unlikely, then:

  • either our initial assumption is correct and we experienced an unusual event or,
  • our initial assumption is incorrect

In statistics, if it is unlikely, we decide to “reject” our initial assumption.

Example: Criminal Trial Analogy

First, state 2 hypotheses, the null hypothesis (“H0”) and the alternative hypothesis (“HA”)

  • H0: Defendant is not guilty.
  • HA: Defendant is guilty.

Usually the H0 is a statement of “no effect”, or “no change”, or “chance only” about a population parameter.

The HA, depending on the situation, states that there is a difference, trend, effect, or relationship with respect to a population parameter.

  • It can be one-sided or two-sided.
  • In a two-sided test we only care whether there is a difference, not its direction. In a one-sided test we care about a particular direction of the relationship: we want to know whether the value is strictly larger or strictly smaller.

Then, collect evidence, such as finger prints, blood spots, hair samples, carpet fibers, shoe prints, ransom notes, handwriting samples, etc. (In statistics, the data are the evidence.)

Next, you make your initial assumption.

  • Defendant is innocent until proven guilty.

In statistics, we always assume the null hypothesis is true.

Then, make a decision based on the available evidence.

  • If there is sufficient evidence (“beyond a reasonable doubt”), reject the null hypothesis. (Behave as if defendant is guilty.)
  • If there is not enough evidence, do not reject the null hypothesis. (Behave as if defendant is not guilty.)

If the observed outcome, e.g., a sample statistic, is surprising under the assumption that the null hypothesis is true, but more probable if the alternative is true, then this outcome is evidence against H0 and in favor of HA.

An observed effect so large that it would rarely occur by chance is called statistically significant (i.e., not likely to happen by chance).

Using the p-value to make the decision

The p-value tells us how likely we would be to observe such an extreme sample if the null hypothesis were true. Formally, the p-value is the probability, computed assuming the null hypothesis is true, that the test statistic would take a value as extreme as or more extreme than the one actually observed. Since it is a probability, it is a number between 0 and 1, and the closer it is to 0, the more “unlikely” the observed result. So if the p-value is “small” (typically, less than 0.05), we reject the null hypothesis.
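For example, suppose we test whether a coin is fair after observing 60 heads in 100 tosses (a hypothetical data set). The one-sided p-value is the probability, computed under H0: p = 0.5, of seeing 60 or more heads:

```python
import math

# Hypothetical example: 100 coin tosses, 60 heads; H0: fair coin (p = 0.5),
# HA: p > 0.5.  The p-value is P(X >= 60) under H0, with X ~ Binomial(100, 0.5).
n, k = 100, 60
p_value = sum(math.comb(n, x) for x in range(k, n + 1)) / 2**n
print(round(p_value, 4))
```

The result is below 0.05, so at the usual 5% level we would reject the hypothesis that the coin is fair.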

Significance level and p-value

Significance level, α, is the cutoff value for the p-value. In this context, significant does not mean “important”; it means “not likely to have happened just by chance”.

α is the maximum probability of rejecting the null hypothesis when the null hypothesis is true. If α = 1 we always reject the null hypothesis; if α = 0 we never reject it. In articles, journals, etc., you may read: “The results were significant (p < 0.05).” So if p = 0.03, the result is significant at the level α = 0.05 but not at the level α = 0.01. If we reject H0 at the level α = 0.05 (which corresponds to a 95% CI), we are saying that if H0 is true, the observed phenomenon would happen no more than 5% of the time (that is, 1 in 20). If we choose to compare the p-value to α = 0.01, we are insisting on stronger evidence!

Very Important Point!

Neither decision, rejecting or not rejecting H0, proves the null hypothesis or the alternative hypothesis. We merely state that there is enough evidence to behave one way or the other. This is always true in statistics!

So, what kind of error could we make? No matter what decision we make, there is always a chance we made an error.

Errors in Criminal Trial:

                           Truth: Not Guilty                    Truth: Guilty
  Verdict: Not Guilty      Correct decision                     Error (guilty defendant goes free)
  Verdict: Guilty          Error (innocent defendant convicted) Correct decision

Errors in Hypothesis Testing

Type I error (False positive): The null hypothesis is rejected when it is true.

  • α is the maximum probability of making a Type I error.

Type II error (False negative): The null hypothesis is not rejected when it is false.

  • β is the probability of making a Type II error

There is always a chance of making one of these errors. But, a good scientific study will minimize the chance of doing so!

                           Truth: H0 is true                    Truth: H0 is false
  Do not reject H0         Correct decision                     Type II error (β)
  Reject H0                Type I error (α)                     Correct decision (power)


The power of a statistical test is its probability of rejecting the null hypothesis if the null hypothesis is false. That is, power is the ability to correctly reject H0 and detect a significant effect. In other words, power is one minus the type II error risk.

\(\text{Power} = 1-\beta = P\left(\text{reject } H_0 \mid H_0 \text{ is false}\right)\)

Which error is worse?

Type I = you are innocent, yet accused of cheating on the test.

Type II = you cheated on the test, but you are found innocent.

This depends on the context of the problem, too. But in most cases scientists try to be “conservative”: it is worse to make a spurious discovery than to fail to make a good one. Our goal is to increase the power of the test, which, for a fixed α, also means minimizing the length of the CI.

We need to keep in mind:

  • the effect of the sample size,
  • the correctness of the underlying assumptions about the population,
  • statistical vs. practical significance, etc…

(see the handout). To study the tradeoffs between the sample size, α, and Type II error we can use power and operating characteristic curves.
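For the one-sided z-test the power can be computed in closed form, which is what a power curve displays. The sketch below (hypothetical effect: true mean 66 vs. μ0 = 65, with σ = 3 as in the height example) evaluates power over several sample sizes:

```python
from statistics import NormalDist
import math

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power = P(reject H0 | true mean is mu1) for the one-sided z-test HA: mu > mu0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)        # critical value of the test
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))     # distance of the truth from H0, in SE units
    return NormalDist().cdf(shift - z_alpha)         # P(Z >= z_alpha - shift)

# One point per sample size traces out a power curve: power grows with n.
for n in (10, 30, 54, 100):
    print(n, round(z_test_power(65, 66, 3, n), 3))
```

The same function can be used the other way around: fix a target power (say 0.80) and search for the smallest n that reaches it.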

Height Example

One sample z-test

Assume data are independently sampled from a normal distribution with unknown mean μ and known variance σ² = 9. Make an initial assumption that μ = 65.

Specify the hypothesis: H0: μ = 65 HA: μ ≠ 65

z-statistic: 3.58

Under H0, the z-statistic follows the N(0,1) distribution.

SAS output (not shown):

The p-value, < 0.0001, indicates that, if the average height in the population is 65 inches, it is unlikely that a sample of 54 students would have an average height of 66.4630.

Alpha = 0.05. Decision: p-value < alpha, thus reject the null hypothesis.

Conclude that the average height is not equal to 65.
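The numbers in this example can be checked directly from the quantities stated above (sample mean 66.4630, σ = 3, n = 54):

```python
import math

# Numbers from the height example: n = 54 students, sample mean 66.4630,
# known sigma = 3 (sigma^2 = 9), testing H0: mu = 65.
n, xbar, sigma, mu0 = 54, 66.4630, 3.0, 65.0

se = sigma / math.sqrt(n)                    # standard error of the mean
z = (xbar - mu0) / se                        # z-statistic
ci = (xbar - 1.96 * se, xbar + 1.96 * se)    # 95% confidence interval
print(round(z, 2), tuple(round(c, 4) for c in ci))
```

The z-statistic matches the 3.58 reported above, and the 95% CI lies entirely above 65, which is the CI-based way of seeing the same rejection.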

What type of error might we have made?

Type I error is claiming that average student height is not 65 inches, when it really is.

Type II error is failing to claim that the average student height is not 65 inches when in fact it is not.

We rejected the null hypothesis, i.e., claimed that the height is not 65, thus potentially making a Type I error. But sometimes the p-value is very small simply because the sample size is large, and we may have statistical significance but not practical significance! That is why most statisticians are much more comfortable using CIs than tests.


Graphical summary of the z-test (figure not shown)


Based on the CI only, how do you know that you should reject the null hypothesis?

The 95% CI is (65.6628,67.2631) ...

What about practical and statistical significance now? Is there another reason to suspect this test, and the p-value calculations?

There is a need for a further generalization. What if we can't assume that σ is known? In this case we would use s (the sample standard deviation) to estimate σ.

If the sample is very large, we can treat σ as known by assuming that σ = s. According to the law of large numbers, this is not too bad a thing to do. But if the sample is small, the fact that we have to estimate both the standard deviation and the mean adds extra uncertainty to our inference. In practice this means that we need a larger multiplier for the standard error.

We need the one-sample t-test.
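The “larger multiplier” can be made concrete: for a two-sided 95% test or interval the normal multiplier is 1.96, while the t multiplier (computed here with scipy, assumed available) is larger for small samples and shrinks toward 1.96 as n grows:

```python
from scipy import stats

# 97.5th percentile multiplier for a two-sided 95% interval/test:
# with the t-distribution (df = n - 1) the multiplier is larger for small n
# and approaches the normal value as n grows.
for n in (5, 10, 30, 54, 1000):
    t_mult = stats.t.ppf(0.975, df=n - 1)
    print(n, round(t_mult, 3))
print("normal:", round(stats.norm.ppf(0.975), 3))
```

For n = 54, as in the height example, the t and normal multipliers are already very close, which is why treating σ as known was not too bad there.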

One sample t-test

  1. Assume data are independently sampled from a normal distribution with unknown mean μ and unknown variance σ². Make an initial assumption, μ0.
  2. Specify one of the following pairs:
    H0: μ = μ0 vs. HA: μ ≠ μ0
    H0: μ ≤ μ0 vs. HA: μ > μ0
    H0: μ ≥ μ0 vs. HA: μ < μ0
  3. Use the t-statistic: \(\frac{\bar{X}-\mu_0}{s / \sqrt{n}}\), where s is the sample standard deviation.
  4. Under H0, the t-statistic follows a t-distribution with df = n − 1.
  5. Calculate the p-value from that t-distribution, according to the alternative.
  6. Choose the acceptable α (e.g., 0.05) and state the conclusion.
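A sketch of the t-test on a hypothetical sample of 20 heights, checking the hand formula against scipy's built-in `scipy.stats.ttest_1samp` (scipy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
heights = rng.normal(66, 3, size=20)    # hypothetical small sample of 20 heights

# Manual t-statistic: (xbar - mu0) / (s / sqrt(n)), with s the sample standard deviation.
n, mu0 = len(heights), 65.0
t_manual = (heights.mean() - mu0) / (heights.std(ddof=1) / np.sqrt(n))

# scipy's one-sample t-test computes the same statistic and its p-value
# from the t-distribution with df = n - 1.
res = stats.ttest_1samp(heights, popmean=mu0)
print(round(t_manual, 4), round(float(res.statistic), 4))
```

With a small n like this, the extra uncertainty from estimating s shows up as a heavier-tailed reference distribution than N(0,1).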

Testing for the population proportion

Let's go back to our CNN poll. Assume we have a SRS of 1,017 adults.

We are interested in testing the following hypotheses: H0: p = 0.50 vs. HA: p > 0.50

What is the test statistic?

If alpha = 0.05, what do we conclude?
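The test statistic here is z = (p̂ − p0)/√(p0(1 − p0)/n). The poll's observed proportion is not stated above, so the count below (540 "yes" answers out of 1,017) is purely hypothetical, for illustration only:

```python
import math

# Hypothetical count: assume 540 of the 1,017 respondents answered "yes".
n, successes, p0 = 1017, 540, 0.50
p_hat = successes / n

# z-statistic for a proportion: standard error uses the null value p0.
se0 = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0

# One-sided p-value for HA: p > 0.50 (upper-tail area of N(0,1)).
p_value = 0.5 * math.erfc(z / math.sqrt(2))
print(round(z, 2), p_value < 0.05)
```

With these made-up numbers the p-value falls just below 0.05, so at α = 0.05 we would reject H0 and conclude that a majority holds the opinion, illustrating how close such poll results can sit to the cutoff.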

We will see more details in the next lesson on proportions, then distributions, and possible tests.