11.1 - Significance Testing Caveats

Here we take a look at the four principal caveats to watch out for when reading the results of a statistical hypothesis test: the large sample caution; the small sample caution; the multiple testing problem; and the misinterpretation problem.

Example 11.1: Pizza delivery times

When a pizza is ordered for delivery over the phone, the person answering the call will let the customer know how long to expect to wait before the pizza is delivered to their home. A study carried out in Columbus, Ohio, examined whether the times given tend to overestimate how long delivery will take. The researchers believed that overestimates were more likely than underestimates, since restaurants realize customers will be happier if a pizza is delivered early than if it is delivered late. In the study, 198 pizzas were ordered over a period of one week from different restaurants at different times of day. The pizzas arrived an average of 3 minutes early, with a standard deviation of 15 minutes. Was the average delivery significantly early? Let's carry out the significance test:

  1. Step 1

    The parameter of interest is the true mean difference \(\mu\) between the estimated and actual delivery times (estimated time minus actual time), in minutes, for all pizza stores. The hypotheses are null: \(\mu = 0\); alternative: \(\mu \gt 0\) (the estimated time is an overestimate).

  2. Step 2

    If the null hypothesis is true and the delivery times are independent, then the average of 198 differences between estimated and actual delivery times would have a mean of 0 and a standard error of about \(15/ \sqrt{198} = 1.07\) minutes (using the sample standard deviation of 15 minutes). Also, the average of the differences would closely follow the normal curve. We find the standard score to be \(z = (3 - 0) / 1.07 = 2.8\).

  3. Step 3

    From the normal curve table we find the p-value to be about 1 - 0.997 = 0.003 or 0.3%.

  4. Step 4

    With such a puny p-value we conclude that the null hypothesis is a very poor explanation of the data. The conclusion: We have significant evidence that the estimated delivery times given over the phone are, on average, later than actual delivery times.
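The four steps above can be reproduced in a few lines. This is only a sketch using Python's standard library; the variable names are illustrative, and the normal approximation is the same one the example relies on. Because the code does not round the standard error to 1.07 before dividing, its z-score and p-value differ slightly from the hand calculation (the p-value lands between 0.002 and 0.003).

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the pizza study described above
n = 198             # number of pizza orders
sample_mean = 3.0   # average minutes early (estimated time minus actual time)
sample_sd = 15.0    # standard deviation of the differences, in minutes
null_mean = 0.0     # null hypothesis: mu = 0

# Step 2: standard error of the mean and standard score
standard_error = sample_sd / sqrt(n)
z = (sample_mean - null_mean) / standard_error

# Step 3: one-sided p-value (alternative is mu > 0), from the standard normal curve
p_value = 1.0 - NormalDist().cdf(z)

print(f"standard error = {standard_error:.2f} minutes")
print(f"z = {z:.2f}")
print(f"one-sided p-value = {p_value:.4f}")
```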

The results are indeed significant in the statistical sense (a difference this large would be very unlikely to arise by random chance alone if the null hypothesis were true). But are they of any practical significance? Does a pizza arriving 3 minutes early have any practical consequences? Or would you consider the 3-minute average difference found in this study to be pretty close to the times given over the phone?

In this example, the sample size of 198 pizza orders was quite large, leaving very little variability in the estimate of the mean value in question. With so little variability, even a small difference of no practical consequence is seen as statistically significant. That is the heart of the Large Sample Caution.

 The Large Sample Caution:

With a sufficiently large sample size, one can detect the smallest of departures from the null hypothesis. For studies with large sample sizes, ask yourself if the magnitude of the observed difference from the null hypothesis is of any practical importance.
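To see the Large Sample Caution at work, the sketch below holds the observed difference (3 minutes) and standard deviation (15 minutes) from the pizza example fixed and varies only the sample size. The sample sizes other than 198 are hypothetical, and holding the mean and standard deviation constant across them is an assumption made purely for illustration. The normal approximation is also rough for the smallest sample size shown.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(n, mean_diff=3.0, sd=15.0):
    """Upper-tail p-value for H0: mu = 0 vs Ha: mu > 0, normal approximation."""
    standard_error = sd / sqrt(n)
    z = mean_diff / standard_error
    return 1.0 - NormalDist().cdf(z)

# Hypothetical sample sizes; only n = 198 comes from the study itself
sample_sizes = [10, 50, 198, 1000]
p_values = {n: one_sided_p_value(n) for n in sample_sizes}

for n, p in p_values.items():
    print(f"n = {n:5d}  one-sided p-value = {p:.4g}")
```

The same 3-minute difference is far from significant with 10 orders and overwhelmingly significant with 1000, even though its practical importance has not changed.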

Example 11.2: Treating Epilepsy in Rural India

A study in the journal Lancet reported on a randomized controlled experiment comparing the use of Phenobarbital with Phenytoin for childhood epilepsy in rural India. Because of its low cost, Phenobarbital is recommended by WHO for treating epilepsy in developing countries. This is controversial because of previously reported behavioral side effects. In this study, behavioral problems did not occur at a significantly lower rate in the Phenytoin group and the authors concluded: "This evidence supports the acceptability of Phenobarbital as a first line drug for childhood epilepsy in rural settings in developing countries."

However, there were only 47 patients in each group, and because of missing data, many comparisons were based on only 32 patients per group. The standard error for the difference between proportions in groups that size is about 0.125, and results would not be found to be significant unless the observed difference was about twice that large (roughly 0.25, or 25 percentage points). Thus, it is clear that the authors' conclusion in this research report is not justified by the evidence. Because of the small sample sizes, even important differences between Phenobarbital and Phenytoin could easily go undetected. This is the heart of the Small Sample Caution.

 The Small Sample Caution:

For very small sample sizes, a very large departure of the sample results from the null hypothesis may not be statistically significant (although it may be of practical concern). This should motivate one to do a better study with a larger sample size.
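The standard error of about 0.125 quoted in the epilepsy example can be checked with the usual formula for the difference between two independent proportions. The sketch below assumes a proportion of 0.5 in both groups, which is the value that makes the standard error largest; that choice is an assumption for illustration, not a figure from the study. The function name is likewise illustrative.

```python
from math import sqrt

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

assumed_p = 0.5  # assumption: worst-case proportion in both groups

se_32 = se_diff_proportions(assumed_p, 32, assumed_p, 32)  # 32 patients per group
se_47 = se_diff_proportions(assumed_p, 47, assumed_p, 47)  # all 47 patients per group

# Rough rule used in the text: a difference must be about 2 standard errors
min_detectable_32 = 2 * se_32

print(f"SE with 32 per group: {se_32:.3f}")
print(f"SE with 47 per group: {se_47:.3f}")
print(f"Difference needed for significance (32 per group): {min_detectable_32:.2f}")
```

Under these assumptions, the two groups would have to differ by about 25 percentage points before the test flagged the difference, which is why a non-significant result here says little about whether the drugs really behave alike.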

Example 11.3: If you want a boy, eat your cereal

A 2008 study in the British journal Proceedings of the Royal Society B: Biological Sciences found a significant relationship between how much breakfast cereal a woman eats and whether she has male children. Among 740 British women, the researchers found that women in the top third of cereal eaters had 56% male children, while the women in the lowest third of cereal eaters had only 45% male children. But it turns out that 132 different food items were examined, so finding some with highly significant results should not come as a surprise. After all, even when the null hypothesis is true, there is a 1% chance of getting a p-value less than 1% and declaring the result highly significant (that comes directly from the definition of the p-value). So if you look at 132 significance tests, you would expect about one of them (132 × 0.01 ≈ 1.3) to be highly significant by chance alone. This is the heart of the Multiple Testing Caution.

 The Multiple Testing Caution:

When a large number of significance tests are conducted, some individual tests may be deemed significant just by chance even when the corresponding null hypotheses are true (false positives).
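The arithmetic behind the Multiple Testing Caution is short enough to sketch. The code below uses the 132 tests and the 1% threshold from the cereal example. The expected count holds whether or not the tests are related, but the probability of at least one false positive assumes the 132 tests are independent, which is an assumption made here for simplicity and is unlikely to hold exactly for foods in the same diet.

```python
n_tests = 132   # number of food items tested in the cereal study
alpha = 0.01    # threshold for "highly significant"

# Expected number of false positives if every null hypothesis is true
expected_false_positives = n_tests * alpha

# Chance of at least one false positive, ASSUMING the tests are independent
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {prob_at_least_one:.2f}")
```

Under these assumptions, seeing at least one highly significant food item is more likely than not, even if no food has any effect at all.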

Along with these main cautions, also be on the lookout for misinterpretations of the p-value and of the meaning of significance. A significant result tells you that the null hypothesis is a poor explanation for the data. A large p-value tells you that the null hypothesis is a reasonable explanation for the data.

A significance test or a p-value does not tell you the chance that the null hypothesis is correct or the chance that the alternative is correct. After all, it is calculated assuming the null. A significance test or p-value cannot tell you when a result is important in a practical sense. A significance test or p-value cannot tell you whether the methods used to gather the data were biased, thus creating differences where none exist in the population. A small p-value does not tell you what aspect of a null hypothesis with multiple assumptions is causing the poor fit to the data (e.g., in the pizza study above, the assumption that the individual delivery times are independent may be substantially wrong). Beware of reports you see in the media that make any of these common misinterpretations of the results of a hypothesis test. Of course, that includes unwarranted claims of cause and effect. Finding significance implies that the null hypothesis provides a poor explanation of the data. But there may be many other potential explanations for the data besides a causal treatment effect, especially in an observational study.