Lesson 11: Significance Testing Caveats & Ethics of Experiments

Lesson Overview

In Lesson 10 we learned that the purpose of a significance test is to answer the question "Does the null hypothesis provide a reasonable explanation for the data?" The question is addressed by carrying out a probability calculation assuming the null hypothesis is true.

With a small p-value we answer "No" - the null hypothesis is a poor explanation of the data.  

With a big p-value we answer "Yes" - the null hypothesis is a reasonable explanation of the data.

But, as with any statistical answer, we must understand the ways in which our answer could be wrong or misleading.

  • There is a small chance that we judge the null hypothesis to be a poor explanation of the data even when it is true (a type 1 error).
  • It is possible that we judge the null hypothesis to be a reasonable explanation of the data even when it is false (a type 2 error).
  • A significant test can be misleading if the sample size is so large that a significant result is of little practical importance.
  • A non-significant test can be misleading if the sample size is so small that an important effect goes undetected.
  • There can be a very high chance of at least one type 1 error occurring when many significance tests are carried out.

Finally, we remember the caveats from Lessons 2 and 3: misleading interpretations of any statistical results can occur if we do not gather data in a thoughtful and unbiased manner, making sure that samples are representative of populations and that groups are similar in nature for comparative studies. Another requirement of the statistical design of a study is to ensure the ethical treatment of subjects. Animal subjects should not be subjected to needless pain and suffering. The safety of human subjects must be paramount; they should take part voluntarily and be informed of the nature of the experiment and the possible risks entailed. Further, these ethical concerns must trump other aspects of statistical design.


After successfully completing this lesson, you should:

  • Be able to identify the type 1 and the type 2 error in the context of the problem.
  • Be able to reason about the small sample caution: important effects may go undetected with a small sample.
  • Be able to reason about the large sample caution: unimportant effects may be significant with a large sample.
  • Be able to reason about the multiple testing problem: false positives are more likely to occur when doing many tests.
  • Understand the importance of the elements of informed consent in human subjects research:
    • Disclosure
    • Capacity
    • Voluntary Participation

11.1 - Significance Testing Caveats

Here we take a look at the four principal caveats to watch out for when reading the results of a statistical hypothesis test: the large sample caution; the small sample caution; the multiple testing problem; and the misinterpretation problem.

Example 11.1: Pizza delivery times

When a pizza is ordered for delivery over the phone, the person answering the call will let the customer know how long to expect to wait before the pizza is delivered to their home. A study carried out in Columbus, Ohio examined the issue of whether the times given tend to overestimate how much time it will take to deliver. The researchers believed that overestimates were more likely than underestimates since the restaurants realize customers will be happier if a pizza is delivered early than if it is delivered late. In the study 198 pizzas were ordered over the period of one week at different restaurants at different times of day. The pizzas arrived an average of 3 minutes early with a standard deviation of 15 minutes. Were the average delivery times significantly early? Let's carry out the significance test:

  1. Step 1

    The parameter of interest is the true mean difference \(\mu\) between the estimated and actual delivery times (estimated time – actual time), in minutes, for all pizza stores. The hypotheses are: null: \(\mu = 0\); alternative: \(\mu \gt 0\) (the estimated time is an overestimate).

  2. Step 2

    If the null hypothesis is true and the delivery times are independent, then the average of 198 differences between estimated and actual delivery times would have a mean of 0 and a standard error of the mean given by \(15/\sqrt{198} \approx 1.07\) minutes. Also, the average differences would closely follow the normal curve. We find the standard score to be \(z = (3 - 0)/1.07 \approx 2.8\).

  3. Step 3

    From the normal curve table we find the p-value to be about 1 - 0.997 = 0.003 or 0.3%.

  4. Step 4

    With such a puny p-value we conclude that the null hypothesis is a very poor explanation of the data. The conclusion: We have significant evidence that the estimated delivery times given over the phone are, on average, later than actual delivery times.
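The four steps above can be checked numerically. The following is a minimal Python sketch, not part of the original study: the function name `one_sided_z_test` is an illustrative choice, and the tail area of the normal curve is computed with the standard library's `math.erfc` instead of being read from a table.

```python
import math

def one_sided_z_test(sample_mean, null_mean, sd, n):
    """Return (standard error, z, upper-tail p-value) for a one-sided z-test."""
    # Step 2: standard error of the mean and the standard score.
    se = sd / math.sqrt(n)
    z = (sample_mean - null_mean) / se
    # Step 3: area under the normal curve to the right of z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return se, z, p_value

# Values from Example 11.1: 198 orders, mean 3 minutes early, SD 15 minutes.
se, z, p_value = one_sided_z_test(sample_mean=3, null_mean=0, sd=15, n=198)
print(f"standard error = {se:.2f} minutes")
print(f"z = {z:.2f}")
print(f"p-value = {p_value:.4f}")
```

Because no intermediate rounding is done, the printed p-value falls between 0.002 and 0.003, consistent with the table value of about 0.003.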

The results are indeed significant in the statistical sense (they are very unlikely to be explained by random chance alone). But are they of any practical significance? Does a pizza arriving 3 minutes early have any practical consequences? Or would you consider the 3 minute average difference found in this study to be pretty close to the times given over the phone?

In this example the sample size of 198 pizza orders was quite large, leaving very little variability in the estimate of the mean value at question. With such little variability, even a small difference of no practical consequence is seen as statistically significant. That is the heart of the Large Sample Caution.

 The Large Sample Caution:

With a sufficiently large sample size, one can detect the smallest of departures from the null hypothesis. For studies with large sample sizes, ask yourself if the magnitude of the observed difference from the null hypothesis is of any practical importance.
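To see how sample size drives this, here is a rough sketch that holds the observed difference (3 minutes) and standard deviation (15 minutes) from Example 11.1 fixed and varies only the number of orders. The sample sizes other than 198 are hypothetical, and the helper name `upper_tail_p` is made up for illustration.

```python
import math

def upper_tail_p(mean_diff, sd, n):
    """One-sided p-value for a mean difference, using the normal curve."""
    z = mean_diff / (sd / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same 3-minute difference and 15-minute SD; only the sample size changes.
for n in (10, 50, 198, 1000):
    print(f"n = {n:4d}  p-value = {upper_tail_p(3, 15, n):.4f}")
```

The same 3-minute difference is not significant with 10 orders but becomes highly significant as the sample grows, even though its practical importance has not changed.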

Example 11.2: Treating Epilepsy in Rural India

A study in the journal Lancet reported on a randomized controlled experiment comparing the use of Phenobarbital with Phenytoin for childhood epilepsy in rural India. Because of its low cost, Phenobarbital is recommended by WHO for treating epilepsy in developing countries. This is controversial because of previously reported behavioral side effects. In this study, behavioral problems did not occur at a significantly lower rate in the Phenytoin group and the authors concluded: "This evidence supports the acceptability of Phenobarbital as a first line drug for childhood epilepsy in rural settings in developing countries."

However, there were only 47 patients in each group and, because of missing data, many comparisons were based on only 32 patients per group. The standard error for the difference between proportions in groups of that size is about 0.125, so a result would not be found significant unless the observed difference was about twice the standard error (roughly 0.25, or 25 percentage points). Thus, the authors' conclusion in this research report is not justified by the evidence. Because of the small sample sizes, even important differences between Phenobarbital and Phenytoin could easily go undetected. This is the heart of the Small Sample Caution.

 The Small Sample Caution:

For very small sample sizes, a very large departure of the sample results from the null hypothesis may not be statistically significant (although it may be of practical concern). This should motivate one to do a better study with a larger sample size.
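The standard error quoted in Example 11.2 can be reproduced with a short calculation. This sketch assumes both true proportions are near 0.5, which gives the largest possible standard error; the source does not say which proportions were used, so treat that value as an assumption. The function name is illustrative.

```python
import math

def se_diff_proportions(p1, p2, n1, n2):
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Assumed proportions of 0.5 in both groups (worst case for variability).
se_32 = se_diff_proportions(0.5, 0.5, 32, 32)   # groups after missing data
se_47 = se_diff_proportions(0.5, 0.5, 47, 47)   # groups as enrolled

# A difference needs to be roughly 2 standard errors to reach significance.
min_detectable_32 = 2 * se_32

print(f"SE with 32 per group: {se_32:.3f}")
print(f"SE with 47 per group: {se_47:.3f}")
print(f"Approximate smallest significant difference (n = 32): {min_detectable_32:.2f}")
```

Under these assumptions, the two drugs would need to differ by about 25 percentage points in the rate of behavioral problems before a study this small would flag the difference as significant.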

Example 11.3: If you want a boy, eat your cereal

A 2008 study in the British journal the Proceedings of the Royal Society, Biological Sciences, found a significant relationship between how much breakfast cereal a woman eats and whether she has male children. Among 740 British women, they found that women in the top third of cereal eaters had 56% male children while the women in the lowest third of cereal eaters had only 45% male children. But it turns out that 132 different food items were examined so finding some with highly significant results should not come as a surprise. After all, even when the null hypothesis is true there is a 1% chance of getting a p-value less than 1% and declaring the result highly significant (that comes directly from the definition of the p-value). So if you look at 132 significance tests, finding one that is highly significant is very much expected. This is the heart of the Multiple Testing Caution.

 The Multiple Testing Caution:

When a large number of significance tests are conducted, some individual tests may be deemed significant just by chance even if the null hypothesis is true (false positives).
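A quick calculation shows why a highly significant result among 132 tests is unsurprising. The sketch below assumes that every null hypothesis is true and that the 132 tests are independent of one another; the independence assumption is ours, not something stated in the study. The function names are illustrative.

```python
def chance_of_any_false_positive(num_tests, alpha):
    """P(at least one p-value below alpha) when all nulls are true and tests are independent."""
    return 1 - (1 - alpha) ** num_tests

def expected_false_positives(num_tests, alpha):
    """Average number of false positives when all nulls are true."""
    return num_tests * alpha

# Example 11.3: 132 food items, "highly significant" taken as p < 1%.
any_fp = chance_of_any_false_positive(num_tests=132, alpha=0.01)
expected_fp = expected_false_positives(num_tests=132, alpha=0.01)

print(f"Chance of at least one false positive: {any_fp:.2f}")
print(f"Expected number of false positives: {expected_fp:.2f}")
```

Under these assumptions there is roughly a 73% chance of at least one highly significant result, and a little more than one false positive is expected on average.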

Along with these main cautions, also be on the lookout for misinterpretations of the p-value and of the meaning of significance. A significant result tells you that the null hypothesis is a poor explanation for the data. A large p-value tells you that the null hypothesis is a reasonable explanation for the data.

A significance test or a p-value does not tell you the chance that the null hypothesis is correct or the chance that the alternative is correct. After all, it is calculated assuming the null. A significance test or p-value cannot tell us when a result is important in a practical sense. A significance test or p-value cannot tell you whether the methods used to gather the data were biased, thus creating differences where none exist in the population. A small p-value does not tell you what aspect of a null hypothesis with multiple assumptions is causing the poor fit to the data (e.g., in the pizza study above, the assumption that the individual delivery times are independent may be substantially wrong). Beware of reports you see in the media that make any of these common misinterpretations of the results of a hypothesis test. Of course, that includes unwarranted claims of cause-and-effect. Finding significance implies that the null hypothesis provides a poor explanation of the data. But there may be many other potential explanations for the data besides a causal treatment effect – especially in an observational study.

11.2 - Experimental Ethics

Clearly, the ethical treatment of animals and humans taking part in experimental research is a moral obligation of every researcher. But scientists are typically not formally trained to fully appreciate some of the finer points that must be handled in order to avoid unintended physical or (for humans) psychological harm.  As a result, guidelines for the treatment of human and animal participants in research studies have been developed at the national level and implemented with great care at the local level throughout the country.  

For animal experiments, the guidelines focus on maintaining clean caging and appropriate feeding and, most importantly, on avoiding unnecessary pain and suffering. For example, in studies using mice to study treatments for cancer, tumors are not allowed to grow beyond the point that causes the animal to suffer. Such guidelines become much stricter than this bare-bones standard when larger mammals are used. Animal research remains a controversial practice, with some believing such research should be banned entirely and others arguing that the benefits to humans provide the ethical underpinning for this type of research.

With human subjects, the guidelines are precise and strict and based on an ethical consensus developed over many decades - partly in reaction to questionable experimentation done in the past. 

Institutional Review Boards

Each institution receiving federal money to conduct research must establish an Institutional Review Board (IRB), made up of a diverse group of scientists and community members, to evaluate the ethical conduct of research conducted there. (Institutions not receiving federal dollars must still follow the basic guidelines and make alternate arrangements for an independent evaluation of their research practices.) The IRB is charged with ensuring that every research study is planned:

  • To minimize physical or psychological risks to subjects and to be sure any remaining risks are reasonable given the potential benefits to the subject and the importance of the knowledge gained by the experiment.
  • To ensure the voluntary nature of participation.
  • To ensure that the risks and benefits of research are shared equitably by different groups in society.
  • To protect vulnerable populations, such as children and the mentally disabled, who may not have the full capacity to balance the risks and benefits to themselves.
  • To be sure that every subject gives Informed Consent, based on a document that describes in plain English
    • the research and what is requested of the subjects,
    • anticipated risks and potential benefits,
    • alternatives to participation,
    • provisions maintaining the subject’s privacy and confidentiality of records, and
    • the right to leave the study at any time without any detriment to the subject.

The Informed Consent must be documented.

From time to time there might be a conflict between maintaining these ethical standards and what might constitute the best way to gain scientific knowledge. In such cases the ethical considerations must always take precedence.

Example 11.4: Ibuprofen Cream Study

Ibuprofen is a leading analgesic and is sold over the counter in a variety of forms to take orally (e.g., in capsules, tablets, and as a liquid). In Europe, Ibuprofen is also available as a gel or cream to apply topically to a sore area, but this form has not yet been approved by the Food and Drug Administration (FDA) in the United States. How should a research study be designed to demonstrate the effectiveness of ibuprofen cream in reducing muscle soreness? 

The Statistical Design

The study involved a resistance exercise of the type commonly done in a gym or health club. Previous work by researchers at the University of Massachusetts provided a detailed model for the time course of muscle function and other physiologic reactions to temporary muscle damage due to strenuous exercise. The soreness will reach its peak between 36 hours and 72 hours post-exercise and will subside within a couple of days. Since "soreness" is a difficult concept to quantify, several response variables were used (e.g., a Visual Analog Scale where subjects are asked to point to a spot along a line labeled 0 at one end and 100 at the other corresponding to their level of pain/soreness, and are told that 0 = no pain and 100 = excruciating pain). The study used a matched pairs randomized controlled double-blind strategy. An arm was randomly selected and used in the exercise regime. Then, 48 hours later, either the ibuprofen cream or a placebo cream was randomly selected to be applied to the sore muscle in the arm. Soreness levels were measured pre-treatment, every hour for several hours, and then each day for several days after treatment. Several weeks later, the opposite arm was exercised and the opposite treatment was used. The analysis could then compare the time course of soreness relief within the same subject. Altogether, full data were obtained on 106 subjects.
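The random assignment in a matched pairs design like this one can be sketched in a few lines of Python. The source does not describe how the researchers actually carried out the randomization, so everything here is hypothetical: the function name `assign_crossover`, the subject numbering, the arm and cream labels, and the seed.

```python
import random

def assign_crossover(subject_ids, seed=None):
    """Randomly pick each subject's first arm and first cream; the second
    session uses the opposite arm and the opposite cream."""
    rng = random.Random(seed)
    schedule = {}
    for sid in subject_ids:
        first_arm = rng.choice(["left", "right"])
        second_arm = "right" if first_arm == "left" else "left"
        first_cream = rng.choice(["ibuprofen", "placebo"])
        second_cream = "placebo" if first_cream == "ibuprofen" else "ibuprofen"
        schedule[sid] = [(first_arm, first_cream), (second_arm, second_cream)]
    return schedule

# Hypothetical subject numbers 1 through 106 (the number with full data).
schedule = assign_crossover(range(1, 107), seed=11)
print(schedule[1])
```

Every subject receives both creams, one on each arm, so each person serves as their own comparison. In a double-blind study, a table like this would be kept from both the subjects and the people measuring soreness until the data were collected.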

Adding Ethical Considerations 

Informed Consent

The IRB at the University of Massachusetts Amherst approved the written informed consent document used in the ibuprofen cream study. Note the importance for the IRB to have general community representatives on the board to be sure that the instructions and explanations given in the informed consent document were written in plain English and free of scientific jargon. The IRB needed to be sure that the subjects knew what they were volunteering to do: complete an exercise program that would leave them sore - perhaps even very sore - for a few days with little benefit to themselves but of possible benefit to society.

Avoiding Coercion

To increase recruitment, many researchers would like to provide a monetary incentive for participation. But this can create a coercive effect, especially on poorer subjects who may wish to end their participation in a research project but cannot because of worries about losing the financial incentive. Large incentives violate the ethical principle of participation being completely voluntary. In the ibuprofen trial, subjects were reimbursed for all expenses and provided with meals during their time being measured. As part of being recruited, they also received a free exercise program at a gym (but signing up for the experiment was not a requirement of receiving that benefit) and they received a small amount of money for participating (but could keep that money regardless of whether they decided to leave the experiment early or not).

Avoiding Harm to Subjects

In the ibuprofen study, subjects over 45 years old were given a physical examination by a doctor to be sure they were in adequate health to participate in an exercise regime. All subjects were evaluated for conditions that might make the experiment a danger to them. For example, people with skin conditions or allergies that might be affected by the cream were not asked to participate, and women were not allowed to participate if they were pregnant. A follow-up visit ten days after the exercise regime was provided to be sure that all subjects were completely pain-free at that time and without any adverse residual effects of having participated. Finally, the exercise regime itself was designed to produce only moderate soreness. If the regime created severe pain, then the subject would be treated by a doctor immediately and would not be randomized into one of the experimental conditions (this did not happen to any of the subjects taking part, but such backup plans are important to experimental planning).

Ensuring Voluntarism of Participation

As noted above, all researchers must let the subjects know that they can withdraw from participation at any time and for any reason without prejudice. The researchers in the ibuprofen trial added that if a subject decided not to continue their participation for any reason, they would still be provided with free medical care (if they desired) to help them alleviate any residual discomfort that might have been caused by their initial participation.  

11.3 - Test Yourself!

11.4 - Have Fun With It!

animated cartoon about validation, "We test thousands of new treatments each year, so to avoid multiple=

J.B. Landers ©
