
Except when diagnosing, we are not usually interested in the specific items (e.g. mice, tissue samples) we are measuring; we use these units to represent some larger class. For example, we might measure the immune response of 50 volunteers to a new vaccine with the objective of estimating the response when the vaccine is later used in a patient group.

In statistical inference, the population is the general class of objects (tissues, cells…) about which we want to make an inferential statement, and the sample is the set of observed items drawn from the population. If all members of the population were identical and there were no measurement error, there would be no need for statistical inference. Because there is both biological and technical variability, we need to carefully design our studies to control and quantify that variability, and to use inferential tools that leverage both the current study and prior information to produce valid and reproducible conclusions.

The idea behind statistical inference is that we use the sample to draw conclusions about the population. This requires the sample to be representative of the population and not biased. The most fundamental tool we have for this is randomization. Randomization means that our units are selected at random from the population and that any treatments are applied according to a randomization scheme. The statistical analysis then uses the randomization information as part of the inferential procedure to quantify the variability. See reference [1].
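As a minimal sketch of a randomization scheme, the following hypothetical example assigns twelve mice completely at random to a treatment and a control group (the mouse labels, group sizes, and seed are illustrative assumptions, not from the text):

```python
import random

# Hypothetical example: assign 12 mice to two groups completely at random.
mice = [f"mouse_{i}" for i in range(1, 13)]

rng = random.Random(42)          # fixed seed so the assignment is reproducible
shuffled = rng.sample(mice, k=len(mice))

treatment = shuffled[:6]         # first half receives the treatment
control = shuffled[6:]           # second half serves as the control

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Recording the seed (or the realized assignment) matters because, as the text notes, the analysis uses the randomization itself as part of the inferential procedure.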

Many of the analyses that we will do assume that the samples are independent of each other. However, correlation is induced both by biology and by our study conditions. Examples of biological correlation are mice from the same litter (family correlation) and repeated measurements on the same mouse.

Correlation is also induced by what are sometimes called batch effects. For example, if you and a collaborator each run a replicate of an experiment in your own labs, the measurements taken within each lab are likely to be more similar than measurements taken in different labs. These similarities might arise from the lab environment, batches of feed, batches of reagents, and who actually handled the mice, took the samples, made the measurements, and so on. The resulting similarity is called intraclass correlation. We often deliberately induce intraclass correlation by blocking, as a means of controlling variability. This works well as long as we record the blocks and use that information in the statistical analysis. When the samples are correlated but we treat them as if they were independent, we underestimate the variability, which typically leads to a high error rate.
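The cost of ignoring intraclass correlation can be demonstrated by simulation. The sketch below (the design, effect sizes, and critical value are illustrative assumptions, not from the text) generates data with no true group difference but with shared batch effects, then applies a naive two-sample t-test that treats every observation as independent. The empirical type I error rate ends up far above the nominal 5%:

```python
import random
import statistics

def simulate_rejection_rate(n_batches=2, per_batch=5, batch_sd=1.0,
                            noise_sd=1.0, reps=2000, crit=2.101, seed=1):
    """Fraction of naive two-sample t-tests rejecting H0 when H0 is true
    but observations within a batch share a common batch effect.
    crit = 2.101 is the two-sided 5% critical value for t with 18 df
    (n1 = n2 = 10 here)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        groups = []
        for _group in range(2):                      # two treatment groups
            obs = []
            for _b in range(n_batches):              # batches within a group
                batch_effect = rng.gauss(0.0, batch_sd)
                obs += [batch_effect + rng.gauss(0.0, noise_sd)
                        for _ in range(per_batch)]
            groups.append(obs)
        a, b = groups
        n1, n2 = len(a), len(b)
        # pooled two-sample t statistic, wrongly treating all mice as independent
        sp2 = ((n1 - 1) * statistics.variance(a) +
               (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
        t = (statistics.mean(a) - statistics.mean(b)) / (sp2 * (1/n1 + 1/n2)) ** 0.5
        if abs(t) > crit:
            rejections += 1
    return rejections / reps

rate = simulate_rejection_rate()
print(f"empirical type I error: {rate:.3f}  (nominal: 0.05)")
```

Setting `batch_sd=0.0` removes the correlation and brings the rejection rate back near the nominal 5%, which is the point of the comparison: the data look the same to the naive test, but the shared batch effects make its variance estimate too small.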