Lesson 7: Linear Models for Differential Expression in Microarray Studies

Key Learning Goals for this Lesson:
  • Understanding why we use statistical models
  • Understanding the linear model
  • Setting up the design matrix

Introduction

Linear models are among the most widely used statistical methods. T-tests, ANOVA, ANCOVA and regression can all be formulated as special cases of linear models. In addition, generalized linear models, which (as the name suggests) generalize the linear model, are also widely used for modeling and analysis.
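To make the "special case" claim concrete, here is a minimal sketch in Python (standard library only) showing that a two-group comparison can be written as a regression on a 0/1 group indicator. The expression values and variable names are made up for illustration and are not taken from the lesson. Under this coding, the least-squares intercept equals the control mean and the slope equals the difference in group means, which is the quantity the two-sample t-test examines.

```python
import statistics

# Hypothetical log2 expression values for one gene (made up for illustration).
control = [7.1, 6.8, 7.4, 7.0, 6.9]
treated = [8.2, 7.1, 9.0, 7.6, 8.8]

# Stack the data and code group membership as an indicator:
# 0 = control, 1 = treated.
y = control + treated
x = [0] * len(control) + [1] * len(treated)

# Ordinary least squares for y = b0 + b1 * x, using the closed-form
# solution for a single predictor.
x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx            # slope
b0 = y_bar - b1 * x_bar   # intercept

mean_diff = statistics.mean(treated) - statistics.mean(control)
print("intercept b0:", b0, "| control mean:", statistics.mean(control))
print("slope b1:", b1, "| difference in means:", mean_diff)
```

The indicator column is a very small example of the kind of coding a design matrix uses; other codings are possible and change how the coefficients are interpreted.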

In most sciences we use model systems. These are either simplifications of the complex systems that we would like to study, or systems in which we can make manipulations that would not be possible in the "real" system. For example, we use mouse models for human diseases because we can create inbred lines (which are much simpler than human lineages) and we can impose conditions that would be difficult or unethical to impose on human subjects, such as special diets or exposures to pathogens.

A mathematical model is also a simplification.  We can build quite complex models, for example of protein interaction networks, that allow us to set parameters and then simulate how the system behaves.  We then usually want to determine if the "real" system behaves in the same way.

Statistical models are often less refined than mathematical models in defining the mechanisms of a system, but include stochastic (random) terms that allow us to make probability statements.  This allows us to do statistical testing and compute predictions and confidence intervals.  Once we have a model for a system, the model can be used to determine appropriate study designs, sample sizes, etc.

Statistical models essentially have two components: the systematic or "fixed effects" component, which describes the deterministic part of the system, and the random component, which describes the random parts of the system, including biological and technical variation. Usually, we are interested in estimating or testing the systematic effects caused by conditions such as genotype, developmental stage, environmental exposure, etc.
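As an illustration of the two components, a simple linear model for one gene with a single covariate might be written as follows. The notation here is an assumption made for this sketch, not notation defined in the lesson: y_i is the (log) expression in sample i, x_i is a covariate such as a treatment indicator, and the error term is taken to be Normal with constant variance.

```latex
y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{systematic component}}
      + \underbrace{\varepsilon_i}_{\text{random component}},
\qquad \varepsilon_i \sim N(0, \sigma^2)
```

Estimating or testing the systematic effects then amounts to estimating or testing the coefficients, while the variance of the error term summarizes the biological and technical variation.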

Every model includes some assumptions. In general, the more data we have, the fewer assumptions we need to make, as we can use the data to estimate more model parameters. For example, the model for the 2-sample t-test with pooled variance states that the samples may have different means but share the same variance. If both samples are sufficiently large, we can use Welch's t-test, which allows the samples to have different means and different variances. Another assumption of the t-test is that each sample comes from a population that is close to Normal.
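The difference between the two tests can be sketched in a few lines of Python (standard library only). The function names and the data below are hypothetical, chosen for illustration. The pooled version estimates a single common variance, while the Welch version estimates one variance per sample and uses the Welch-Satterthwaite approximation for the degrees of freedom.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic and degrees of freedom, assuming equal variances."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    # One common variance estimated from both samples.
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return t, nx + ny - 2

def welch_t(x, y):
    """Welch t statistic and approximate degrees of freedom, allowing unequal variances."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    a, b = vx / nx, vy / ny  # per-sample contributions to the squared standard error
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return t, df

# Hypothetical log2 expression values (made up for illustration).
group_a = [7.1, 6.8, 7.4, 7.0, 6.9]
group_b = [8.2, 7.1, 9.0, 7.6, 8.8, 6.5]

print("pooled:", pooled_t(group_a, group_b))
print("Welch: ", welch_t(group_a, group_b))
```

The Welch test has an extra variance parameter to estimate, and its approximate degrees of freedom are never larger than those of the pooled test. When the two samples have the same size, the two t statistics coincide and only the degrees of freedom differ.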

Sometimes we can manipulate the model to more closely resemble the data, and sometimes we manipulate the data to more closely resemble the model. For example, in linear models we usually assume that the noise is Normally distributed. Gene expression data are usually skewed; taking logarithms of the data tends to make the noise more symmetric and hence closer to Normal. In some cases we may do a sensitivity analysis to determine how much the results depend on violations of the model assumptions. For example, this type of analysis tells us that the t-test is quite insensitive (robust) to non-Normality when the data are roughly symmetric, but is quite sensitive to strong skewness.
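The effect of the log transformation can be illustrated with a small simulation in Python (standard library only). This is a sketch under a strong assumption: the simulated intensities are generated so that the noise is exactly Normal on the log2 scale, which real expression data only approximate. The mean, spread, sample size and the `skewness` helper are all hypothetical choices for illustration.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(1)  # fixed seed so the simulation is reproducible

# ASSUMPTION: noise is Normal on the log2 scale, so raw intensities are right-skewed.
log2_values = [random.gauss(8.0, 1.5) for _ in range(2000)]
raw = [2 ** v for v in log2_values]

# Taking log2 of the raw intensities recovers the symmetric scale.
logged = [math.log2(v) for v in raw]

print("skewness of raw intensities:", skewness(raw))
print("skewness after log2:        ", skewness(logged))
```

A skewness near zero indicates a roughly symmetric distribution, while a large positive value indicates a long right tail, which is the pattern described above for untransformed expression data.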

Formulating a model can also guide the study design and the analysis of the resulting data. For example, when we set up a study to determine if there are "treatment differences", our intention to use a t-test implies a model in which "treatment difference" means "difference in treatment means" rather than, for example, differences in variability or skewness. We use the model to determine the sample size required to achieve the desired power. And of course, after the data are collected, we then use the t-test for the analysis, possibly after checking that the data are "close to Normal".

The linear model is one of the simplest models used in statistics. It encompasses some models that you do not usually think of as "linear", such as ANOVA and polynomial trends. In this lesson we will learn more about linear models and how to set up a linear model for statistical analyses in R.