A model relating the outcome variable to the prognostic factors and treatment effects is a construct that makes use of theoretical knowledge and empirical knowledge. In mathematical and statistical models, the theoretical component is represented by one or more equations and the empirical component is represented by data. The behavior of a model is governed by its structure or functional form and by the unknown quantities or constants (parameters). Objectives of the modeling exercise might include estimation of the parameters, determination of model fit, or efficient summarization of large amounts of data.

Dr. George Box once stated: "All models are wrong, but some are useful."

Statistical models typically contain a deterministic component and a random component. A linear model is often used to describe a continuous outcome variable. In this situation, "linear" refers to the fact that the deterministic component of the model is a linear combination of parameters and covariates.

An example is the linear model that is used for multiple regression. Let Y denote the outcome variable and \(X_1, X_2, \dots , X_K\) denote K different regressors (predictors) that are measured on each of n patients. Then the statistical model for patient \(i, i = 1, 2, \dots , n\), is

\(Y_i=\beta_0+\beta_1X_{1i}+\beta_2X_{2i} + \dots + \beta_KX_{Ki}+\epsilon_i \)

where \(\beta_0\) is the intercept, \(\beta_{1}, \beta_{2}, \dots , \beta_{K}\) are the slopes for the K regressors, and \(\epsilon_{i}\) represents the random error term for patient i.

In the multiple regression model, \(\beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + \dots + \beta_{K}X_{Ki}\) represents the deterministic portion of the model for patient i and \(\epsilon_{i}\) represents the random error term for patient \(i, i = 1, 2, \dots , n\).

Typically, we assume that \(\epsilon_{1}, \epsilon_{2}, \dots , \epsilon_{n}\) are independent and identically distributed random variables, each following a \(N\left(0, \sigma^{2}\right)\) distribution.
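As a minimal sketch of the multiple regression model above, the following Python code simulates outcomes with independent \(N\left(0, \sigma^{2}\right)\) errors and then estimates the intercept and slopes by ordinary least squares. The function name, sample size, and coefficient values are illustrative assumptions and do not come from any particular study; NumPy is assumed to be available.

```python
import numpy as np

def simulate_and_fit(n=200, beta=(1.0, 2.0, -0.5), sigma=1.0, seed=0):
    """Simulate Y_i = b0 + b1*X_1i + ... + bK*X_Ki + eps_i and fit by least squares.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    k = beta.size - 1                        # number of regressors K
    x = rng.normal(size=(n, k))              # regressors X_1, ..., X_K
    eps = rng.normal(0.0, sigma, size=n)     # iid N(0, sigma^2) random errors
    design = np.column_stack([np.ones(n), x])  # column of ones for the intercept
    y = design @ beta + eps                  # deterministic part + random part
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta_hat

estimates = simulate_and_fit()
print(estimates)  # estimates of (beta_0, beta_1, beta_2)
```

With `sigma=0.0` the random component vanishes and the least-squares fit recovers the coefficients up to floating-point error, which is a convenient sanity check on the design matrix.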

A linear model for a one-way analysis of variance (ANOVA) is:

\(Y_{ij}=\mu_i + \epsilon_{ij}\)

where \(i = 1, 2, \dots , K\) denotes the \(i^{th}\) treatment group, \(j = 1, 2, \dots , n_i\) denotes the \(j^{th}\) patient within the \(i^{th}\) treatment group, \(\mu_{i}\) denotes the population mean for the \(i^{th}\) treatment group, and \(\epsilon_{ij}\) represents the random error term for the \(j^{th}\) patient within the \(i^{th}\) treatment group.
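To make the one-way ANOVA model concrete, here is a small sketch in which each \(\mu_{i}\) is estimated by the sample mean of its treatment group and the residuals play the role of the estimated \(\epsilon_{ij}\). The group labels and outcome values are made up for illustration.

```python
from statistics import mean

def group_means(outcomes_by_group):
    """Estimate mu_i by the sample mean of the outcomes in treatment group i."""
    return {group: mean(ys) for group, ys in outcomes_by_group.items()}

def residuals(outcomes_by_group):
    """Estimated errors: each outcome minus its own group's sample mean."""
    mu_hat = group_means(outcomes_by_group)
    return {group: [y - mu_hat[group] for y in ys]
            for group, ys in outcomes_by_group.items()}

# Hypothetical data: two treatment groups with unequal sizes n_1 = 3, n_2 = 2.
data = {"A": [10.0, 12.0, 14.0], "B": [20.0, 22.0]}
print(group_means(data))
print(residuals(data))
```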

A linear model for a one-way analysis of covariance (ANCOVA) with three covariates is:

\(Y_{ij}=\mu_i+\beta_1X_{1ij}+\beta_2X_{2ij}+\beta_3X_{3ij}+\epsilon_{ij}\)

where the notation is similar to that for the one-way ANOVA with K treatment groups, and \(X_{1ij}, X_{2ij}, X_{3ij}\) denote the values of the three covariates for the \(j^{th}\) patient within the \(i^{th}\) treatment group.

The covariates in an ANCOVA model (the regressors in a multiple regression model) may be continuous, ordinal, or binary.

If a covariate is categorical with L levels, it might be necessary to recode it as L - 1 distinct covariates that are binary (called dummy variables). One way to do this is to select a reference level and let the dummy variables correspond to the remaining L - 1 levels.

For example, suppose that there are four centers in a multi-center trial and that it is desirable to model center effects. The above ANCOVA model can be invoked with center #4 as the reference level:

\(X_{1ij} = 1\), if patient (i,j) is in center #1; 0 otherwise

\(X_{2ij} = 1\), if patient (i,j) is in center #2; 0 otherwise

\(X_{3ij} = 1\), if patient (i,j) is in center #3; 0 otherwise

The implications of the model are that \(\mu_{1}, \mu_{2}, \dots , \mu_{K}\) represent treatment means within the reference center (center #4). Patients within center #1 have treatment means \(\mu_{1} + \beta_{1}, \mu_{2} + \beta_{1} , \dots , \mu_{K} + \beta_{1}\), so that \(\beta_{1}\) represents the change in any treatment mean between center #4 and center #1.
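The reference-level coding described above can be sketched as follows. The helper names and the numerical values of \(\mu_{i}\) and the \(\beta\)'s are hypothetical; the point is only to show that a patient in the reference center has all dummy variables equal to 0, so the expected outcome reduces to \(\mu_{i}\).

```python
def dummy_code(value, levels, reference):
    """Return L - 1 binary indicators, one for each non-reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels if level != reference]

def expected_outcome(mu_i, betas, center, centers=(1, 2, 3, 4), reference=4):
    """Deterministic part of the ANCOVA model: mu_i + beta_1*X_1 + beta_2*X_2 + beta_3*X_3."""
    x = dummy_code(center, centers, reference)
    return mu_i + sum(b * xi for b, xi in zip(betas, x))

# Hypothetical treatment mean in center #4 and hypothetical center effects.
mu_1 = 50.0
betas = [3.0, -1.0, 2.0]

print(dummy_code(1, (1, 2, 3, 4), 4))     # indicators for a patient in center #1
print(dummy_code(4, (1, 2, 3, 4), 4))     # indicators for a patient in the reference center
print(expected_outcome(mu_1, betas, 1))   # mu_1 + beta_1
print(expected_outcome(mu_1, betas, 4))   # mu_1
```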

Statistical software packages for multiple regression typically require the user to recode categorical regressors/covariates in this manner (SAS PROC REG), whereas the statistical software packages for ANOVA and ANCOVA can recode categorical regressors/covariates for the user (the CLASS statement in SAS PROC ANOVA and SAS PROC GLM).

New regressors/covariates can be constructed as interactions among other regressors. For example, suppose that \(X_{1}\) represents age and \(X_{2}\) represents serum cholesterol. A third regressor, \(X_{3} = X_{1} \times X_{2}\), can be constructed as the product and might be important to include in the model if only old age in combination with high cholesterol has an impact on the outcome.

This example is called a first-order interaction; higher-order interactions can be constructed as products of more than two regressors. Of course, constructing more regressors in this manner can get unwieldy and lead to an unmanageable number of potential regressors to consider.
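A brief sketch of how product terms can be built, and of how quickly their number grows, is given below. The regressor names and values are hypothetical, and the count simply uses the binomial coefficient for choosing distinct regressors to multiply together.

```python
from itertools import combinations
from math import comb

def add_pairwise_interactions(row):
    """Add the product of every pair of regressors to one patient's record."""
    out = dict(row)
    for a, b in combinations(row, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

def n_product_terms(k, n_factors):
    """Number of distinct products of `n_factors` different regressors among k."""
    return comb(k, n_factors)

# Hypothetical patient: age in years, serum cholesterol in mg/dL.
patient = {"age": 60, "chol": 200}
print(add_pairwise_interactions(patient))

# With 10 regressors, count the two-way and three-way product terms.
print(n_product_terms(10, 2), n_product_terms(10, 3))
```

Even for a modest number of regressors, the candidate product terms outnumber the original regressors, which is why interactions are usually restricted to those with a prior scientific rationale.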

Treatment × covariate interactions are important to investigate in randomized and nonrandomized studies. Treatment × center interactions are important to investigate in multi-center trials.

If the treatment × covariate interactions are important, then it could be difficult to interpret main effects for treatment.

For example, suppose that in a two-armed trial (treatments A and B) baseline cholesterol level is considered an important covariate. Suppose that the treatment × cholesterol interactions are significant such that for low cholesterol levels treatment A is better than treatment B, but for high cholesterol levels treatment B is better than treatment A. Thus, it is not possible to conclude that one treatment is superior overall because the choice of the best treatment depends on baseline cholesterol levels, an important discovery in and of itself. This type of treatment × covariate interaction is called "qualitative."

If the investigator focuses on a specific region of values for the covariate, such as high baseline cholesterol, then it may be possible to determine which treatment is superior in this region.

Sometimes it is possible to make general conclusions if the interactions are due to the magnitude of the effect.

For example, suppose treatment A is 20 units better than treatment B in the presence of low baseline serum cholesterol, but treatment A is 60 units better than treatment B in the presence of high baseline serum cholesterol levels. Even though there are significant treatment × covariate interactions, it still appears that treatment A is superior. This type of treatment × covariate interaction is called "quantitative."

Profile plots (mean outcome response versus the covariate) for each treatment group will indicate graphically whether the interactions are qualitative (lines not parallel and crossing) or quantitative (lines not parallel but not crossing).
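The distinction can also be expressed as a simple rule on the treatment difference (A minus B) at two covariate levels. The sketch below is only a descriptive illustration: the function name and tolerance are assumptions, and it ignores sampling variability, so it is not a substitute for a formal test of interaction.

```python
def classify_interaction(diff_low, diff_high, tol=1e-9):
    """Classify an interaction from the A-minus-B difference at low and high covariate values.

    Equal differences mean parallel profiles (no interaction); differences of
    opposite sign mean the profiles cross (qualitative); otherwise the profiles
    are not parallel but do not cross between the two levels (quantitative).
    """
    if abs(diff_low - diff_high) <= tol:
        return "none"
    if diff_low * diff_high < 0:
        return "qualitative"
    return "quantitative"

print(classify_interaction(20, 60))    # A better by 20 at low and by 60 at high cholesterol
print(classify_interaction(15, -10))   # A better at low, B better at high cholesterol
print(classify_interaction(20, 20))    # same difference at both levels
```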