5.4 - Modeling Strategy

With three variables, there are nine possible hierarchical models, listed here from most complex to simplest; a code sketch fitting one model of each type follows the list.

  • Saturated: \((XYZ)\)
  • Homogeneous associations: \((XY, XZ, YZ)\)
  • Conditional independence (3 different models): \((XY, XZ)\), \((XY, YZ)\), \((XZ, YZ)\)
  • Joint independence (3 different models): \((XY, Z)\), \((XZ, Y)\), \((YZ, X)\)
  • Complete independence: \((X, Y, Z)\)
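
To make the hierarchy concrete, here is a minimal sketch of fitting one model of each type, assuming Python with pandas and statsmodels and a made-up \(2\times 2\times 2\) table of counts. Each log-linear model is fit as a Poisson regression on the cell counts, and the deviance \(G^2\) measures lack of fit.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical 2x2x2 table of counts; the numbers are made up purely
# for illustration.
cells = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
df = pd.DataFrame(cells, columns=["X", "Y", "Z"])
df["count"] = [25, 15, 10, 30, 20, 20, 15, 25]

# Each log-linear model is a Poisson regression on the cell counts;
# the formula terms mirror the shorthand model names above.
models = {
    "(XYZ)":        "count ~ C(X) * C(Y) * C(Z)",         # saturated
    "(XY, XZ, YZ)": "count ~ (C(X) + C(Y) + C(Z)) ** 2",  # homogeneous assoc.
    "(XY, XZ)":     "count ~ C(X) * C(Y) + C(X) * C(Z)",  # conditional indep.
    "(XY, Z)":      "count ~ C(X) * C(Y) + C(Z)",         # joint indep.
    "(X, Y, Z)":    "count ~ C(X) + C(Y) + C(Z)",         # complete indep.
}
for name, formula in models.items():
    fit = smf.glm(formula, data=df, family=sm.families.Poisson()).fit()
    print(f"{name:14s} G2 = {fit.deviance:7.3f}, df = {fit.df_resid}")
```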

Any model that lies below a given model in this list is a special case of the more complex model(s); such structure among models is known as a hierarchical model structure. With real data, we may not want to fit all of these models but instead focus only on those that make sense. For example, suppose that \(Z\) (e.g., admission) can be regarded as a response variable, and \(X\) (e.g., sex) and \(Y\) (e.g., department) are predictors.

  • In regression, we do not model the relationships among predictors but allow arbitrary associations among them. Therefore, the simplest model that we may wish to fit is the null model \((XY, Z)\), which says that neither predictor is related to the response.
  • If the null model does not fit, then we should try \((XY, XZ)\), which says that \(X\) is related to \(Z\), but \(Y\) is not. As we will see later in the course, this is equivalent to a logistic regression for \(Z\) with a main effect for \(X\) but no effect for \(Y\).
  • We may also try \((XY, YZ)\), which is equivalent to a logistic regression for \(Z\) with a main effect for \(Y\) but no effect for \(X\).
  • If neither of those models fits, we may try the model of homogeneous association, \((XY, XZ, YZ)\), which is equivalent to a logistic regression for \(Z\) with main effects for \(X\) and for \(Y\) but no interaction (see the sketch after this list).
  • The saturated model \((XYZ)\) is equivalent to a logistic regression for \(Z\) with a main effect for \(X\), a main effect for \(Y\), and an \(XY\) interaction.
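
The log-linear/logit equivalence described in the bullets above can be checked numerically. Below is a minimal sketch, reusing the hypothetical data frame `df` from the previous sketch: in the homogeneous association fit, the log-linear terms involving \(Z\) reproduce the logistic regression coefficients exactly.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Homogeneous association model (XY, XZ, YZ), fit as a Poisson GLM
# on the hypothetical cell counts in `df` from the sketch above.
loglin = smf.glm("count ~ (C(X) + C(Y) + C(Z)) ** 2",
                 data=df, family=sm.families.Poisson()).fit()

# Equivalent logistic regression: Z as the response, main effects for
# X and Y, each covariate pattern weighted by its cell count.
logit = smf.glm("Z ~ C(X) + C(Y)", data=df,
                family=sm.families.Binomial(),
                freq_weights=df["count"]).fit()

# The log-linear terms involving Z match the logistic coefficients:
# the Z main effect equals the logistic intercept, and the XZ and YZ
# interactions equal the logistic slopes for X and Y.
print(loglin.params.filter(like="C(Z)"))
print(logit.params)
```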

We will learn more about model selection as we learn more about log-linear and logit models. As we consider more three-way and higher-way tables, models, rather than single statistics or significance tests, will give us a better sense of the associations between variables and the nature of the dependencies.

Extensions to k-way tables are straightforward and work by grouping. For example, consider six binary random variables: \(X_1, X_2, X_3, X_4, X_5\), and \(X_6\). These can be represented in a \(2^6\) table. The simplest model is the model of complete independence \((X_1, X_2, X_3, X_4, X_5, X_6)\), and the saturated model is the most complex, including all possible interactions. There are now many more conditional and joint independence models that we could consider, but these can be reduced to three-way and two-way tables by grouping certain variables.

Suppose now that \(A = (X_1, X_2)\), \(B = (X_3)\), and \(C = (X_4, X_5, X_6)\). Then the joint independence model \((AB, C)\) says that \(X_1, X_2\), and \(X_3\) are jointly independent of \(X_4, X_5\), and \(X_6\). In effect, we can analyze the \(AB \times C\) table, which is \(8 \times 8\) because the compound variable \(AB\) has \(2^3 = 8\) levels, as does \(C\). We can also consider other joint and conditional independence models with other groupings of the variables. We will see more on model selection in the upcoming lessons.
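
Here is a minimal sketch of this grouping idea, assuming Python with numpy, pandas, and scipy, and simulated data standing in for a real \(2^6\) table; the joint independence model \((AB, C)\) becomes ordinary two-way independence in the collapsed \(8 \times 8\) table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Simulated stand-in for real data: 800 observations on six binary
# variables X1..X6 (independent by construction, purely illustrative).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(800, 6)),
                    columns=[f"X{i}" for i in range(1, 7)])

# Compound variables: AB combines A = (X1, X2) with B = (X3), giving
# 2^3 = 8 levels; C = (X4, X5, X6) also has 8 levels.
data["AB"] = data[["X1", "X2", "X3"]].astype(str).agg("".join, axis=1)
data["C"] = data[["X4", "X5", "X6"]].astype(str).agg("".join, axis=1)

# The joint independence model (AB, C) is just two-way independence
# in the collapsed 8x8 table, testable with a chi-square statistic.
table = pd.crosstab(data["AB"], data["C"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"X^2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")  # df = 7 * 7 = 49
```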