3: Two-Way Tables: Independence and Association


This lesson is about the analysis of two-way tables. We begin with the structure of the simplest two-way table, a \(2\times 2\) table, and its corresponding joint distribution. We then consider how to extend the goodness-of-fit tests that we saw in Lesson 2. More specifically, we will learn the Chi-squared test of independence for two discrete binary random variables, and basic measures of association, such as odds ratios and relative risk, that describe the strength of association between two binary random variables. We will then extend these concepts to an \(I \times J\) table to analyze two discrete random variables with \(I\) and \(J\) categories.


3.1 - Notation & Structure

When collecting data on two categorical variables, we can easily summarize the responses in the form of a table with the levels of one variable corresponding to the rows, the levels of the other variable corresponding to the columns, and the count of individuals answering accordingly in each cell. Specifically, we'll frequently use the following terms:

Two-way contingency table
A two-way contingency table is a cross-classification of observations by the levels of two discrete variables.
The cells of the table contain frequency counts.
The dimension of the table is determined by the number of variables.
The size of the table refers to the number of cells. For example, a 2-dimensional (2-way) table of size \(2\times2\), is a cross-classification table of two discrete variables, each with two levels, having a total of 4 cells.

Example: Therapeutic Value of Vitamin C


This is an example of a double-blind study investigating the therapeutic value of vitamin C (ascorbic acid) for treating common colds. The study was conducted during a 2-week period on a sample of 280 French skiers, but one observation had to be dropped. There are two discrete variables, each having two levels - hence the two-way table.

Table 1: Incidence of Common Colds involving French Skiers (Pauling, 1971) as reported in Fienberg (1980).

  Cold No Cold Totals
Placebo 31 109 140
Ascorbic Acid 17 122 139
Totals 48 231 279

Each cell indicates levels of both traits. For example, 31 skiers were given a placebo and contracted a cold, while 109 did not. Here are the same data, presented in a \(2\times2\) table using sample proportions instead.

Table 2: Incidence of Common Colds involving French Skiers (Pauling, 1971) as reported in Fienberg (1980).

  Cold No Cold Totals
Placebo 0.111 0.391 0.502
Ascorbic Acid 0.061 0.437 0.498
Totals 0.172 0.828 1
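The proportions in Table 2 are simply the Table 1 counts divided by the grand total. A minimal Python sketch (for illustration only; the lesson's own code files use SAS and R):

```python
# Vitamin C counts from Table 1: rows = (placebo, ascorbic), cols = (cold, no cold)
counts = [[31, 109], [17, 122]]
n = sum(sum(row) for row in counts)  # grand total = 279

# Table 2: each cell count divided by the grand total
proportions = [[round(c / n, 3) for c in row] for row in counts]
print(proportions)  # [[0.111, 0.391], [0.061, 0.437]]
```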

Here are some questions that we may ask regarding this data.

Is the probability that a member of the placebo group contracts a cold the same as the probability that a member of the ascorbic group contracts a cold?

 Are the type of treatment and cold status associated or independent? Here, independence means that whether a skier takes the placebo or ascorbic acid has no relationship with whether he or she contracts a cold.

 What are the odds of getting a cold for those taking ascorbic acid (vitamin C)?

Example: Coronary Heart Disease

This is an example of a \(2\times4\) table. The data below are taken from the Framingham longitudinal study of coronary heart disease (Cornfield, 1962). In this study, \(n=1329\) patients were classified by serum cholesterol level (mg/100 cc) and whether they had been diagnosed with coronary heart disease (CHD).

  0–199 200–219 220–259 260+ total
CHD 12 8 31 41 92
no CHD 307 246 439 245 1237
total 319 254 470 286 1329

One variable is binary and has two outcomes. The other variable has four levels - this is most likely a continuous variable, but it has been grouped into four intervals.

 Stop and Think!
What would you call the variable Cholesterol: nominal, ordinal, or interval?

For the Coronary Heart Disease example, the variable along the top row of the table is the amount of Total Cholesterol. This was originally a continuous variable but is now an interval variable because it was broken up into intervals, but it is also ordinal because there is a natural order and progression in the levels of this variable.


 Is there any evidence of a relationship/association between cholesterol level and heart disease?

Example: Smoking

This is an example of a \(3\times2\) table. The table below classifies 5375 high school students according to the smoking behavior of the student \(Z\) and the smoking behavior of the student’s parents \(Y\).

  Student smokes?
How many parents smoke? Yes (Z = 1) No (Z = 2)
Both (Y = 1) 400 1380
One (Y = 2) 416 1823
Neither (Y = 3) 188 1168
 Stop and Think!
Would you call the row variable (how many parents smoke) nominal or ordinal?

By default, the row variable is ordinal because there is a natural progression in its levels: the number of parents who smoke increases from neither to one to both. But if you are not interested in the ordinality, you can treat it as nominal.


 Question: Is there a relationship of smoking behavior between the students and their parents?

Suppose that we collect data on two binary variables, \(Y\) and \(Z\), for example, "treatment" and "contracting cold", for \(n\) sample units. Binary means that these variables take two possible values, say 1 (e.g., "cold") and 2 (e.g., "no cold").

\(Y\), taking possible values \(i = 1, \ldots, I\), where \(I = 2\),
\(Z\), taking possible values \(j = 1, \ldots, J\), where \(J = 2\).

The data then consist of \(n\) pairs,

\((y_1, z_1), (y_2, z_2), \ldots , (y_n, z_n)\)

which can be summarized in a frequency table.

Let \(n_{ij}\) be the number of subjects having the following characteristics \((Y = i, Z = j)\) (that is, the number of subjects falling into a particular cell of the two-way table, more specifically falling into the \(i\)th level of \(Y\) and the \(j\)th level of \(Z\)). The total sample size is \(\sum_{i=1}^I\sum_{j=1}^J n_{ij}=n\) . The levels of the first variable are represented by the index \(i\) and the levels of the second variable by index \(j\).

For the Vitamin C example data, \(n_{11} = 31\) means that in our sample we observed 31 individuals who took a placebo pill and got the cold. The counts may be arranged in a \(2\times2\) table:

  Z = 1 Z = 2
Y = 1 \(n_{11}\) \(n_{12}\)
Y = 2 \(n_{21}\) \(n_{22}\)

The total number of cells in the table is \(I\times J\), which is 4 in this case. In some textbooks the authors will use \(x_{ij}\) instead of \(n_{ij}\).

The observed table \(x = (n_{11}, n_{12}, n_{21}, n_{22})\) is a summary of all \(n\) responses, e.g., the values of the four counts of the Vitamin C example out of 279 total responses/individuals. We could display the contingency table \(X\) as a one-way table with four cells, but it is customary to display \(X\) as a two-dimensional table with separate row and column variables as above. Let's look at some other important structural elements of such tables.

Marginal Totals

When a subscript in a cell count \(n_{ij}\) is replaced by a plus sign (+) or a dot (.), it will mean that we have taken the sum of the cell counts over that subscript.

The row totals are

\(n_{1+} = n_{11} + n_{12}\)
\(n_{2+} = n_{21} + n_{22}\)

the column totals are

\(n_{+1} = n_{11} + n_{21}\)
\(n_{+2} = n_{12} + n_{22}\)

and the grand total is \(n_{++} = n_{11} + n_{12} + n_{21} + n_{22} = n\).

These quantities are often called marginal totals, because they are conveniently placed in the margins of the table, like this:

  Z = 1 Z = 2 total
Y = 1 \(n_{11}\) \(n_{12}\) \(n_{1+}\)
Y = 2 \(n_{21}\) \(n_{22}\) \(n_{2+}\)
total \(n_{+1}\) \(n_{+2}\) \(n_{++}\)

For example, the marginal totals for the Vitamin C data are \(n_{1+} = 140\), and \(n_{2+} = 139\).
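The marginal totals can be computed mechanically from the cell counts. Here is a small Python sketch (an illustration only, with hypothetical variable names; it is not part of the lesson's code files):

```python
# Vitamin C cell counts n[i][j]: rows = treatment levels, cols = cold status
n = [[31, 109], [17, 122]]

row_totals = [sum(row) for row in n]                             # n_{i+}
col_totals = [sum(n[i][j] for i in range(2)) for j in range(2)]  # n_{+j}
grand_total = sum(row_totals)                                    # n_{++}

print(row_totals)   # [140, 139]
print(col_totals)   # [48, 231]
print(grand_total)  # 279
```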

Joint Distribution

If the sample units are randomly sampled from a large population, then the observed table \(x = (n_{11}, n_{12}, n_{21}, n_{22})\) will have a multinomial distribution with index \(n = n_{++}\) and a parameter vector

\(\boldsymbol{\pi} =(\pi_{11},\pi_{12},\pi_{21},\pi_{22}) =\{\pi_{ij}\}\)

where \(\pi_{ij} = P (Y = i, Z = j)\) is the probability that a randomly selected individual in the population of interest falls into the \((i, j)\)th cell of the contingency table, that is, into the \(i\)th level of \(Y\) and \(j\)th level of \(Z\).

  Z = 1 Z = 2 total
Y = 1 \(\pi_{11}\) \(\pi_{12}\) \(\pi_{1+}\)
Y = 2 \(\pi_{21}\) \(\pi_{22}\) \(\pi_{2+}\)
total \(\pi_{+1}\) \(\pi_{+2}\) \(\pi_{++} = 1\)

For observed data, we may also use \(p\) instead of \(\hat{\pi}\) to represent a sample proportion. That is, \(p_{ij}=\frac{n_{ij}}{n}\) is the sample proportion of observations in the \((i, j)\)th cell.

Marginal and Conditional Distributions

If we sum the joint probabilities over one variable, we get the marginal distribution. For example, the probability distribution \({\pi_{i+}}\) is the marginal distribution for \(Y\) where \(P(Y = 1) = \pi_{1+}\) and \(P(Y = 2) = \pi_{2+}\) and \(\pi_{1+} + \pi_{2+} =1\). Then the observed marginal distribution of \(Y\) is \({p_{i+}}\). For the Vitamin C data, the observed marginal distribution of type of treatment is

\(p_{1+}= \dfrac{n_{1+}}{n} = \dfrac{140}{279} = 0.502\) and \(p_{2+}= \dfrac{n_{2+}}{n} = \dfrac{139}{279} = 0.498\)

Think About it!

What is the marginal distribution of \(Z\) in the Vitamin C example? What is the observed marginal distribution of \(Z\) for the same example?

The conditional probability distribution is the distribution of one variable given the values of the other variable(s). For example, the conditional distribution of \(Z\), given the values of \(Y\), is \(\pi_{j|i}=\frac{\pi_{ij}}{\pi_{i+}}\), such that \(\sum_j \pi_{j|i} = 1\). Intuitively, we're asking how the distribution of \(Z\) changes as the categories of \(Y\) change.

Here are the observed conditional probability distributions of \(Z\), given \(Y\).

  Z = 1 Z = 2 total
Y = 1 \(\dfrac{n_{11}}{n_{1+}}= p_{1|1}\) \(\dfrac{n_{12}}{n_{1+}}= p_{2|1}\) 1
Y = 2 \(\dfrac{n_{21}}{n_{2+}}= p_{1|2}\) \(\dfrac{n_{22}}{n_{2+}}= p_{2|2}\) 1

Let's see what this means for the distribution of cold, given treatment. There are two conditional probability distributions, depending on whether the treatment is "vitamin" or "placebo". For the subjects receiving the placebo treatment, we have \(P(\mbox{"yes"} | \mbox{"placebo"}) =31/140\) and \(P(\mbox{"no"} | \mbox{"placebo"}) =109/140\). Notice that these two values necessarily add to 1. Similarly, for the vitamin treatment, we have \(P(\mbox{"yes"} | \mbox{"vitamin"}) =17/139\) and \(P(\mbox{"no"} | \mbox{"vitamin"}) =122/139\).
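These conditional distributions are just cell counts divided by their row totals. A short Python illustration (hypothetical, not from the lesson's R/SAS files):

```python
# rows: placebo, vitamin; cols: cold, no cold
n = [[31, 109], [17, 122]]

# conditional distribution of Z (cold status) given each row of Y (treatment):
# p_{j|i} = n_{ij} / n_{i+}
cond = [[nij / sum(row) for nij in row] for row in n]

print([round(p, 3) for p in cond[0]])  # placebo row: [0.221, 0.779]
print([round(p, 3) for p in cond[1]])  # vitamin row: [0.122, 0.878]
# each conditional distribution sums to 1 (up to floating point)
```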

Stop and Think!

What is the observed conditional probability distribution of treatment, given cold for the Vitamin C data?

Notation extension to any \(I \times J\) table

For general \(Y\) and \(Z\), the counts are usually arranged in a two-way table:

\(\begin{array}{c|cccc|} & Z=1 & Z=2 & \cdots & Z=J \\ \hline Y=1 & n_{11} & n_{12} & \cdots & n_{1 J} \\ Y=2 & n_{21} & n_{22} & \cdots & n_{2 J} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Y=I & n_{I 1} & n_{I 2} & \cdots & n_{I J} \\ \hline \end{array} \)

The total number of cells is \(I\times J\), and the marginal totals are:

\(n_{i+}=\sum\limits_{j=1}^J n_{ij},\qquad i=1,\ldots,I \)

\(n_{+j}=\sum\limits_{i=1}^I n_{ij},\qquad j=1,\ldots,J \)

\(n_{++}=\sum\limits_{i=1}^I \sum\limits_{j=1}^J n_{ij}=n \)

If the sample units are randomly selected from a large population, we can assume that the cell counts \(\left(n_{11}, \dots , n_{IJ}\right)\) have a multinomial distribution with index \(n_{++} = n\) and parameters

\(\pi = (\pi_{11}, \dots, \pi_{IJ})\)

This is the general multinomial model, and it is often called the saturated model, because it contains the maximum number of unknown parameters. There are \( I\times J\) unknown parameters (elements) in the vector \(\pi\), but because the elements of \(\pi\) must sum to one (this is a probability distribution), there are really \( I\times J - 1\) unknown parameters that we need to estimate.

3.2 - Sampling Schemes

3.2 - Sampling Schemes

Generating two-way tables of counts is similar to generating one-way tables of counts but with a higher degree of complexity. The main generating probability mechanisms are Poisson, Binomial, and Multinomial models, but for two-way tables, the margins play a big role. We will discuss the following sampling schemes:

  • Unrestricted sampling (Poisson)
  • Sampling with fixed total sample size (Multinomial)
  • Sampling with fixed certain marginal totals (Product-Multinomial, Hypergeometric)

These sampling models extend to higher-dimensional tables as we will see in the later lessons.

Poisson Sampling

Think of collecting data until sundown on the ski slopes of a French ski resort regarding whether the skiers have taken placebo or Vitamin C and whether they have a cold or not. As each skier is encountered, we ask him or her about these two variables of interest. Or, consider standing at the crossing of two major thoroughfares of a Midwest town and classifying each passing car as to whether it is driven by a male or a female and whether the car is an American make or not. In both examples, the total sample size is completely random, and so are both margins of the table. Hence, each cell is considered an independent Poisson variable. The cell counts will follow a Poisson distribution

\(n_{ij}\sim Poisson(\lambda_{ij})\)

independently for \( i = 1, \ldots , I\) and \(j = 1, \ldots , J\). In this scheme, the overall \( n\) is not fixed. \(\lambda_{ij}\) is the parameter describing the rate of occurrence for the \((i, j)\)th cell. The expected mean and the variance of each cell count are

\(E(n_{ij})=Var(n_{ij})=\lambda_{ij}\)

We can think of rewriting the two-way table as a one-way table; e.g. we can think of \(\lambda_{ij}\) as the \(\lambda_{i}\) we saw in one-way frequency tables.

 Question: Are the rates in different cells the same or different for different cells?

It depends on the context of the problem and the model assumptions; we could assume the same underlying \(\lambda\) or different ones.

NOTE: If \(n_{++}\) is not of interest, Poisson data may be analyzed as if it were multinomial. Conversely, if data are multinomial, we may analyze them as if they were Poisson. The inferences for \(\pi\) are valid, and the inferences for \(n_{++}\) may be ignored. In this instance, we think of every cell as a separate Poisson variable. Convenience dictates which sampling scheme will be used in which set-up. This is because the likelihood function based on \(n_{ij}\sim Poisson(\lambda_{ij})\) may be factored into the product of a Poisson likelihood for \(n\),

\(n\sim Poisson(\lambda_{++})\)

and a multinomial likelihood for \(\{n_{ij}\}\) given \(n\), with parameters

\(\pi_{ij}=\dfrac{\lambda_{ij}}{\lambda_{++}}\)

Here, the total \(n\) provides no information about \(\pi = \pi_{ij}\). From a likelihood standpoint, we get the same inferences about \(\pi\) whether \(n\) is regarded as fixed or random.

Multinomial Sampling

Consider now collecting data on a predetermined number of individuals (e.g., 279 in the Vitamin C example) and classifying them according to two binary variables (e.g., treatment and response). If we draw a sample of \(n\) subjects from a population and record \((Y, Z)\) for each subject, then the joint distribution of \(\{n_{ij}\}\) is multinomial with index \(n\) and parameter \(\pi = \{\pi_{ij}\}\),

\(P(\{n_{ij}\})=\dfrac{n!}{\prod_{i,j}n_{ij}!}\prod_{i,j}\pi_{ij}^{n_{ij}}\)

where the grand total \(n\) is fixed and known. The parameters are functions of the cell means:

\(\pi_{ij}=\dfrac{E(n_{ij})}{n}\)

Question: Think of rewriting a two-way table as a one-way frequency table. How would you do this for the Vitamin C example?

An extension of this case occurs when, instead of fixing the total \(n\), either row totals OR the column totals are assumed fixed, as we will see next.

Product Multinomial Sampling

Consider now collecting data on 140 "placebo" and 139 "vitamin C" individuals and classifying them according to the response (e.g., whether they got a cold or not). Here, data are collected on a predetermined number of individuals for each category of one variable, and both sets are classified according to the levels of the other variable of interest. Hence, one margin is fixed by design while the other is free to vary. This type of sampling is called Independent Multinomial Sampling. If the response variable has only two levels, it is also called Independent Binomial Sampling, which is a special case of independent multinomial sampling.

If we decide beforehand that we will draw \(n_{i+}\) subjects with characteristic \(Y = i (i = 1, \ldots , I)\) and record the \(Z\)-value for each one, each row of the table \((n_{i1}, n_{i2}, \ldots , n_{iJ})\) is then multinomial with probabilities \(\pi_{j|i} = \dfrac{\pi_{ij}}{\pi_{i+}}\), and the rows are independent. The full likelihood is obtained by taking the product of the individual multinomial PMFs and therefore is known as product-multinomial sampling scheme.

Viewing the data as product-multinomial is appropriate when the row totals truly are fixed by design, as in

  • stratified random sampling (strata defined by \(Y\) )
  • an experiment where \(Y =\) treatment group

It’s also appropriate when the row totals are not fixed, but we are interested in inference on \(P(Z | Y)\) and not \(P(Y)\). That is, when \(Z\) is the outcome of interest, and \(Y\) is an explanatory variable that we do not wish to model.

Suppose the data are multinomial. Then we may factor the likelihood into two parts:

  • a multinomial likelihood for the row totals \((n_{1+}, n_{2+}, \ldots , n_{I+})\) with index \(n\) and parameter \({\pi_{i+}}\)
  • independent multinomial likelihoods for the rows, \((n_{i1} , n_{i2} , \ldots , n_{iJ} )\) with parameters \({\pi_{j|i} = \dfrac{\pi_{ij}}{\pi_{i+}}}\).

Therefore, if the parameters of interest can be expressed as functions only of the \(\pi_{j|i}\)’s and not the \(\pi_{i+}\)’s, then correct likelihood-based inferences may be obtained by treating the data as if they were product-multinomial. Conversely, if the data are product-multinomial, then correct likelihood-based inferences about functions of the \(\pi_{j|i}\)s will be obtained if we analyze the data as if they were multinomial. We may also treat them as Poisson, ignoring any inferences about \(n_{++}\) or \(n_{i+}\).

Hypergeometric Sampling

We may encounter data where both the row totals \((n_{1+}, . . . , n_{I+})\) and the column totals \((n_{+1}, . . . , n_{+J} )\) are fixed by design. The best-known example of this is Fisher’s hypothetical example of the Lady Tasting Tea, which will be discussed in the section on Exact Tests. Even when both sets of marginal totals are not fixed by design, some statisticians like to condition on them and perform "exact" inference when the sample size is small and asymptotic approximations are unlikely to work well.

In a \(2\times2\) table, the resulting sampling distribution is hypergeometric, which we introduced in Lesson 1. Recall that the hypergeometric distribution describes the probability that in a sample of \(n\) units drawn without replacement from a finite population of size \(N\), there are \(k\) successes. Consider drawing \(n\) balls from a box with red and blue balls, where there is a total of \(N\) balls, \(D\) of which are red. What's the probability that we will get exactly \(k\) red balls?

  draw no draw total
red k D-k D
blue n-k N+k-n-D N-D
total n N-n N

\(P(k,N,D,n)=\dfrac{\binom{D}{k} \binom{N-D}{n-k}}{\binom{N}{n}},\qquad k=0,1,2,\ldots,n\)
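The formula above can be checked with a short Python sketch using math.comb; the toy box with \(N=10\) balls, \(D=4\) red, and \(n=3\) draws is an assumption made for illustration:

```python
from math import comb

def dhyper(k, N, D, n):
    """P(exactly k red balls in a sample of n drawn without
    replacement from N balls, of which D are red)."""
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

# toy illustration: box with N=10 balls, D=4 red, draw n=3
probs = [dhyper(k, 10, 4, 3) for k in range(4)]
print([round(p, 4) for p in probs])  # [0.1667, 0.5, 0.3, 0.0333]
# the probabilities over k = 0, 1, ..., n sum to 1
```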

3.3 - Test for Independence

3.3 - Test for Independence

Two-way contingency tables show the joint behavior (distribution) of two discrete variables. Given an \(I\times J\) table, it is therefore natural to ask if and how \(Y\) and \(Z\) are related.

Suppose for the moment that there is no relationship between \(Y\) and \(Z\), i.e., they are independent.

Statistical Independence

By statistical independence, we mean that the joint probabilities are equal to the product of the marginal probabilities:

\(\pi_{ij}=\pi_{i+}\pi_{+j}\)

for all pairs of \(i, j\). If the data are independent and we know the marginal totals, then we can determine the exact probabilities for each of the cells. Independence may be expressed in other forms also, but for now, we will use this definition. Intuitively, independence means that the behavior of one variable, say \(Y\), will not be impacted by the behavior of \(Z\). In the Vitamin C example, independence means that whether or not a skier takes vitamin C has nothing to do with whether or not he/she has a cold.

Chi-Squared Test of Independence

The hypothesis of independence can be tested using the general method of goodness-of-fit test described earlier.

\(H_0\): the independence model is true i.e. \(\pi_{ij} = \pi_{i+}\pi_{+j}\) for all pairs of \((i, j)\)

versus the alternative

\(H_A\): the saturated model is true, i.e. \(\pi_{ij} \ne \pi_{i+}\pi_{+j}\) for at least one pair of \((i, j)\)

  1. Step 1

    calculate expected counts under the independence model

  2. Step 2

    compare the expected counts \(E_{ij}\) to the observed counts \(O_{ij}\)

  3. Step 3

    calculate \(X^2\) and/or \(G^2\) for testing the hypothesis of independence, and compare the values to the appropriate chi-squared distribution with correct df \((I-1)(J-1)\)

Before we see how to do this in R and SAS, let's see more about the saturated model and independence model in \(I\times J\) tables.

Saturated Model

Suppose that the cell counts \(n_{11}, \dots , n_{IJ}\) have a multinomial distribution with index \(n_{++} = n\) and parameters

\(\pi = (\pi_{11}, \ldots, \pi_{IJ})\)

(these results also work for Poisson or product-multinomial sampling). If the saturated model is true, the number of unknown parameters we need to estimate is maximal (just like in one-way tables) and it is equal to the number of cells in the table. But because the elements of \(\pi\) must sum to one, the saturated model actually has \(IJ − 1\) free parameters. For example if we had a \(3\times5\) table, we would have \(15 - 1 = 14\) unique parameters \(\pi_{ij}\)s that need to be estimated.

Independence Model

If the two variables Y and Z are independent, then \(\pi\) has a special form. Let

\(\pi_{i+} = P(Y = i), i = 1, 2, \ldots , I \)
\(\pi_{+j} = P(Z = j), j = 1, 2, \ldots, J\)

Note that \(\sum\limits_{i=1}^I \pi_{i+}=\sum\limits_{j=1}^J \pi_{+j}=1\), so the vectors \((\pi_{1+}, \pi_{2+}, \ldots , \pi_{I+})\) and \((\pi_{+1}, \pi_{+2}, \ldots , \pi_{+J} )\) representing the marginal distributions (row probabilities and column probabilities) contain \( I − 1\) and \(J − 1\) unknown parameters, respectively.

If \(Y\) and \(Z\) are independent, then any element of \(\pi\) can be expressed as

\(\pi_{ij}=\pi_{i+}\pi_{+j}\)

Thus, under independence, \(\pi\) is a function of \((I − 1) + (J − 1)\) unknown parameters.

The parameters under the independence model can be estimated as follows. Note that the vector of row sums (observed marginal counts for the row variable \(Y\)), \((n_{1+},n_{2+},\ldots,n_{I+})\) has a multinomial distribution with index \(n\) and parameter \(\pi_{i+}\). The vector of column sums (observed marginal counts for the column variable \(Z\)), \((n_{+1},n_{+2},\ldots,n_{+J})\) has a multinomial distribution with index \(n = n_{++}\) and parameter \(\pi_{+j}\). The elements of \(\pi_{i+}\) and \(\pi_{+j}\) can thus be estimated by the sample proportions in each margin:

\(\hat{\pi}_{i+}=n_{i+}/n_{++},\qquad i=1,2,\ldots,I\)


\(\hat{\pi}_{+j}=n_{+j}/n_{++},\qquad j=1,2,\ldots,J\)


Then, the estimated expected cell frequencies under the independence model are

\(E_{ij}=n\hat{\pi}_{ij}=n\hat{\pi}_{i+}\hat{\pi}_{+j}=\dfrac{n_{i+}n_{+j}}{n}\)

which can be remembered as

expected frequency = row total × column total / grand total .
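As a quick check of this rule on the Vitamin C counts, here is a Python illustration (variable names are hypothetical; this is not part of the lesson's code):

```python
# expected counts under independence for the Vitamin C table:
# E_ij = (row total * column total) / grand total
n = [[31, 109], [17, 122]]
row = [sum(r) for r in n]                    # [140, 139]
col = [n[0][j] + n[1][j] for j in range(2)]  # [48, 231]
total = sum(row)                             # 279

E = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]
print([[round(e, 2) for e in r] for r in E])  # [[24.09, 115.91], [23.91, 115.09]]
```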

The independence model is a sub-model of (i.e., a special case of) the saturated model. To test if the independence model fits, we compare the expected to the observed counts, and more formally, we do this by computing \(X^2\) or \(G^2\) statistics.

Pearson and Deviance Statistics

Since for jointly observed \((Y,Z)\) the two-way table counts can be viewed as a single multinomial distribution with \(IJ\) categories, we can apply the chi-square approximation in the same way we applied it for the goodness-of-fit tests; we just need to adapt to the double index and sum over all cells in both dimensions. That is, the quantity

\(\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(n_{ij}-n\pi_{ij})^2}{n\pi_{ij}}\)

is approximately chi-squared with degrees of freedom equal to \(\nu=IJ-1\). And if we have a null hypothesis in mind, say for independence, we can use the estimated probabilities under that hypothesis to construct both the Pearson and likelihood ratio test statistics. The degrees of freedom would be reduced by the number of estimated parameters as well. In what follows below, we assume the null hypothesis (reduced model) to be tested is that of independence, but other hypotheses could also be considered.

Pearson goodness-of-fit statistic
The Pearson goodness-of-fit statistic is

\(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(n_{ij}-n\hat{\pi}_{ij})^2}{n\hat{\pi}_{ij}}\)

where \(\hat{\pi}_{ij}=(n_{i+}/n)(n_{+j}/n)\) under the independence model. Note that this expression still corresponds to

\(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}\)

where \(O_{ij} = n_{ij}\) is the observed count and \(E_{ij} = E(n_{ij})\) is the expected count under the null hypothesis.

The deviance statistic can similarly be calculated for two-way tables:

\(G^2=2\sum\limits_{i=1}^I \sum\limits_{j=1}^J n_{ij}\log\left(\dfrac{n_{ij}}{n\hat{\pi}_{ij}}\right) = 2\sum\limits_{i=1}^I \sum\limits_{j=1}^J O_{ij}\log\left(\dfrac{O_{ij}}{E_{ij}}\right)\)

\(G^2\) is also called the likelihood-ratio test statistic or likelihood-ratio chi-squared test statistic. Recall from the discussion on one-way tables that we are comparing the likelihoods of the assumed model under \(H_0\) and some alternative model, \(H_A\), typically the saturated model (i.e., the observed data) by default. More generally, the likelihood-ratio test statistic can be described as follows:

Let max \(L(H_0)\) be the maximum of the likelihood when parameters satisfy \(H_0; H_0\) usually has more restrictions on the parameters.

Let max \(L(H_A)\) be the maximum of the likelihood when parameters satisfy \(H_A; H_A\) usually has no or fewer restrictions on the parameters.

Then the likelihood-ratio statistic would be:

\(\Lambda=\dfrac{\max L(H_0)}{\max L(H_A)}\)

and the deviance \(G^2 = −2\log(\Lambda)\).

The smaller the likelihood under \(H_0\) (less chance of the restricted model to hold given the data), the more evidence you would have against \(H_0\), that is, the smaller \(\Lambda\) and greater \(G^2\).

What are the degrees of freedom for this test?

The general rule for DF

DF are equal to the number of parameters specified (estimated) under the alternative model (hypothesis) minus the number of parameters estimated under the null model (hypothesis).

Computing Degrees of Freedom

Recall, under the saturated model, \(\pi\) contains \(IJ-1\) free (unique) parameters. And under the independence model, \(\pi\) is a function of \((I-1)+(J-1)\) parameters, since each joint probability \(\pi_{ij}\) can be written as the product of the marginals \(\pi_{i+}\pi_{+j}\), each of which has a sum-to-one constraint. The degrees of freedom are therefore

\(\nu=(IJ-1)-[(I-1)+(J-1)]=(I-1)(J-1)\)

For \(2\times2\) tables, this reduces to \((2 - 1)(2 - 1) = 1\).

A large value of \(X^2\) or \(G^2\) indicates that the independence model is not plausible and that \(Y\) and \(Z\) are related. Under the null hypothesis, \(X^2\) and \(G^2\) are approximately distributed as a chi-squared distribution with \(\nu= (I − 1)(J − 1)\) degrees of freedom, provided that (a) the \(n\) sample units are iid (i.e., there is no clustering), and (b) the expected counts \(E_{ij}\) are sufficiently large (recall the discussion in one-way tables). At a minimum, we'd like to have all \(E_{ij}\ge1\), with at least 80% of them 5 or more.

For \(2\times2\) tables, the 95th percentile of \(\chi^2_1\) is 3.84, so an observed value of \(X^2\) or \(G^2\) greater than 3.84 means that we can reject the null hypothesis of independence at the .05 level.
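Putting the pieces together, the following Python sketch (an illustration only, not part of the lesson's SAS/R files) computes \(X^2\), \(G^2\), and the degrees of freedom for the Vitamin C table and compares \(X^2\) to the 3.84 cutoff:

```python
from math import log

# observed Vitamin C counts: rows = (placebo, ascorbic), cols = (cold, no cold)
O = [[31, 109], [17, 122]]
row = [sum(r) for r in O]                    # n_{i+}
col = [O[0][j] + O[1][j] for j in range(2)]  # n_{+j}
n = sum(row)                                 # n_{++}

# expected counts under independence: E_ij = n_{i+} n_{+j} / n
E = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

X2 = sum((O[i][j] - E[i][j]) ** 2 / E[i][j] for i in range(2) for j in range(2))
G2 = 2 * sum(O[i][j] * log(O[i][j] / E[i][j]) for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)

print(round(X2, 2), round(G2, 2), df)  # 4.81 4.87 1
print(X2 > 3.84)                       # True: reject independence at the .05 level
```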

Example: Vitamin C


How can we do the test of independence computationally? Let's illustrate this first using the Vitamin C example, which is the \(2\times2\) case. We're interested in whether the treatment type and contracting cold are associated.

Here is an example in SAS using the program code VitaminC.sas; see the R tab for the R code.

data ski;
input treatment $ response $ count;
datalines;
placebo cold 31
placebo nocold 109
ascorbic cold 17
ascorbic nocold 122
;
proc freq data=ski;
weight count;
tables treatment*response / chisq relrisk riskdiff expected;
exact or;
run;

Let's look at different parts of the output. SAS output for this program.

Heading I:

Table of treatment by response section produces the table with observed, expected values, sample proportions, and conditional probabilities.

[The FREQ Procedure output: the treatment by response table showing, for each cell, the observed frequency, expected count, overall percent, row percent, and column percent.]
Heading II:

Statistics for Table of treatment by response section produces various test statistics, such as \(X^2\) and \(G^2\).

The FREQ Procedure

Statistics for Table of treatment by response

Statistic DF Value Prob
Chi-Square 1 4.8114 0.0283
Likelihood Ratio Chi-Square 1 4.8717 0.0273
Continuity Adj. Chi-Square 1 4.1407 0.0419
Mantel-Haenszel Chi-Square 1 4.7942 0.0286
Phi Coefficient   -0.1313  
Contingency Coefficient   0.1302  
Cramer's V   -0.1313  

\(X^2 = 4.8114\) and \(G^2 = 4.8717\), with df = 1, indicate strong evidence for rejecting the independence model. Continuity Adj. Chi-Square = 4.1407 with \(p\)-value = 0.0419 is Pearson's \(X^2\) adjusted slightly for small cell counts. The adjustment is to subtract 0.5 from the absolute difference between the observed and expected counts in the formula for the \(X^2\) statistic, i.e., \(|O_{ij}-E_{ij}|-0.5\). The \(X^2\) statistic with this adjustment gives conservative inference; that is, it gives a bigger \(p\)-value than the usual Pearson \(X^2\) statistic without the correction. But since SAS can produce exact tests, we won't need to consider this statistic.

Here is the test of independence for Vitamin C example, also found in the section with R files VitaminC.R and its output, VitaminC.out.

### enter the data as a 2x2 table
ski <- matrix(c(31, 17, 109, 122), ncol = 2, dimnames = list(treatment = c("placebo", "ascorbic"), response = c("cold", "nocold")))

### Pearson's Chi-squared test with Yates' continuity correction
result <- chisq.test(ski)

### Let's look at the observed, expected values and the residuals
result$observed
result$expected
result$residuals

### Pearson's Chi-squared test withOUT Yates' continuity correction
result <- chisq.test(ski, correct = FALSE)

Notice that by default the chisq.test() function in R will give us the \(X^2\) statistic with a slight adjustment for small cell counts (Yates' continuity correction). The adjustment is to subtract 0.5 from the absolute difference between the observed and expected counts in the formula for the \(X^2\) statistic, i.e., \(|O_{ij}-E_{ij}|-0.5\). The \(X^2\) statistic with this adjustment gives conservative inference; that is, it gives a bigger \(p\)-value than the usual Pearson \(X^2\) statistic without the correction. To get the usual \(X^2\), we need to invoke the option correct = FALSE.

To compute the deviance statistic for two-way tables, we can use the function LRstats.R or one of the R packages, such as vcd.

Example: Coronary Heart Disease

Next, we return to the Coronary Heart Disease example from the introduction and ask, "Is having coronary heart disease independent of the cholesterol level in one's body? Is there any evidence of a relationship/association between cholesterol and heart disease?"

Test of Independence in SAS - Coronary Heart Disease Example

Let's see the same calculation using the SAS code below: HeartDisease.sas

data chd;
input CHD $ serum $ count @@;
datalines;
chd 0-199 12 chd 200-219 8 chd 220-259 31 chd 260+ 41
nochd 0-199 307 nochd 200-219 246 nochd 220-259 439 nochd 260+ 245
;
proc freq data=chd;
   weight count;
   tables CHD*serum / chisq expected deviation cmh cellchi2 measures;
   /*exact fisher or / alpha=.05;*/
run;

See the complete SAS output: HeartDisease SAS Output.

Here is a portion of the output from SAS with the Pearson chi-square statistic and Deviance (likelihood-ratio chi-square) statistic:

Statistics for Table of CHD by serum

Statistic DF Value Prob
Chi-Square 3 35.0285 <.0001
Likelihood Ratio Chi-Square 3 31.9212 <.0001
Mantel-Haenszel Chi-Square 1 26.1475 <.0001
Phi Coefficient   0.1623  
Contingency Coefficient   0.1603  
Cramer's V   0.1623  

Test of Independence in R - Coronary Heart Disease Example

Two different computations are done in the HeartDisease.R file using the function chisq.test(). Here is the first:

### Example: Heart Disease Example Lesson 3 ##
### Simple line by line R code

## enter data
heart <- c(12,8,31,41,307,246,439,245) 
heart <- matrix(heart, nrow=2, byrow=TRUE)

## run the chi-squared test of independence & save it into a new object
result <- chisq.test(heart)

## Let's look at the observed, expected values and the residuals
result$observed; result$expected; result$residuals

### Likelihood Ratio Test
## make sure you have the function LRstats() loaded before applying it to the table

## Let's calculate the conditional probabilities
## the following function gives the desired marginal, in this case, the counts for the serum groups
colSums(heart)

## let's look at the counts for the four groups with CHD
heart[1,]

## then the counts for the four groups with NOCHD, which is the second row of the matrix we created above
heart[2,]

### conditional probabilities are:
heart[1,] / colSums(heart)
heart[2,] / colSums(heart)

### Nice R code that corresponds to SAS code and output
heart=matrix(c(12,307,8,246,31,439,41,245), ncol=4, dimnames=list(CHD=c("chd", "nochd"), serum=c("0-199", "200-219","220-259","260+")))

### Chi-Square Independence Test
chisq.test(heart)

### Let us look at the Percentage, Row Percentage and Column Percentage 
### of the total observations contained in each cell.
prop.table(heart)
prop.table(heart, 1)
prop.table(heart, 2)


###### Computing various measures of association

### For the Pearson correlation coefficient 
### and Mantel-Haenszel statistic, 
### for IxJ tables, you can also use the 
### pears.cor() function. 
### Make sure you run this function first!
### c(1,2) and c(1,2,3,4) are the vectors of score values 
pears.cor(heart, c(1,2), c(1,2,3,4)) 
### and this should give you r=-0.14, M2=26.1475



You will notice in the file that, unlike for \(2\times 2\) tables where we had to worry about the continuity correction in R, there is no such correction for \( I \times J\) tables. It doesn't matter, in this example, whether you call the function chisq.test(heart, correct=TRUE) or chisq.test(heart, correct=FALSE) because the results are the same.

\(X^2= 35.0285\), df = 3, \(p\)-value = 1.202e-07

and the likelihood-ratio test statistic is

\(G^2= 31.9212\), df = 3, \(p\)-value = 5.43736e-07

Notice that the chisq.test() function does not compute \(G^2\), but we included extra code to do that. For the complete output file, see the HeartDisease.out file. Here is also a more general function LRstats.R for computing the \(G^2\) for two-way tables. The results are discussed further below.
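As a numerical cross-check, both statistics can be computed directly from their definitions. Below is a minimal Python sketch (an illustration; the lesson's own code is in R and SAS):

```python
import math

# observed counts: rows = CHD status (chd, nochd),
# columns = serum groups (0-199, 200-219, 220-259, 260+)
obs = [[12, 8, 31, 41],
       [307, 246, 439, 245]]

n = sum(map(sum, obs))
row_tot = [sum(r) for r in obs]
col_tot = [sum(obs[i][j] for i in range(2)) for j in range(4)]

# expected counts under independence: E_ij = n_{i+} n_{+j} / n
expected = [[row_tot[i] * col_tot[j] / n for j in range(4)] for i in range(2)]

# Pearson chi-square: sum of (O - E)^2 / E over all cells
X2 = sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(4))

# deviance (likelihood-ratio) statistic: 2 * sum of O * log(O / E)
G2 = 2 * sum(obs[i][j] * math.log(obs[i][j] / expected[i][j])
             for i in range(2) for j in range(4))
# X2 is approximately 35.03 and G2 approximately 31.92,
# matching the SAS output above (df = 3)
```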


We reject the null hypothesis of independence because of the large values of the chi-square statistics. Notice the degrees of freedom are equal to \(3 = (2-1)(4-1)\), and thus the \(p\)-value is very low. Therefore, through the \(X^2\) test for independence, we have strong evidence that a relationship exists between cholesterol and CHD.

A good statistical analysis, however, should not end with the rejection of a null hypothesis. Once we have demonstrated that a relationship exists, we may be interested in which cells were particularly revealing. To do this we consider computing and evaluating the residuals.

3.4 - Difference in Proportions

3.4 - Difference in Proportions

When the null hypothesis of independence is rejected, the nature of dependence---its direction and magnitude---needs to be measured. A statistic that measures the direction and magnitude of the relationship between two variables is called an effect size or a measure of association.

One of the most intuitive measures of association is the difference in proportions, which compares the relative frequency of important characteristics between two groups. For example, in the Vitamin C study, we want to know if the probability of a member of the placebo group contracting a cold is the same as the probability of a member of the ascorbic acid group contracting a cold.

Regarding \(Z\) as a response and \(Y\) as an explanatory variable, the difference in proportions for a \(2 \times 2\) table is

\(\delta =P(Z=1|Y=1)-P(Z=1|Y=2)= \dfrac{\pi_{11}}{\pi_{1+}}-\dfrac{\pi_{21}}{\pi_{2+}} = \pi_{1|1}-\pi_{1|2}\)

where \(\pi_{1|1}\) is the probability of "success" (e.g., "cold"), given row 1, and \(\pi_{1|2}\) is the probability of "success", given row 2. Recall that these are the conditional probabilities we already described for the Vitamin C example. Thus, the probability of "failure" (e.g., "no cold"), given row 1 is \(\pi_{2|1}\) and \(\pi_{1|1}+ \pi_{2|1}=1\). Similarly, we can find conditional probabilities, given row 2.

In social sciences and epidemiology, these are sometimes referred to as "risk" values. That is, we may refer to the probability that a person gets a cold, given that he/she took a placebo pill, as the risk of such an event. Furthermore, for diagnostic tests, the conditional probability that the diagnostic test is positive, given that the subject has a disease, is called sensitivity. The conditional probability that the diagnostic test is negative, given that the subject does NOT have a disease, is called specificity.

Finally, because \(\delta\) is a function only of the parameters of \(P(Z | Y )\), likelihood-based inferences will be the same, regardless if calculations assume Poisson sampling or multinomial sampling.

Point Estimation for \(\delta\)

The natural estimate of \(\delta\) is the difference in the conditional sample proportions:

\(d=\dfrac{n_{11}}{n_{1+}}-\dfrac{n_{21}}{n_{2+}}\)

  • It takes values between \(-1\) and \(+1\),
  • If variables are independent, the difference in the proportions equals 0.

This is the maximum-likelihood estimate (MLE) because under product-multinomial sampling, the numerators are independent binomials:

\(n_{11} \approx Bin( n_{1+}, \dfrac{\pi_{11}}{\pi_{1+}} )\)
\(n_{21} \approx Bin( n_{2+},\dfrac{\pi_{21}}{\pi_{2+}} )\)

For the Vitamin C example, the estimated or sample difference of proportions of getting a cold is \(d=17/139 - 31/140 = 0.12 - 0.22 = -0.10\).

  Cold No Cold Totals
Placebo 0.22 0.78 1
Ascorbic Acid 0.12 0.88 1

 Question: Is this difference "big" or "small"?

Confidence interval for \(\delta\)

If \(n_{1+}\) and \(n_{2+}\) are large, the estimate \(d\) is approximately normal with variance

\(\displaystyle V(d)=\left[\frac{ \frac{\pi_{11}}{\pi_{1+}} (1-\frac{\pi_{11}}{\pi_{1+}})} {n_{1+}} + \frac{\frac{\pi_{21}}{\pi_{2+}} (1-\frac{\pi_{21}}{\pi_{2+}})} {n_{2+}} \right] \)

This expression follows from the fact that if \(X_1\) and \(X_2\) are independent random variables, then \(Var(X_1 - X_2) = Var(X_1) + Var(X_2)\). Here \(X_1\) and \(X_2\) are the two sample proportions, each based on an independent binomial count.

Plugging in the estimates, we get the estimate of the variance

\(\hat{V}(d)=\dfrac{\frac{n_{11}}{n_{1+}}\left(1-\frac{n_{11}}{n_{1+}}\right)}{n_{1+}}+\dfrac{\frac{n_{21}}{n_{2+}}\left(1-\frac{n_{21}}{n_{2+}}\right)}{n_{2+}}\)
which is used for computing the standard errors and confidence intervals.

A large-sample \((1- \alpha) 100\%\) CI for the Vitamin C example (with \(\alpha = 0.05\)) is

\(SE(d)=\sqrt{\hat{V}(d)}=\sqrt{\dfrac{0.12\times 0.88}{139}+\dfrac{0.22\times 0.78}{140}}=0.045\)
\(\Rightarrow -0.10\pm 1.96\times 0.045=(-0.19,-0.01)\)


We are 95% confident that the true difference in proportions of people getting a cold, between the vitamin C and the placebo groups, is somewhere between 0.01 and 0.19 in absolute value; that is, the placebo group's risk of a cold is between 1 and 19 percentage points higher. Note that the value of 0 (corresponding to no difference) does not fall within that interval.
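The estimate and interval above can be reproduced from the raw counts; here is a minimal Python sketch (an illustration alongside the lesson's R and SAS code):

```python
import math

# Vitamin C counts: colds out of group totals
n11, n1p = 17, 139   # ascorbic acid group: colds, group size
n21, n2p = 31, 140   # placebo group: colds, group size

p1 = n11 / n1p            # estimated P(cold | ascorbic acid)
p2 = n21 / n2p            # estimated P(cold | placebo)
d = p1 - p2               # sample difference in proportions

# estimated variance of d and the large-sample 95% interval d +/- 1.96 SE
se = math.sqrt(p1 * (1 - p1) / n1p + p2 * (1 - p2) / n2p)
ci = (d - 1.96 * se, d + 1.96 * se)
# d is about -0.10 with CI about (-0.19, -0.01), as in the text
```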

Hypothesis testing for \(\delta\)

Under the null hypothesis of no difference, \(H_0: \delta = 0\), the rows of the table can be pooled to get an estimate of the common proportion, \( P(Z = 1 | Y = 1) = P(Z = 1 | Y = 2)\).

The pooled estimate is

\(p=\dfrac{n_{11}+n_{21}}{n_{1+}+n_{2+}}\)

Under \(H_0: \delta = 0\), a more efficient estimate of the variance \(V (d)\) is

\(\hat{V}_0(d)=p(1-p)\left(\dfrac{1}{n_{1+}}+\dfrac{1}{n_{2+}}\right)\)

and the test statistic

\(z=\dfrac{d}{\sqrt{\hat{V}_0(d)}}\)

is approximately distributed as \(N(0,1)\). Many elementary textbooks in statistics use this test to compare two proportions. For the Vitamin C example,

\(z=\dfrac{31/140-17/139}{\sqrt{\frac{48}{279}\times \frac{231}{279}\times (\frac{1}{140}+\frac{1}{139})}}=2.19\)

This value is significant at the 0.05 level, so if the data are taken at face value, we could conclude that the proportion of colds in the vitamin C group is smaller than in the placebo group. The result is consistent with the confidence interval inference.
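The pooled test above can be sketched in a few lines of Python (an illustration, not the lesson's own code):

```python
import math

# Vitamin C counts: colds out of group totals
n11, n1p = 31, 140   # placebo
n21, n2p = 17, 139   # ascorbic acid

d = n11 / n1p - n21 / n2p             # difference in proportions
p = (n11 + n21) / (n1p + n2p)         # pooled proportion under H0: delta = 0

# z = d / sqrt( p(1-p)(1/n1+ + 1/n2+) ), approximately N(0,1) under H0
z = d / math.sqrt(p * (1 - p) * (1 / n1p + 1 / n2p))
# z is about 2.19, as in the text
```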

For computation in SAS, see VitaminC.sas, Vitamin C SAS Output. The analysis can be done with PROC FREQ, using options MEASURES or RISKDIFF. Compare the values of the above calculations to the relevant SAS output under the heading "Statistics for Table of Treatment and Response". Notice that \(d=-0.099\); we just rounded the value to \(-0.10\) in our calculations.

For computation in R, see VitaminC.R. We can use, among others, either prop.test() or binom.test() to carry out these calculations.

Equivalence to statistical independence test

It is useful to note that the null hypothesis \(H_0:\delta = 0\) is equivalent to independence. Using conditional probabilities, this corresponds to

\(\pi_{j|1} = \pi_{j|2},\quad j=1,2\)

Thus, we can test \(\delta = 0\) by the usual \(X^2\) or \(G^2\) test for independence in a \(2\times 2\) table, already discussed in the previous sections. In fact, we can show that the \((\text{z-statistic})^2\) above is algebraically equal to \(X^2\). So a two-sided test based on comparing the z-statistic to a \(N(0, 1)\) distribution is identical to comparing \(X^2\) from the test of independence to a chi-squared distribution with one degree of freedom, \(\chi^2_1\). In the Vitamin C example, note that \(X^2 = 4.81 \approx (2.19)^2\).
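To see the equivalence numerically, the following Python sketch computes both \(z^2\) and \(X^2\) from the Vitamin C counts (an illustration; the cell counts are the ones used throughout this lesson):

```python
import math

# Vitamin C 2x2 table: rows placebo / ascorbic acid, cols cold / no cold
obs = [[31, 109], [17, 122]]
row = [sum(r) for r in obs]                      # 140, 139
col = [obs[0][j] + obs[1][j] for j in range(2)]  # 48, 231
n = sum(row)

# Pearson X^2 for the 2x2 table
X2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(2))

# pooled two-proportion z statistic
p = col[0] / n
z = (obs[0][0] / row[0] - obs[1][0] / row[1]) / math.sqrt(
    p * (1 - p) * (1 / row[0] + 1 / row[1]))
# z**2 equals X2 (up to floating-point error); both are about 4.81
```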

Even though the difference of two proportions is very easy to interpret, one problem with using \(\delta\) is that when \(Z = 1\) is a rare event, the individual probabilities \(P(Z = 1 | Y = 1)\) and \(P(Z = 1 | Y = 2)\) are both small, i.e., close to zero. The absolute value of \(\delta\) will be close to zero even when the effect is strong. In the following sections, we study two other common measures of association which compare the relative value of the proportions, rather than the absolute values.

3.5 - Relative Risk

3.5 - Relative Risk

Consider for example the yearly risk of cancer death to be .005 in one population and .01 in another population; most health officials would consider this difference to be important, even though \(\delta = .005\) is very small. This suggests the use of the relative risk of a "success" which is the ratio of the relevant conditional probabilities,

Relative Risk

\(\rho=\dfrac{P(Z=1|Y=1)}{P(Z=1|Y=2)}=\dfrac{\pi_{11}/\pi_{1+}}{\pi_{21}/\pi_{2+}}=\dfrac{\pi_{1|1}}{\pi_{1|2}}\)

Like \(\delta\), this is a function only of the parameters of \(P(Z | Y )\), so likelihood-based inferences about \(\rho\) will be the same under Poisson, multinomial, or product-multinomial (\(n_{i+}\)s fixed) sampling.

Estimation and hypothesis testing

The natural (maximum-likelihood) estimate of \(\rho\) uses the observed data:

\(r=\dfrac{n_{11}/n_{1+}}{n_{21}/n_{2+}}\)

Because \(\rho\), and thus \(r\), are non-negative, a normal approximation for \(\log r\) tends to work better than for \(r\) itself. So in this case, it makes sense to transform the data using the natural log (ln) which will allow us to use a less skewed distribution. The approximate variance of \(\log r\) is

\(V(\log r)\approx \dfrac{1-\pi_{11}/\pi_{1+}}{n_{1+}\pi_{11}/\pi_{1+}}+\dfrac{1-\pi_{21}/\pi_{2+}}{n_{2+}\pi_{21}/\pi_{2+}}\)

which is estimated by

\(\hat{V}(\log r)= \dfrac{1-n_{11}/n_{1+}}{n_{1+}(n_{11}/n_{1+})}+\dfrac{1-n_{21}/n_{2+}}{n_{2+}(n_{21}/n_{2+})} = \dfrac{1}{n_{11}}-\dfrac{1}{n_{1+}}+\dfrac{1}{n_{21}}-\dfrac{1}{n_{2+}}\)
An approximate 95% confidence interval for \(\log \rho\) is

\(\log r\pm 1.96\sqrt{\hat{V}(\log r)}\)

and the corresponding interval for \(\rho\) is found by exponentiating the endpoints.

Equivalence to statistical independence testing

Testing \(H_0 \colon \rho = 1\) is equivalent to testing for independence in a \(2 \times 2\) table. For our current example, \(r = \dfrac{17/139}{31/140}= 0.5523\) and has corresponding 95% CI \((0.3209, 0.9506)\). Thus, at the 0.05 level, we can reject the independence model and, more specifically, say that the skiers who took vitamin C are less likely (nearly two times less likely) to get a cold in comparison with the skiers who did not take vitamin C. Notice that we could have computed the reciprocal, \(1/0.5523 = 1.81\), which would lead to the same conclusion.
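These calculations can be reproduced directly; the following is a minimal Python sketch (an illustration alongside the lesson's SAS and R output):

```python
import math

# Vitamin C counts: colds out of group totals
n11, n1p = 17, 139   # ascorbic acid
n21, n2p = 31, 140   # placebo

r = (n11 / n1p) / (n21 / n2p)    # sample relative risk

# estimated variance of log r: 1/n11 - 1/n1+ + 1/n21 - 1/n2+
v = 1 / n11 - 1 / n1p + 1 / n21 - 1 / n2p
half = 1.96 * math.sqrt(v)

# 95% CI for rho: exponentiate the endpoints of log r +/- 1.96 SE
ci = (math.exp(math.log(r) - half), math.exp(math.log(r) + half))
# r is about 0.5523 with CI about (0.3209, 0.9506)
```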

Keep in mind that relative risk is a better measure of association than the difference in proportions when cell probabilities are close to 0 and 1, i.e., when they are in the tails of the probability distribution. Relative risk has far more application in medical literature and epidemiology than difference of proportions.

For the Vitamin C example, compare the above calculations to relevant SAS output under the heading "Statistics for Table of Treatment and Response: Estimates of Relative Risk". In the output, the row labeled "Relative Risk (Column 1)" gives the above estimate of the relative risk and its 95% CI.
Odds Ratio and Relative Risks
Statistic Value 95% Confidence Limits
Odds Ratio 0.4900 0.2569 0.9343
Relative Risk (Column 1) 0.5523 0.3209 0.9506
Relative Risk (Column 2) 1.1273 1.0120 1.2558
The corresponding R code is in the VitaminC.R file. Please also note that there are many other ways and readily available packages in R that can do the same thing.

3.6 - Odds Ratio

3.6 - Odds Ratio

This is perhaps the most commonly used measure of association. Later on, we will see this is a natural parameter for many of the log-linear and logistic models.

The odds are ratios of probabilities of "success" and "failure" for a given row, or a ratio of conditional probabilities of the same conditional distribution.

Odds of getting a cold versus not getting a cold, given that a person took a placebo:

\(odds_1=\dfrac{P(Z=1|Y=1)}{P(Z=2|Y=1)}=\dfrac{\pi_{1|1}}{\pi_{2|1}}\)

The second odds, given that ascorbic acid was taken, are

\(odds_2=\dfrac{P(Z=1|Y=2)}{P(Z=2|Y=2)}=\dfrac{\pi_{1|2}}{\pi_{2|2}}\)


Properties of odds

  • If odds equal to 1, "success" and "failure" are equally likely.
  • If odds > 1, then "success" is more likely than "failure".
  • If odds < 1, then "success" is less likely than "failure".
Odds Ratio

The odds ratio is the ratio of \(odds_1\) and \(odds_2\) (or vice versa):

\(\theta = \dfrac{P(Z=1|Y=1)/P(Z=2|Y=1)}{P(Z=1|Y=2)/P(Z=2|Y=2)} = \dfrac{\left(\pi_{11}/\pi_{1+}\right)/\left(\pi_{12}/\pi_{1+}\right)}{\left(\pi_{21}/\pi_{2+}\right)/\left(\pi_{22}/\pi_{2+}\right)} = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}\)

Clearly, \(\theta\) is a function of the parameters of \(P(Z | Y )\), so inferences about it should be the same under Poisson, multinomial, or product-multinomial (\(n_{i+}\)s fixed) sampling. But if we interchange the roles of \(Y\) and \(Z\), we still get

\(\theta = \dfrac{P(Y=1|Z=1)/P(Y=2|Z=1)}{P(Y=1|Z=2)/P(Y=2|Z=2)} = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}\)

so \(\theta\) can also be regarded as a function of the parameters of \(P(Y| Z )\). Therefore, the likelihood inferences will be the same if we regard the \(n_{+j}\)s as fixed.

Point estimate, CI and hypothesis test

The natural estimate of \(\theta\) is the sample cross-product ratio,

\(\hat{\theta}=\dfrac{n_{11}n_{22}}{n_{12}n_{21}}\)

The properties of \(\hat{\theta}\) are easily established under multinomial sampling, but the same properties will hold under Poisson or product-multinomial sampling with either the row totals or column totals (but not both) regarded as fixed.

As with the relative risk, the log-odds ratio \(\log\hat{\theta}\) has a better normal approximation than \(\hat{\theta}\) does. Therefore, we usually obtain a confidence interval on the log scale; please note again that log throughout this course is the natural log, that is, log base \(e\). The estimated variance of \(\log\hat{\theta}\) is easy to remember,

\(\hat{V}(\log\hat{\theta})=\dfrac{1}{n_{11}}+\dfrac{1}{n_{12}}+\dfrac{1}{n_{21}}+\dfrac{1}{n_{22}}\)

and we get a 95% confidence interval for \(\theta\) by exponentiating the endpoints of

\(\log\hat{\theta} \pm 1.96\sqrt{\dfrac{1}{n_{11}}+\dfrac{1}{n_{12}}+\dfrac{1}{n_{21}}+\dfrac{1}{n_{22}}}\)

Note! Testing \(H_0 \colon \theta= 1\) is equivalent to testing independence in \(2 \times 2\) tables.

For the Vitamin C example, the odds of "success" (i.e., getting a cold), given that a skier took vitamin C, are \(0.12/0.88 = 0.14\). The odds of "success" (i.e., getting a cold), given that a skier took a placebo pill, are \(0.22/0.78 = 0.28\).

The odds ratio is \(0.14/0.28 = 0.49\), and the 95% CI for \(\log\theta\) would be

\(\log(0.490)\pm 1.96 \sqrt{1/17+1/109+1/122+1/31}=-0.713\pm 0.645=(-1.359,-0.068)\)

Finally, exponentiating limits gives us the 95% CI for \(\theta\): (0.256, 0.934). Notice, that we could have also computed \(0.28/0.14=2.04=31(122)/(109(17))\), which is the inverse of the above value we computed: \(1/0.49=2.04\). For our example, \(\hat{\theta}=0.49\) means that

  • the odds of getting a cold given vitamin C are 0.49 times the odds of getting a cold given a placebo
  • the odds of getting a cold given a placebo are \(1/0.49 = 2.04\) times greater than the odds of getting a cold given vitamin C
  • getting a cold is less likely given vitamin C than given a placebo.
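As a numerical check of the estimate and interval above, here is a minimal Python sketch (illustrative; the lesson's own computations are in SAS and R):

```python
import math

# Vitamin C table with ascorbic acid as row 1, as in the SAS output:
#                 cold  no cold
n11, n12 = 17, 122    # ascorbic acid
n21, n22 = 31, 109    # placebo

theta = (n11 * n22) / (n12 * n21)    # sample cross-product (odds) ratio

# estimated variance of log(theta-hat): 1/n11 + 1/n12 + 1/n21 + 1/n22
half = 1.96 * math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
ci = (math.exp(math.log(theta) - half), math.exp(math.log(theta) + half))
# theta-hat is about 0.49 with 95% CI about (0.2569, 0.9343)
```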

For computation in SAS, for the Vitamin C example compare the above calculations to relevant SAS output under heading "Statistics for Table of treatment and response: Odds Ratio";

Odds Ratio
Odds Ratio 0.4900
Asymptotic Conf Limits  
95% Lower Conf Limit 0.2569
95% Upper Conf Limit 0.9343
Exact Conf Limits  
95% Lower Conf Limit 0.2407
95% Upper Conf Limit 0.9740
The best way to get all association measures is to use the option MEASURES in PROC FREQ. Working with the VitaminC.sas file, we can replace the following line

tables treatment*response/ chisq relrisk riskdiff expected;

with

tables treatment*response/ chisq measures expected;

The computation in R is available with the VitaminC.R file.

For more on the interpretation of odds and odds-ratios and their properties see below.  


Properties of Odds Ratios

If \(\theta = 3\), the odds of "success" in row 1 are 3 times greater than the odds of success in row 2; individuals in row 1 are more likely to have a "success" than those in row 2. If \(\theta = 0.3\), the odds of "success" in row 1 are 0.3 times the odds in row 2; equivalently, the odds of "success" in row 2 are \(1/0.3 = 3.33\) times the odds in row 1.

The relationship between odds and probabilities can be expressed as

\(odds_i=\dfrac{\pi_{1|i}}{1-\pi_{1|i}}\quad\text{and}\quad \pi_{1|i}=\dfrac{odds_i}{1+odds_i}\)

If the variables are independent, then \(\pi_{1|1} = \pi_{1|2}\), \(odds_1 = odds_2\), and

\(\theta=1\)

If the variables are not independent such that  \(\pi_{1|1} > \pi_{1|2}\), then \(odds_1 > odds_2\), and

\( 1<\theta\)

If the variables are not independent such that \(\pi_{1|1} < \pi_{1|2}\), then \(odds_1< odds_2\), and

\( 0 < \theta < 1\)

If both \(\pi_{1|1}\) and \(\pi_{1|2}\) are small in the population, then the odds ratio and relative risk will be close since \(\frac{1-\pi_{1|1}}{1-\pi_{1|2}}\) will be close to 1. The odds ratio \(\theta\) does NOT depend on the marginal distribution of either variable. If the categories of both variables are interchanged, the value of \(\theta\) does not change. If the categories of one variable are switched, the odds ratio in the new re-arranged table will equal \(1/\theta\).

Finally, note that the sample odds ratio will equal zero or \(\infty\) if any \(n_{ij}=0\). Some authors suggest adding \(1/2\) to each cell count and then recalculating the sample odds ratio and its standard error to avoid this issue.

3.7 - Prospective and Retrospective Studies

3.7 - Prospective and Retrospective Studies

In epidemiology, three different types of studies are commonly done depending on whether the disease condition is first fixed and then the possible causes (exposure to a risk factor) are assessed or whether exposed and unexposed individuals are followed until the disease is developed. We will introduce two of those here.

Suppose variable \(Z\) represents a condition (disease) that is relatively rare in a population (e.g. lymphoma), and we want to assess whether another characteristic or behavior \(Y\) (e.g. smoking) could be a risk factor for \(Z\).

The obvious way to study this is to follow a group of smokers \((Y = 1)\) and a group of nonsmokers \((Y = 2)\) over time, and see which ones eventually develop lymphoma \((Z = 1)\) and which do not \((Z = 2)\). This is called a prospective study. The exposed and unexposed groups are determined at the start of the study, and both groups are disease-free. While it makes logical sense in determining a significant relationship, it can be very

  • time-consuming (we have to wait for a long time for the problem condition to develop)
  • inefficient (we may need very large samples to obtain enough subjects with \(Z = 1\))

An alternative is the retrospective study, in which we first locate a group of subjects with lymphoma \((Z = 1)\) and identify which are smokers and which are not. Here a diseased group is determined first and is retrospectively assessed for exposure status. Then we locate another group of subjects who are in some sense "comparable" but who do not have lymphoma \((Z = 2)\) and identify which are smokers and which are not. In the retrospective study, we have "sampled on the outcome," choosing individuals on the basis of \(Z\) and then observing \(Y\).

The interchangeability of \(Y\) and \(Z\) means that the usual roles of "response" and "explanatory" variables can be reversed, which could be extremely useful for research. Because the odds ratio is invariant to exchanging \(Y\) and \(Z\), the odds ratio from a retrospective study should be about the same as the odds ratio from a prospective study in which we sampled individuals according to their \(Y\) values and collected information on \(Z\). A retrospective study provides no information about the overall incidence of \(Z\) in the population because the proportions of cases with \(Z = 1\) and \(Z = 2\) were decided by the investigator. However, it does provide consistent estimates of the odds ratio indicating the effect of \(Y\) on \(Z\).

Example: Lung Cancer

The table below is adapted from Doll and Hill (1950), where 709 lung cancer sufferers were matched with 709 individuals without lung cancer to serve as a control. This is an example of a retrospective, case-control study. A study is called case-control when "cases" or diseased subjects and "controls" or comparable non-diseased subjects are sampled from respective populations and then assessed on their risk-factor exposure status.

  Cancer Yes Cancer No Totals
Smoking Yes  688 650 1338
Smoking No 21 59 80
Totals 709 709 1418

Lung cancer is the natural response variable of interest, and we would like to condition on smoking to estimate conditional probabilities of cancer, given smoking status. But since the column totals (lung cancer frequencies) are fixed by design, each column, not each row, is a separate binomial distribution, and the sample conditional probabilities based on row totals do not reflect the corresponding population proportions.

In other words, with retrospective studies, the sample sizes are fixed in a way that's counter-intuitive to how we'd like to view the variables as explanatory and response. This also affects the interpretation of the relative risk because it's based on the same conditional probabilities. Fortunately, the odds ratio is numerically invariant, regardless of which totals (row or column) are fixed, which makes it an appropriate measure of association for both retrospective studies as well as prospective studies.

To see this invariance for the data above, we can calculate the sample odds ratio as

\(\displaystyle \hat{\theta}=\dfrac{688/650}{21/59}= \dfrac{688/21}{650/59}=2.97 \)

Thus, for this sample, the odds of lung cancer among smokers is 2.97 times the odds of lung cancer among non-smokers. Equivalently, the odds of smoking among those with lung cancer is 2.97 times the odds of smoking among those without lung cancer.
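The invariance can be verified numerically; this short Python sketch computes the odds ratio both ways from the Doll and Hill counts:

```python
# Doll and Hill (1950) counts
#                cancer yes  cancer no
n11, n12 = 688, 650   # smoking yes
n21, n22 = 21, 59     # smoking no

# odds ratio formed within rows (smoking) and within columns (cancer)
or_rows = (n11 / n12) / (n21 / n22)
or_cols = (n11 / n21) / (n12 / n22)
# both equal 688*59 / (650*21), about 2.97
```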

Source: R. Doll and A. B. Hill, Br. Med. J., 739--748, Sept. 30, 1950.

3.8 - Measures of Associations in \(I \times J\) tables

3.8 - Measures of Associations in \(I \times J\) tables

In the Coronary Heart Disease example, it is sensible to think of serum cholesterol as an explanatory variable and CHD as a response. Therefore, it would make sense to estimate the conditional probabilities of CHD within the four cholesterol groups. To do this, we simply divide each cell count \(n_{ij}\) by its column total \(n_{+j}\); the resulting proportion \(n_{ij}/n_{+j}\) is an estimate of \(P(Y = i |Z = j)\). To see this, note that

\(P(Y=i|Z=j)=\dfrac{P(Y=i,Z=j)}{P(Z=j)}=\dfrac{\pi_{ij}}{\pi_{+j}}\)

and is intuitively estimated by

\(\dfrac{n_{ij}/n}{n_{+j}/n}=\dfrac{n_{ij}}{n_{+j}}\)

These values correspond to "Col Pct" in the SAS output. In R, we need to calculate them based on the above formula, e.g., see HeartDisease.R. The result is shown below.

  0-199 200-219 220-259 260+
CHD 12/319 = .038 8/254 = .031 31/470 = .066 41/286 = .143
no CHD 307/319 = .962 246/254 = .969 439/470 = .934 245/286 = .857

The risk of CHD appears to be essentially constant for the groups with cholesterol levels between 0–199 and 200–219. Although the estimated probability drops from .038 to .031, this drop is not statistically significant. We can test this by doing a test for the difference in proportions or by doing a chi-square test of independence for the relevant \(2 \times 2\) sub-table:

  0-199 200-219
CHD 12/319 = .038 8/254 = .031
no CHD 307/319 = .962 246/254 = .969

The test yields a \(X^2 = 0.157\) with df=1, \(p\)-value = .69. For the other two groups, however, the risk of CHD is substantially higher. We can do similar tests for other sets of cells. In fact, any two levels of cholesterol may be compared and tested for association between CHD and cholesterol level.
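The sub-table test can be reproduced from scratch; here is a minimal Python sketch of the Pearson \(X^2\) calculation (illustrative only):

```python
# 2x2 sub-table: cholesterol groups 0-199 and 200-219
obs = [[12, 8], [307, 246]]
row = [sum(r) for r in obs]                      # 20, 553
col = [obs[0][j] + obs[1][j] for j in range(2)]  # 319, 254
n = sum(row)

# Pearson X^2 with df = (2-1)(2-1) = 1
X2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(2))
# X2 is about 0.157, as reported in the text
```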

Describing associations in \(I \times J\) tables

In a \(2 \times 2\) table, the relationship between the two binary variables could be summarized by a single number (e.g., odds ratio). For an \(I \times J\) table, the usual \(X^2\) or \(G^2\) test for independence has \((IJ − 1) − (I − 1) − (J − 1) = (I − 1)(J − 1)\) degrees of freedom. This means that, with \(I > 2\) or \(J > 2\), there are multiple dimensions to the manner in which the data can depart from independence. The direction and magnitude of the departure from the null hypothesis can no longer be summarized by a single number, but must be summarized by \((I −1)(J −1)\) numbers of (i) difference in proportions, and/or (ii) relative risk, and/or (iii) odds ratios.

In the Coronary Heart Disease study, for example, we could summarize the relationship between CHD and cholesterol level by a set of three relative risks:

  • 200–219 versus 0–199,
  • 220–259 versus 0–199, and
  • 260+ versus 0–199.

That is, we could estimate the risk of CHD at each cholesterol level relative to a common baseline. Or, we could use

  • 200–219 versus 0–199,
  • 220–259 versus 200–219, and
  • 260+ versus 220–259,

This estimates the risk of each category relative to the category immediately below. Other comparisons are also possible, but they may not make sense in interpreting the data.

Note! You can do this as an exercise by modifying the code in R or SAS.

Example: Smoking Behaviors

The table below classifies 5375 high school students according to the smoking behavior of the student \(Z\) and the smoking behavior of the student's parents \(Y\). We are interested in analyzing whether there is a relationship between the smoking behavior of the students and that of their parents.

How many parents smoke? Student smokes?
  Yes (Z = 1) No (Z = 2)
Both (Y = 1) 400 1380
One (Y = 2) 416 1823
Neither (Y = 3) 188 1168

The test for independence yields \(X^2 = 37.6\), and \(G^2 = 38.4\) with df = 2 (\(p\)-values are essentially zero), so we have decided that \(Y\) and \(Z\) are related. It is natural to think of \(Z\) in this example as a response and \(Y\) as a predictor, so we will discuss the conditional distribution of \(Z\) given \(Y\). Let \(\pi_i = P(Z = 1|Y = i)\), for \(i=1,2,3\). The estimates of these probabilities are

\(\hat{\pi}_1=\dfrac{400}{1780}=0.225,\quad \hat{\pi}_2=\dfrac{416}{2239}=0.186,\quad \hat{\pi}_3=\dfrac{188}{1356}=0.139\)

We can then compare these as risks associated with the parameters. The effect of \(Y\) on \(Z\) can be summarized with two differences. For example, we can calculate the increase in the probability of \(Z = 1\) as \(Y\) goes from 3 to 2, and as \(Y\) goes from 2 to 1:

\(\hat{\pi}_2-\hat{\pi}_3=0.186-0.139=0.047,\quad \hat{\pi}_1-\hat{\pi}_2=0.225-0.186=0.039\)

Alternatively, we may treat \(Y = 3\) as a baseline and calculate the increase in probability as we go from \(Y = 3\) to \(Y = 2\) and from \(Y = 3\) to \(Y = 1\):

\(\hat{\pi}_2-\hat{\pi}_3=0.186-0.139=0.047,\quad \hat{\pi}_1-\hat{\pi}_3=0.225-0.139=0.086\)

We may also express the effects as the sample odds ratios (e.g., look at any \(2\times 2\) table within this larger \(3 \times 2\) table):

\(\hat{\theta}_{23}=\dfrac{416\times 1168}{188\times 1823}=1.42\)

\(\hat{\theta}_{13}=\dfrac{400\times 1168}{188\times 1380}=1.80\)

The estimated value of 1.42 means that students with one smoking parent are estimated to be 42% more likely (on the odds scale) to smoke than students whose parents do not smoke (the last two rows of the table). The value of 1.80 means that students with two smoking parents are 80% more likely to smoke than students whose parents do not smoke (the first and the last rows of the table).
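These odds ratios are simple cross-products of the corresponding \(2\times 2\) sub-tables; here is a minimal Python sketch (the row labels are for readability only):

```python
# student smoking (yes, no) counts by parents' smoking behavior
tab = {"both": (400, 1380), "one": (416, 1823), "neither": (188, 1168)}

def odds_ratio(r1, r2):
    # cross-product ratio of the 2x2 sub-table formed by rows r1 and r2
    (a, b), (c, d) = tab[r1], tab[r2]
    return (a * d) / (b * c)

theta_23 = odds_ratio("one", "neither")    # about 1.42
theta_13 = odds_ratio("both", "neither")   # about 1.80
```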

In a \(3 \times 2\) table, the relationship between the two variables must be summarized with two differences in proportions, or two relative risks, or two odds ratios. More generally, describing the relationship between the two variables in an \(I \times J\) table requires \((I - 1)(J - 1)\) numbers. Many different odds ratios can be formed, depending on the size of the table, but the minimum number needed to fully describe the data is \((I - 1)(J - 1)\), which equals the degrees of freedom for testing independence. Which odds ratios are most meaningful to the researcher depends on the research question at hand.

Besides the point estimates, we can also test hypotheses about the odds ratios or compute confidence intervals. You could do the same for the relative risks or difference in proportions as we discussed in previous sections. To do this computationally in SAS and/or R, we need to analyze each \(2\times2\) sub-table separately. Basically, we treat each \(2\times2\) table as a "new" data set.

In SAS, the ALL option should give all possible measures; see smokeindep.sas (output: smokeindep SAS output). Depending on which SAS version you are using, the options may differ, e.g., RELRISK, RRC1, RRC2, etc., and some of them work only for \(2\times2\) tables. For the current list, see the SAS Support Documentation.

In R, see smokeindep.R (output, smokeindep.out). The {vcd} package has a number of useful functions, e.g., oddsratio(), assocstats(); the latter will give  \(X^2\), \(G^2\), and some other measures of associations, such as Cramer's V.

Statistical versus Practical Significance

In proposing measures of effect size, we need to realize that there is a difference between saying that an effect is statistically significant and saying that it is large.

A test statistic or p-value is a measure of the evidence against a null hypothesis, and this evidence depends on the sample size. An effect size, however, should not change if n is arbitrarily increased.

In some situations, there may be an artificial dependency of statistical significance on sample size. If the sample size is small and a large-sample goodness-of-fit statistic is computed, the \(p\)-value may not be the best statistic to rely upon because the large-sample theory will not hold. Alternatively, if the sample size is very large, you may obtain statistically significant results where there really should not be any. Also, recall the Type I and Type II errors of hypothesis testing.

The \(X^2\) and \(G^2\) test statistics are not appropriate measures of association between two variables. They are sufficient to test the null hypothesis, but not to describe the direction and magnitude of the association.

Here is a hypothetical example that will help to illustrate this point. First, consider the Vitamin C example again. The following table classifies a sample of French skiers by whether or not they got a cold, given a placebo or vitamin C (ascorbic acid).

                 Cold   No Cold   Totals
Placebo            31       109      140
Ascorbic Acid      17       122      139
Totals             48       231      279

The test for independence yields \(X^2 = 4.811\) with \(p\)-value = 0.0283. The conclusion here would be that there is strong evidence against the null hypothesis of independence. Now, let's suppose that we artificially inflate the sample size by multiplying each cell count by ten.

                 Cold   No Cold   Totals
Placebo           310      1090     1400
Ascorbic Acid     170      1220     1390
Totals            480      2310     2790

The cell proportions for this table are identical to those of the previous table; the relationship between the two binary variables appears to be exactly the same. Yet now the \(X^2\) statistic is \(10(4.811) = 48.11\). The new \(p\)-value is close to 0, so the evidence against independence is now VERY strong; this is not because the relationship between the two variables is any stronger, but merely because the sample size has gone up.

Warning: A large \(p\)-value is NOT strong evidence in favor of \(H_0\). A large \(p\)-value can occur if (1) \(H_0\) is indeed true, or (2) \(H_0\) is false, but the test has low power.

Now, let's suppose that we artificially deflate the sample size by dividing each cell count by ten, and account for rounding.

                 Cold   No Cold   Totals
Placebo             3        11       14
Ascorbic Acid       2        12       14
Totals              5        23       28

The cell proportions for this table remain nearly identical to those of the previous tables; the relationship between the two binary variables appears to be the same. Yet now the \(X^2\) statistic is 0.2435, and the new \(p\)-value is 0.6217, which gives little or no evidence against independence. It does not tell us, however, whether the weakness of association is due to (a) a weak relationship between the two variables or (b) the sample size being too small. Moreover, \(X^2\) says nothing about the direction of the possible effect, e.g., whether vitamin C takers are more likely or less likely to get sick than non-takers of vitamin C.
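To see this dependence on sample size concretely, the three tables above can be checked in a few lines of R; chisq.test() with correct = FALSE gives the uncorrected \(X^2\):

```r
# X^2 scales linearly with n while the cell proportions stay fixed
vitC  <- matrix(c(31, 17, 109, 122), nrow = 2)   # original table
big   <- 10 * vitC                               # each count multiplied by ten
small <- matrix(c(3, 2, 11, 12), nrow = 2)       # counts divided by ten, rounded

chisq.test(vitC, correct = FALSE)$statistic      # X^2 ~ 4.811
chisq.test(big,  correct = FALSE)$statistic      # X^2 ~ 48.11
# the deflated table has small expected counts, so R warns that the
# chi-squared approximation may be poor -- exactly the small-n issue above
suppressWarnings(chisq.test(small, correct = FALSE)$statistic)  # X^2 ~ 0.2435
```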


The above analysis was implemented in VitaminC.sas (output file VitaminC SAS Output). The corresponding R code file is VitaminC.R and is commented so that results similar to those in SAS can be obtained.

3.9 - Diagnostic Measures



Recall that residuals tell us how far apart the observed and expected values are for each cell under the assumed model; they tell us which cells drive the lack of fit. We can examine Pearson and standardized residuals calculated under the null model, just as we did for one-way tables.

Pearson Residual

The Pearson residual for cell \((i,j)\) in a two-way table is

\(r_{ij}=\dfrac{n_{ij}-\hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}\)

where \(\hat{\mu}_{ij}=n_{i+}n_{+j}/n\) is the estimated expected count under independence; the chi-squared statistic is then \(X^2=\sum_j\sum_i r^2_{ij}\).

The \(r_{ij}\) have an approximate normal distribution with mean 0, but their variances are not all equal! Typically their asymptotic variances are less than 1, and the average variance equals \((I-1)(J-1)/(\mbox{number of cells})\).
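As a quick numerical check of this definition, we can compute the Pearson residuals for the vitamin C table directly from \(r_{ij}=(n_{ij}-\hat{\mu}_{ij})/\sqrt{\hat{\mu}_{ij}}\) and confirm that their squares sum to \(X^2\); a minimal base-R sketch:

```r
vitC <- matrix(c(31, 17, 109, 122), nrow = 2)
mu   <- outer(rowSums(vitC), colSums(vitC)) / sum(vitC)  # expected counts under independence
r    <- (vitC - mu) / sqrt(mu)                           # Pearson residuals
sum(r^2)                                                 # X^2 ~ 4.811
```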

Standardized (adjusted) Pearson Residual

The standardized (adjusted) Pearson residual for cell \((i,j)\) in a two-way table is

\(r^{adj}_{ij}=\dfrac{n_{ij}-\hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}(1-p_{i+})(1-p_{+j})}}\)

where \(p_{i+}=n_{i+}/n\) and \(p_{+j}=n_{+j}/n\) are the marginal sample proportions.
A standardized Pearson residual has an approximate \(N(0,1)\) distribution; a value that exceeds 2 or 3 in absolute value, therefore, suggests a lack of fit. For the heart disease example data, the standardized residual in the \((2,1)\) cell exceeds this cut-off and suggests some lack of fit of the independence model. It's also important to keep in mind, however, that the more cells involved, the more likely we are to observe an extreme residual by chance, even if the independence model holds.

In SAS PROC FREQ, the DEVIATION option gives the raw residuals (i.e., the differences between the observed and expected values), and the CELLCHI2 option gives the squared Pearson residuals. Keep in mind that Pearson residuals are less variable than a standard normal variate, although if the product of the marginal sample probabilities in the denominator is approximately equal to 1, the adjusted Pearson residuals and the regular Pearson residuals are approximately equal.

The squared standardized Pearson residuals have approximately a chi-squared distribution with df = 1; thus, at the \(\alpha = 0.05\) level, a squared standardized Pearson residual greater than \(\chi^2(1, 0.05) = 3.84\) is considered significant (this can also be used as a very crude cut-off for the squared Pearson residuals). For other options in SAS, explore the SAS documentation on PROC FREQ. For our example, see HeartDisease SAS Output (part of the output is below).

Here are the results from the Coronary Heart Disease example:

[SAS PROC FREQ output: Table of CHD by serum cholesterol level (0-199, 200-219, 220-259, 260+), with cell chi-square, row percent, and column percent shown in each cell.]

Notice the values in the third row of the first, second, and fourth cells, e.g., 4.60, 5.22, and 22.70. These are squared Pearson residuals; they are much larger than 3.84 and seem to be driving the lack of independence. [As an exercise, compute the standardized Pearson residuals and see if your inference would change.]

Let's further investigate the dependence structures in this table.

In R, chisq.test(yourdata)$residuals gives the Pearson residuals. In our Heart Disease example, see result$residuals and the corresponding output in HeartDisease.out.

As a very crude cut-off for evaluating the Pearson residuals themselves, we can flag values that exceed 2 or 3 in absolute value. Keep in mind, however, that Pearson residuals are less variable than a standard normal variate, as noted above.
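In practice, both kinds of residuals are available from a single chisq.test() fit in R; for instance, with the vitamin C table used earlier in this lesson:

```r
vitC <- matrix(c(31, 17, 109, 122), nrow = 2)
fit  <- chisq.test(vitC, correct = FALSE)
fit$residuals          # Pearson residuals (observed - expected)/sqrt(expected)
fit$stdres             # standardized (adjusted) residuals, approximately N(0,1)
sum(fit$residuals^2)   # reproduces the X^2 statistic
```

For a \(2\times2\) table, all four standardized residuals are equal in absolute value to \(\sqrt{X^2}\), so they carry the same information as the overall test; with more cells, they pinpoint which cells depart from independence.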

Partitioned Tests

Besides looking at the residuals or the measures of association, another way to describe the effects is to form a sequence of smaller tables by combining or collapsing rows and/or columns in a meaningful way---in other words, by looking into specific smaller tables within the larger table.

Partitioning the chi-squared statistic uses the fact that a sum of independent chi-squared statistics is itself a chi-squared statistic, with degrees of freedom equal to the sum of the degrees of freedom of the individual statistics. The reason this works is that a multinomial distribution may be collapsed into multinomials and partitioned into product-multinomials. We start with a chi-squared statistic with df > 1 and break it down into parts such that each new statistic has df = 1. This partitioning helps to show which differences between subsets of categories drive a significant association for the whole table.

Typically (just as with odds ratios), for an \(I \times J\) table there will be \((I-1)\times(J-1)\) partitions. In our example, we have \((3-1)(2-1) = 2\) parts. Let's combine the first two rows:

                        Student smokes   Student doesn't
1–2 parents smoke                  816              3203
Neither parent smokes              188              1168

This table has \(X^2 = 27.7\), \(G^2 = 29.1\), \(p\)-value \(\approx 0\), and \(\hat{\theta}=1.58\). We estimate that a student is 58% more likely, on the odds scale, to smoke if he or she has at least one smoking parent.

We may now ask, "Among those students with at least one smoking parent, is there any difference between those with one smoking parent and those with two smoking parents?" Given that at least one parent smokes, is there any evidence that the other parent’s smoking affects the chances that the student will smoke?

To answer this, we discard the last row of the original table and look at the upper \(2 \times 2\) sub-table.

                     Student smokes   Student doesn't
Both parents smoke              400              1380
One parent smokes               416              1823

This table has \(X^2 = 9.3\), \(G^2 = 9.2\), \(p\)-value \(\approx .002\), and \(\hat{\theta}=1.27\). Given that at least one parent smokes, the fact that the other parent smokes does indeed raise the student’s probability of smoking; the effect, however, is not as large (\(\hat{\theta}=1.27\)) as it was in going from neither parent smoking to at least one parent smoking (\(\hat{\theta}=1.58\)).

Notice what happens if we add up the \(G^2\) values from these two \(2 \times 2\) tables:

\(29.1 + 9.2 = 38.3\)

The result is very close to 38.4, the value of \(G^2\) that we got for the full \(3 \times 2\) table. In fact, these two numbers should have come out exactly the same; the difference in the last decimal place was merely due to a rounding error. It is possible to show by theoretical results that partitions of \(G^2 \) do add up to the total \(G^2\). However, that is not true for the Pearson \(X^2\). The individual \(X^2\) values do not add up exactly to the overall \(X^2\), but they are pretty close:

\(27.7 + 9.3 = 37.0 \approx 37.6\).

When we analyze a \(3 \times 2\) table in this manner---by combining two rows into a single row, and then un-combining them again---we have partitioned the 2-df test for independence into two single-df tests. By breaking up the test for independence into a sequence of tests for smaller tables, we can often identify precisely how the categorical variables may or may not be related. Comparing these component statistics shows where the differences come from, which can be very helpful when, for instance, designing a future study and exploring which factors are important to include.
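This partition is easy to verify numerically. Here is a short R sketch that computes \(G^2\) from scratch for the full parent-smoking table and its two components; the helper function g2 is our own, not from any package:

```r
# G^2 (likelihood-ratio statistic) for independence in a two-way table of counts
g2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  2 * sum(tab * log(tab / expected))
}

smoke <- matrix(c(400, 416, 188, 1380, 1823, 1168), nrow = 3)  # full 3x2 table
part1 <- rbind(colSums(smoke[1:2, ]), smoke[3, ])  # >=1 smoking parent vs neither
part2 <- smoke[1:2, ]                              # both vs one, given >=1 smokes

g2(part1) + g2(part2)   # adds up exactly to g2(smoke), ~38.4
```

The equality holds exactly in this nested partitioning scheme, confirming that the 38.3 versus 38.4 discrepancy in the text is purely rounding.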

In practice, there are often many different ways to break up an \( I \times J\) table into a sequence of smaller tables. It is a good idea to do it in such a way that independence tests for each of the smaller tables pertain to a question that makes sense in the context of the individual problem, keeping the above rules in mind.

3.10 - Lesson 3 Summary


In this lesson, we focused on the analysis of two-way tables. Beginning with the \(2 \times 2\) case, we described the concept of independence for two discrete random variables and showed how to do the chi-squared test of independence. Then, we discussed three different measures of association, namely the difference in conditional proportions, the relative risk, and the odds ratio, and their relation to the test of independence. We also saw how residuals can be used to assess which cells, in particular, may have led to a significant result, and how to apply an "exact" version of the independence test to tables with very small cell counts.

The concepts of independence, association, and marginal and conditional probabilities are very important for the analysis of categorical data. We will encounter the same concepts in more complex contingency tables and throughout the course and its methodology for the analysis of categorical data. In the next lesson, we consider data that have a natural ordering and measures of association that take advantage of this ordering.
