3.1 - Notation & Structure

When collecting data on two categorical variables, we can easily summarize the responses in the form of a table with the levels of one variable corresponding to the rows, the levels of the other variable corresponding to the columns, and the count of individuals answering accordingly in each cell. Specifically, we'll frequently use the following terms:

Two-way contingency table: A two-way contingency table is a cross-classification of observations by the levels of two discrete variables.

Cell: The cells of the table contain frequency counts.

Dimension: The dimension of the table is determined by the number of variables.

Size: The size of the table refers to the number of cells. For example, a 2-dimensional (2-way) table of size \(2\times2\) is a cross-classification of two discrete variables, each with two levels, and has a total of 4 cells.

Example: Therapeutic Value of Vitamin C

This is an example of a double-blind study investigating the therapeutic value of vitamin C (ascorbic acid) for treating common colds. The study was conducted during a two-week period on a sample of 280 French skiers, but one observation had to be dropped, leaving 279. There are two discrete variables, each having two levels, hence a \(2\times2\) two-way table.

Table 1: Incidence of Common Colds involving French Skiers (Pauling, 1971) as reported in Fienberg (1980).

	Cold	No Cold	Totals
Placebo	31	109	140
Ascorbic Acid	17	122	139
Totals	48	231	279

Each cell corresponds to one combination of levels of the two variables. For example, 31 skiers were given a placebo and contracted a cold, while 109 were given a placebo and did not. Here are the same data presented in a \(2\times2\) table using sample proportions (each count divided by the total of 279) instead.

Table 2: Incidence of Common Colds involving French Skiers, shown as sample proportions (Pauling, 1971) as reported in Fienberg (1980).

	Cold	No Cold	Totals
Placebo	0.111	0.391	0.502
Ascorbic Acid	0.061	0.437	0.498
Totals	0.172	0.828	1
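The proportions in Table 2 can be reproduced from the counts in Table 1. The following is a minimal sketch in plain Python; the variable names (`counts`, `proportions`) are illustrative choices, not notation from the text.

```python
# Counts from Table 1 (rows: placebo, ascorbic acid; columns: cold, no cold).
counts = [[31, 109],
          [17, 122]]

# Grand total of all four cells.
n = sum(sum(row) for row in counts)

# Divide every cell count by the grand total and round to three decimals.
proportions = [[round(cell / n, 3) for cell in row] for row in counts]

for label, row in zip(["Placebo", "Ascorbic Acid"], proportions):
    print(label, row)
```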

Here are some questions that we may ask regarding this data.

Is the probability that a member of the placebo group contracts a cold the same as the probability that a member of the ascorbic acid group contracts a cold?

Are the type of treatment and cold status associated or independent? Here, independence means that having a placebo or ascorbic acid has no relationship with having a cold or otherwise.

What are the odds of getting a cold for those taking ascorbic acid (vitamin C)?

Example: Coronary Heart Disease

This is an example of a \(2\times4\) table. The data below are taken from the Framingham longitudinal study of coronary heart disease (Cornfield, 1962). In this study, \(n=1329\) patients were classified by serum cholesterol level (mg/100 cc) and whether they had been diagnosed with coronary heart disease (CHD).

	0–199	200–219	220–259	260+	total
CHD	12	8	31	41	92
no CHD	307	246	439	245	1237
total	319	254	470	286	1329

One variable is binary and has two outcomes. The other variable has four levels - this is most likely a continuous variable, but it has been grouped into four intervals.

Stop and Think!

What would you call the variable Cholesterol: nominal, ordinal, or interval?

For the Coronary Heart Disease example, the variable along the top row of the table is the amount of total cholesterol. This was originally a continuous variable, but it has been broken up into intervals, so its levels are now intervals. Because there is a natural order and progression in those levels, it is also ordinal.

Is there any evidence of a relationship/association between cholesterol level and heart disease?
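A first, purely descriptive look at this question is the observed proportion of patients with CHD within each cholesterol group. The sketch below uses plain Python with illustrative names (`groups`, `chd_rate`); it summarizes the table and is not a formal test of association.

```python
# Counts from the table above; columns are the four cholesterol groups.
groups = ["0-199", "200-219", "220-259", "260+"]
chd = [12, 8, 31, 41]
no_chd = [307, 246, 439, 245]

# Observed proportion with CHD within each cholesterol group:
# (CHD count) / (column total).
chd_rate = {}
for g, a, b in zip(groups, chd, no_chd):
    chd_rate[g] = a / (a + b)

for g in groups:
    print(f"{g:>8}: {chd_rate[g]:.3f}")
```

If these proportions were roughly equal across groups, that would be in line with no association; how much they differ is what a formal analysis would assess.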

Example: Smoking

This is an example of a \(3\times2\) table. The table below classifies 5375 high school students according to the smoking behavior of the student \(Z\) and the smoking behavior of the student’s parents \(Y\).

	Student smokes?
How many parents smoke?	Yes (Z = 1)	No (Z = 2)
Both (Y = 1)	400	1380
One (Y = 2)	416	1823
Neither (Y = 3)	188	1168

Stop and Think!

Would you call the row variable (how many parents smoke) nominal or ordinal?

By default, the row variable is ordinal because its levels have a natural progression: the number of parents who smoke goes from neither to one to both. But if you are not interested in the ordinality, you can treat it as nominal.

Question: Is there a relationship between the smoking behavior of the students and that of their parents?

Suppose that we collect data on two binary variables, \(Y\) and \(Z\) (for example, "treatment" and "contracting cold"), for \(n\) sample units. Binary means that these variables take two possible values, say 1 (e.g., "cold") and 2 (e.g., "no cold").

\(Y\), taking possible values \(i = 1, \ldots, I\), where \(I = 2\),
\(Z\), taking possible values \(j = 1, \ldots, J\), where \(J = 2\).

The data then consist of \(n\) pairs,

\((y_1, z_1), (y_2, z_2), \ldots , (y_n, z_n)\)

which can be summarized in a frequency table.

Let \(n_{ij}\) be the number of subjects having the following characteristics \((Y = i, Z = j)\) (that is, the number of subjects falling into a particular cell of the two-way table, more specifically falling into the \(i\)th level of \(Y\) and the \(j\)th level of \(Z\)). The total sample size is \(\sum_{i=1}^I\sum_{j=1}^J n_{ij}=n\) . The levels of the first variable are represented by the index \(i\) and the levels of the second variable by index \(j\).
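As a small illustration of how the pairs \((y_k, z_k)\) are reduced to the counts \(n_{ij}\), here is a sketch in plain Python. The eight pairs are made-up data used only for illustration, and the names `pairs`, `cell_counts`, and `table` are hypothetical.

```python
from collections import Counter

# Hypothetical raw data: one (y, z) pair per sample unit, both coded 1 or 2.
pairs = [(1, 1), (1, 2), (2, 2), (1, 2), (2, 1), (2, 2), (1, 1), (2, 2)]

I, J = 2, 2                    # number of levels of Y and Z
cell_counts = Counter(pairs)   # maps (i, j) -> n_ij

# Arrange the counts in an I x J table; a cell with no observations gets 0.
table = [[cell_counts[(i, j)] for j in range(1, J + 1)]
         for i in range(1, I + 1)]

# The cell counts sum to the sample size n.
n = sum(sum(row) for row in table)
print(table, n)
```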

For the Vitamin C example data, \(n_{11} = 31\) means that in our sample we observed 31 individuals who took a placebo pill and got the cold. The counts may be arranged in a \(2\times2\) table:

	Z = 1	Z = 2
Y = 1	\(n_{11}\)	\(n_{12}\)
Y = 2	\(n_{21}\)	\(n_{22}\)

The total number of cells in the table is \(IJ\), which is 4 in this case; this should not be confused with the sample size \(n\). In some textbooks, the authors use \(x_{ij}\) instead of \(n_{ij}\).

The observed table \(x = (n_{11}, n_{12}, n_{21}, n_{22})\) is a summary of all \(n\) responses; for the Vitamin C example, these are the four counts out of 279 total individuals. We could display the contingency table \(x\) as a one-way table with four cells, but it is customary to display it as a two-dimensional table with separate row and column variables, as above. Let's look at some other important structural elements of such tables.

Marginal Totals

When a subscript in a cell count \(n_{ij}\) is replaced by a plus sign (+) or a dot (.), it will mean that we have taken the sum of the cell counts over that subscript.

The row totals are

\(n_{1+} = n_{11} + n_{12}\)
\(n_{2+} = n_{21} + n_{22}\)

the column totals are

\(n_{+1} = n_{11} + n_{21}\)
\(n_{+2} = n_{12} + n_{22}\)

and the grand total is \(n_{++} = n_{11} + n_{12} + n_{21} + n_{22} = n\).

These quantities are often called marginal totals, because they are conveniently placed in the margins of the table, like this:

	Z = 1	Z = 2	total
Y = 1	\(n_{11}\)	\(n_{12}\)	\(n_{1+}\)
Y = 2	\(n_{21}\)	\(n_{22}\)	\(n_{2+}\)
total	\(n_{+1}\)	\(n_{+2}\)	\(n_{++}\)

For example, the row marginal totals for the Vitamin C data are \(n_{1+} = 140\) and \(n_{2+} = 139\).
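The marginal totals for the Vitamin C data can be computed directly from the four cell counts. This is a minimal sketch in plain Python; the names `row_totals`, `col_totals`, and `grand_total` are illustrative stand-ins for \(n_{i+}\), \(n_{+j}\), and \(n_{++}\).

```python
# Vitamin C counts (rows: placebo, ascorbic acid; columns: cold, no cold).
counts = [[31, 109],
          [17, 122]]

row_totals = [sum(row) for row in counts]         # n_{1+}, n_{2+}
col_totals = [sum(col) for col in zip(*counts)]   # n_{+1}, n_{+2}
grand_total = sum(row_totals)                     # n_{++}

print(row_totals, col_totals, grand_total)
```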

Joint Distribution

If the sample units are randomly sampled from a large population, then the observed table \(x = (n_{11}, n_{12}, n_{21}, n_{22})\) will have a multinomial distribution with index \(n = n_{++}\) and a parameter vector

\(\boldsymbol{\pi} =(\pi_{11},\pi_{12},\pi_{21},\pi_{22}) =\{\pi_{ij}\}\)

where \(\pi_{ij} = P (Y = i, Z = j)\) is the probability that a randomly selected individual in the population of interest falls into the \((i, j)\)th cell of the contingency table, that is, into the \(i\)th level of \(Y\) and \(j\)th level of \(Z\).

	Z = 1	Z = 2	total
Y = 1	\(\pi_{11}\)	\(\pi_{12}\)	\(\pi_{1+}\)
Y = 2	\(\pi_{21}\)	\(\pi_{22}\)	\(\pi_{2+}\)
total	\(\pi_{+1}\)	\(\pi_{+2}\)	\(\pi_{++} = 1\)
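For reference, the multinomial distribution mentioned above assigns the following probability to an observed \(2\times2\) table. This is the standard multinomial probability function written in the notation of this section, with \(n = n_{11} + n_{12} + n_{21} + n_{22}\) treated as fixed and \(\pi_{11} + \pi_{12} + \pi_{21} + \pi_{22} = 1\):

\(P(n_{11}, n_{12}, n_{21}, n_{22}) = \dfrac{n!}{n_{11}!\, n_{12}!\, n_{21}!\, n_{22}!}\; \pi_{11}^{n_{11}}\, \pi_{12}^{n_{12}}\, \pi_{21}^{n_{21}}\, \pi_{22}^{n_{22}}\)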

For observed data, we may also use \(p\) instead of \(\hat{\pi}\) to represent a sample proportion. That is, \(p_{ij}=\frac{n_{ij}}{n}\) is the sample proportion of observations in the \((i, j)\)th cell.

Marginal and Conditional Distributions

If we sum the joint probabilities over one variable, we get the marginal distribution. For example, the probability distribution \(\{\pi_{i+}\}\) is the marginal distribution for \(Y\), where \(P(Y = 1) = \pi_{1+}\), \(P(Y = 2) = \pi_{2+}\), and \(\pi_{1+} + \pi_{2+} =1\). Then the observed marginal distribution of \(Y\) is \(\{p_{i+}\}\). For the Vitamin C data, the observed marginal distribution of type of treatment is

\(p_{1+}= \dfrac{n_{1+}}{n} = \dfrac{140}{279} = 0.502\) and \(p_{2+}= \dfrac{n_{2+}}{n} = \dfrac{139}{279} = 0.498\)

Think About it!

What is the marginal distribution of \(Z\) in the Vitamin C example? What is the observed marginal distribution of \(Z\) for the same example?

The conditional probability distribution is the probability distribution of one variable given the values of the other variable(s). For example, the conditional distribution of \(Z\), given \(Y = i\), is \(\pi_{j|i}=\frac{\pi_{ij}}{\pi_{i+}}\), such that \(\sum_j \pi_{j|i} = 1\). Intuitively, we're asking how the distribution of \(Z\) changes as the categories of \(Y\) change.

Here are the observed conditional probability distributions of \(Z\), given \(Y\).

	Z = 1	Z = 2	total
Y = 1	\(\dfrac{n_{11}}{n_{1+}}= p_{1|1}\)	\(\dfrac{n_{12}}{n_{1+}}= p_{2|1}\)	1
Y = 2	\(\dfrac{n_{21}}{n_{2+}}= p_{1|2}\)	\(\dfrac{n_{22}}{n_{2+}}= p_{2|2}\)	1

Let's see what this means for the distribution of cold, given treatment. There are two conditional probability distributions, depending on whether the treatment is "vitamin" or "placebo". For the subjects receiving the placebo treatment, we have \(P(\mbox{"yes"} | \mbox{"placebo"}) =31/140\) and \(P(\mbox{"no"} | \mbox{"placebo"}) =109/140\). Notice that these two values necessarily add to 1. Similarly, for the vitamin treatment, we have \(P(\mbox{"yes"} | \mbox{"vitamin"}) =17/139\) and \(P(\mbox{"no"} | \mbox{"vitamin"}) =122/139\).
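The same conditional proportions can be computed by dividing each cell count by its row total. The sketch below is in plain Python; the dictionary layout and the name `conditional` are illustrative choices.

```python
# Vitamin C counts, keyed by treatment and then by cold status.
counts = {
    "placebo": {"yes": 31, "no": 109},
    "vitamin": {"yes": 17, "no": 122},
}

# For each treatment (row), divide each count by the row total n_{i+}.
conditional = {}
for treatment, row in counts.items():
    row_total = sum(row.values())
    conditional[treatment] = {status: c / row_total for status, c in row.items()}

for treatment, dist in conditional.items():
    print(treatment, {status: round(p, 3) for status, p in dist.items()})
```

Each row of `conditional` sums to 1 (up to floating-point rounding), matching the "total" column in the table of conditional proportions.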

Stop and Think!

What is the observed conditional probability distribution of treatment, given cold for the Vitamin C data?

Notation extension to any \(I \times J\) table

For general \(Y\) and \(Z\), the counts are usually arranged in a two-way table:

\(\begin{array}{c|cccc|} & Z=1 & Z=2 & \cdots & Z=J \\ \hline Y=1 & n_{11} & n_{12} & \cdots & n_{1 J} \\ Y=2 & n_{21} & n_{22} & \cdots & n_{2 J} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Y=I & n_{I 1} & n_{I 2} & \cdots & n_{I J} \\ \hline \end{array} \)

The total number of cells is \(I\times J\), and the marginal totals are:

\(n_{i+}=\sum\limits_{j=1}^J n_{ij},\qquad i=1,\ldots,I \)

\(n_{+j}=\sum\limits_{i=1}^I n_{ij},\qquad j=1,\ldots,J \)

\(n_{++}=\sum\limits_{i=1}^I \sum\limits_{j=1}^J n_{ij}=n \)
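These sums work the same way for any table size. As a sketch, the hypothetical helper `marginal_totals` below (not a function from any library) computes them for an arbitrary \(I \times J\) table stored as a list of rows, and is applied here to the \(3\times2\) smoking table from earlier in this section.

```python
def marginal_totals(table):
    """Return (row totals, column totals, grand total) for an I x J table of counts."""
    if not table or any(len(row) != len(table[0]) for row in table):
        raise ValueError("table must be a non-empty rectangular list of rows")
    row_totals = [sum(row) for row in table]         # n_{i+}, i = 1, ..., I
    col_totals = [sum(col) for col in zip(*table)]   # n_{+j}, j = 1, ..., J
    return row_totals, col_totals, sum(row_totals)   # last entry is n_{++}

# Smoking example: rows are Y (both, one, neither), columns are Z (yes, no).
smoking = [[400, 1380],
           [416, 1823],
           [188, 1168]]

rows, cols, total = marginal_totals(smoking)
print(rows, cols, total)
```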

If the sample units are randomly selected from a large population, we can assume that the cell counts \(\left(n_{11}, \dots , n_{IJ}\right)\) have a multinomial distribution with index \(n_{++} = n\) and parameters

\(\pi = (\pi_{11}, \dots, \pi_{IJ})\)

This is the general multinomial model, and it is often called the saturated model because it contains the maximum number of unknown parameters. The vector \(\pi\) has \(I\times J\) elements, but because these elements must sum to one (this is a probability distribution), there are really \(I\times J - 1\) unknown parameters that we need to estimate.
