3.2.3  Measures of Association in I × J Tables
In the Coronary Heart Disease example, it is sensible to think of serum cholesterol as an explanatory variable and CHD as a response. Therefore, it makes sense to estimate the conditional probabilities of CHD within the four cholesterol groups.
To do this, we simply divide each cell count n_{ij} by its column total n_{+j}; the resulting proportion n_{ij}/n_{+j} is an estimate of P(Y = i | Z = j). Why?
Because \(P(Y=i\mid Z=j)=\dfrac{P(Y=i,Z=j)}{P(Z=j)}\) is naturally estimated by \(\dfrac{n_{ij}/n_{++}}{n_{+j}/n_{++}}=\dfrac{n_{ij}}{n_{+j}}\).
These values correspond to 'Col Pct' in the SAS output. In R, you need to calculate them from the formula above (see, e.g., HeartDisease.R). As mentioned earlier, you can also explore other R packages for computing various statistics and association measures, e.g., {vcd} and {epitools}.
The result is shown below.
          0–199            200–219          220–259          260+
CHD       12/319 = .038    8/254 = .031     31/470 = .066    41/286 = .143
no CHD    307/319 = .962   246/254 = .969   439/470 = .934   245/286 = .857
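The column proportions in the table can be reproduced with a few lines of code. The following is a minimal sketch in Python rather than the course's HeartDisease.R or HeartDisease.sas; the variable names (counts, col_totals, col_props) are chosen here for illustration only.

```python
# Cell counts from the CHD table: rows are (CHD, no CHD),
# columns are the four cholesterol groups.
labels = ["0-199", "200-219", "220-259", "260+"]
counts = [
    [12, 8, 31, 41],       # CHD
    [307, 246, 439, 245],  # no CHD
]

# Column totals n_{+j}
col_totals = [sum(row[j] for row in counts) for j in range(len(labels))]

# Conditional proportions n_{ij} / n_{+j}, one list per row of the table
col_props = [[row[j] / col_totals[j] for j in range(len(labels))] for row in counts]

for label, p_chd in zip(labels, col_props[0]):
    print(f"{label}: estimated P(CHD | group) = {p_chd:.3f}")
```

Within each cholesterol group the two proportions sum to 1, which is a quick check that the division used column totals rather than row totals.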
The risk of CHD appears to be essentially constant across the two groups with cholesterol levels 0–199 and 200–219. Although the estimated probability drops from .038 to .031, this drop is not statistically significant. We can check this with a test for the difference in proportions or with a chi-square test of independence for the relevant 2 × 2 subtable:
          0–199            200–219
CHD       12/319 = .038    8/254 = .031
no CHD    307/319 = .962   246/254 = .969
The test yields X^{2} = 0.157 with df = 1, p-value = .69. For the other two groups, however, the risk of CHD is substantially higher. You can do similar tests for other sets of cells. In fact, any two levels of cholesterol may be compared and tested for association between CHD and cholesterol level.
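As a check on the reported statistic, here is a minimal Python sketch of the Pearson chi-square test for this 2 × 2 subtable. It assumes no continuity correction (which is what reproduces the value 0.157) and uses the fact that, for df = 1, the chi-square upper-tail probability equals erfc(sqrt(x/2)). The variable names are illustrative and are not taken from the course scripts.

```python
import math

# 2 x 2 subtable: rows (CHD, no CHD), columns (0-199, 200-219)
table = [[12, 8], [307, 246]]

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)

# Pearson X^2 = sum over cells of (observed - expected)^2 / expected,
# with expected counts computed under independence
x2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        x2 += (obs - expected) ** 2 / expected

# Upper-tail probability of a chi-square variable with df = 1
p_value = math.erfc(math.sqrt(x2 / 2))
print(f"X^2 = {x2:.3f}, p-value = {p_value:.2f}")
```

Swapping in any other pair of columns from the full table gives the corresponding test for those two cholesterol levels.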
Describing associations in I × J tables
In a 2 × 2 table, the relationship between the two binary variables could be summarized by a single number (e.g., odds ratio).
For an I × J table, the usual X^{2} or G^{2} test for independence has (IJ − 1) − (I − 1) − (J − 1) = (I − 1)(J − 1) degrees of freedom. This means that, with I > 2 or J > 2, there are multiple dimensions along which the data can depart from independence. The direction and magnitude of the departure from the null hypothesis can no longer be summarized by a single number, but must be summarized by (I − 1)(J − 1) numbers, such as (i) differences in proportions, (ii) relative risks, or (iii) odds ratios.
In the Coronary Heart Disease study, for example, we could summarize the relationship between CHD and cholesterol level by a set of three relative risks:
 200–219 versus 0–199,
 220–259 versus 0–199, and
 260+ versus 0–199.
That is, we could estimate the risk of CHD at each cholesterol level relative to a common baseline.
Or, we could use
 200–219 versus 0–199,
 220–259 versus 200–219, and
 260+ versus 220–259,
estimating the risk of each category relative to the category immediately below. Other comparisons are also possible, but they may not be meaningful for interpreting the data.
You can do this as an exercise by modifying HeartDisease.sas code or HeartDisease.R code.
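For orientation, the two sets of relative risks described above could be computed along the following lines. This is only a sketch in Python, not a replacement for the exercise in SAS or R, and the names rr_baseline and rr_adjacent are made up for illustration.

```python
labels = ["0-199", "200-219", "220-259", "260+"]
chd = [12, 8, 31, 41]          # CHD cases per cholesterol group
totals = [319, 254, 470, 286]  # column totals

# Estimated risk of CHD in each group
risk = [c / t for c, t in zip(chd, totals)]

# Each level versus the common baseline 0-199
rr_baseline = {labels[j]: risk[j] / risk[0] for j in range(1, len(labels))}

# Each level versus the level immediately below it
rr_adjacent = {labels[j]: risk[j] / risk[j - 1] for j in range(1, len(labels))}

for label in labels[1:]:
    print(f"{label}: vs baseline {rr_baseline[label]:.2f}, "
          f"vs level below {rr_adjacent[label]:.2f}")
```

The two sets carry the same information: multiplying the adjacent-level relative risks up to a given level recovers that level's relative risk against the baseline.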
Now let's look at another example.
______________________________________________________________
Example  Smoking Behaviors
The table below classifies 5375 high school students according to the smoking behavior of the student (Z) and the smoking behavior of the student's parents (Y). We are interested in whether the students' smoking behavior is related to that of their parents. You can run smokeindep.sas (output, smokeindep.lst) or smokeindep.R (output, smokeindep.out).
How many parents smoke?   Student smokes?
                          Yes (Z = 1)   No (Z = 2)
Both (Y = 1)              400           1380
One (Y = 2)               416           1823
Neither (Y = 3)           188           1168
The test for independence yields X^{2} = 37.6 and G^{2} = 38.4 with 2 df (p-values are essentially zero), so we conclude that Y and Z are related. It is natural to think of Z in this example as a response and Y as a predictor, so we will discuss the conditional distribution of Z given Y. Let
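Both statistics can be verified directly from the cell counts. The sketch below is in Python rather than smokeindep.sas or smokeindep.R, and its variable names are illustrative; it uses the closed form exp(-x/2) for the chi-square upper tail, which holds only for df = 2.

```python
import math

# Rows: parents who smoke (both, one, neither); columns: student smokes (yes, no)
table = [[400, 1380], [416, 1823], [188, 1168]]

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)

x2 = 0.0  # Pearson statistic
g2 = 0.0  # likelihood-ratio (deviance) statistic
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        x2 += (obs - expected) ** 2 / expected
        g2 += 2 * obs * math.log(obs / expected)

df = (len(table) - 1) * (len(table[0]) - 1)

# For df = 2 the chi-square upper-tail probability is exp(-x / 2)
p_x2 = math.exp(-x2 / 2)
p_g2 = math.exp(-g2 / 2)
print(f"X^2 = {x2:.1f}, G^2 = {g2:.1f}, df = {df}")
```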
$\pi_1 = P(Z = 1 \mid Y = 1)$,
$\pi_2 = P(Z = 1 \mid Y = 2)$,
$\pi_3 = P(Z = 1 \mid Y = 3)$.
The estimates of these probabilities are
\(\hat{\pi}_1=400/1780=0.225\)
\(\hat{\pi}_2=416/2239=0.186\)
\(\hat{\pi}_3=188/1356=0.139\)
These estimates can be compared as the risks of student smoking at each level of parental smoking. The effect of Y on Z can be summarized with two differences. For example, we can calculate the increase in the probability of Z = 1 as Y goes from 3 to 2, and as Y goes from 2 to 1:
\(\hat{d}_{23}=\hat{\pi}_2-\hat{\pi}_3=0.047\)
\(\hat{d}_{12}=\hat{\pi}_1-\hat{\pi}_2=0.039\)
Alternatively, we may treat Y = 3 as a baseline and calculate the increase in probability as we go from Y = 3 to Y = 2 and from Y = 3 to Y = 1:
\(\hat{d}_{23}=\hat{\pi}_2-\hat{\pi}_3=0.047\)
\(\hat{d}_{13}=\hat{\pi}_1-\hat{\pi}_3=0.086\)
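The estimated proportions and both sets of differences follow directly from the row counts. A minimal Python sketch, with variable names chosen here to mirror the notation above:

```python
# Student smokers and row totals by number of smoking parents
smokers = {"both": 400, "one": 416, "neither": 188}
totals = {"both": 1780, "one": 2239, "neither": 1356}

# Estimated P(Z = 1 | Y = y) for each level of Y
pi_hat = {y: smokers[y] / totals[y] for y in smokers}

# Adjacent-level differences
d_23 = pi_hat["one"] - pi_hat["neither"]
d_12 = pi_hat["both"] - pi_hat["one"]

# Difference from the baseline Y = 3 (neither parent smokes)
d_13 = pi_hat["both"] - pi_hat["neither"]

print(f"d_23 = {d_23:.3f}, d_12 = {d_12:.3f}, d_13 = {d_13:.3f}")
```

Since d_13 = d_12 + d_23, either pair of differences determines the other, so two numbers are enough in both parameterizations.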
We may also express the effects as sample odds ratios (e.g., look at any 2 × 2 table within this larger 3 × 2 table):
\(\hat{\theta}_{23}=\dfrac{416\times 1168}{188\times 1823}=1.42\)
\(\hat{\theta}_{13}=\dfrac{400\times 1168}{188\times 1380}=1.80\)
The estimated value of 1.42 means that the odds of smoking for students with one smoking parent are estimated to be 42% higher than the odds for students whose parents do not smoke (the last two rows of the table). The value of 1.80 means that the odds of smoking for students with two smoking parents are estimated to be 80% higher than the odds for students whose parents do not smoke (the first and the last rows of the table).
In a 3 × 2 table, the relationship between the two variables must be summarized with two differences in proportions or two relative risks or two odds ratios. More generally, describing the relationship between the two variables in an I × J table requires (I − 1)(J − 1) numbers. You can specify many different odds ratios depending on the size of the table, but a minimal set that fully describes the association contains (I − 1)(J − 1) of them. This number equals the degrees of freedom for the test of independence. Which odds ratios are most meaningful to the researcher depends on the research question at hand.
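The odds ratios above, and the point that (I − 1)(J − 1) = 2 of them suffice for the 3 × 2 table, can be illustrated with a short Python sketch. The helper odds_ratio() is hypothetical and written here for illustration; it is not the oddsratio() function from {vcd}.

```python
# Rows: parents who smoke (both, one, neither); columns: student smokes (yes, no)
table = {"both": (400, 1380), "one": (416, 1823), "neither": (188, 1168)}


def odds_ratio(row_a, row_b):
    """Sample odds ratio of smoking for row_a relative to row_b."""
    yes_a, no_a = table[row_a]
    yes_b, no_b = table[row_b]
    return (yes_a * no_b) / (yes_b * no_a)


theta_23 = odds_ratio("one", "neither")
theta_13 = odds_ratio("both", "neither")

# The remaining comparison (both versus one) is determined by the two above
theta_12 = odds_ratio("both", "one")

print(f"theta_23 = {theta_23:.2f}, theta_13 = {theta_13:.2f}, theta_12 = {theta_12:.2f}")
```

Because theta_12 = theta_13 / theta_23, the third odds ratio adds no new information once the other two are known.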
Besides the point estimates, we can also test hypotheses about the odds ratios or compute confidence intervals. You could do the same for the relative risks or differences in proportions, as we discussed in previous sections. To do this computationally in SAS and/or R, you need to analyze each 2 × 2 subtable separately. Basically, treat each 2 × 2 table as a "new" data set.
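As an illustration of treating one 2 × 2 subtable as its own data set, the sketch below computes an approximate 95% confidence interval for the odds ratio comparing students with one smoking parent to students with none. It assumes the usual large-sample (Wald) interval on the log-odds-ratio scale, with standard error sqrt(1/a + 1/b + 1/c + 1/d); a given SAS or R routine may use a different method and report slightly different limits.

```python
import math

# 2 x 2 subtable: rows (one parent smokes, neither), columns (student smokes: yes, no)
a, b = 416, 1823
c, d = 188, 1168

theta = (a * d) / (b * c)

# Large-sample standard error of log(theta)
se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = 1.96  # approximate 97.5th percentile of the standard normal

# Build the interval on the log scale, then transform back
lower = math.exp(math.log(theta) - z * se_log)
upper = math.exp(math.log(theta) + z * se_log)

print(f"theta = {theta:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 is consistent with an association between the two variables in that subtable.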
Certain options/functions in SAS and R will give you all possible association measures for the table of interest, and there are many! However, we will limit our discussion to only a few in this class. For additional information on, for example, Gamma, Kendall's tau, Cramer's V, etc., you can read the textbook and other online sources, and post any questions regarding these measures on the Discussion board.
In SAS, the ALL option should give you all possible measures; see smokeindep.sas (output, smokeindep.lst). Depending on which SAS version you are using, the available options may differ (e.g., RELRISK, RRC1, RRC2), and some of them work only for 2 × 2 tables. For the current list, see: https://support.sas.com/documentation/cdl/en/procstat/66703/HTML/default/viewer.htm#procstat_freq_syntax07.htm
In R, see smokeindep.R (output, smokeindep.out). The {vcd} package has a number of useful functions, e.g., oddsratio() and assocstats(); the latter will give you X^{2}, G^{2}, and some other measures of association such as Cramer's V. Again, you can also explore the {epitools} package.