In the Coronary Heart Disease example, it is sensible to treat serum cholesterol as an explanatory variable and CHD as a response. It therefore makes sense to estimate the conditional probabilities of CHD within the four cholesterol groups.

To do this, we simply divide each cell count \(n_{ij}\) by its column total \(n_{+j}\); the resulting proportion \(n_{ij}/n_{+j}\) is an estimate of \(P(Y=i|Z=j)\). Why?

Because \(P(Y=i|Z=j)=\dfrac{P(Y=i,Z=j)}{P(Z=j)}\) is naturally estimated by \(\dfrac{n_{ij}/n_{++}}{n_{+j}/n_{++}}=\dfrac{n_{ij}}{n_{+j}}\).

These values correspond to 'Col Pct' in the SAS output. In R, you need to calculate them from the above formula (see, e.g., HeartDisease.R). As mentioned earlier, you can also explore other R packages for computing various statistics and association measures, e.g., {vcd} and {epitools}.

The result is shown below.

          0-199             200-219           220-259           260+
CHD       12/319 = .038     8/254 = .031      31/470 = .066     41/286 = .143
no CHD    307/319 = .962    246/254 = .969    439/470 = .934    245/286 = .857
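
As a cross-check on the column percentages, here is a minimal Python sketch (not the course's HeartDisease.sas or HeartDisease.R code). The variable names are illustrative only; the counts are taken from the table above.

```python
# Observed counts from the CHD table; columns are the four cholesterol groups.
groups = ["0-199", "200-219", "220-259", "260+"]
counts = {
    "CHD": [12, 8, 31, 41],
    "no CHD": [307, 246, 439, 245],
}

# Column totals n_{+j}: sum over the rows within each cholesterol group.
col_totals = [sum(col) for col in zip(*counts.values())]

# Conditional proportions n_{ij} / n_{+j}, i.e. estimates of P(Y = i | Z = j).
col_pct = {
    row: [n / total for n, total in zip(cells, col_totals)]
    for row, cells in counts.items()
}

for row, props in col_pct.items():
    print(row, [round(p, 3) for p in props])
```

Within each column the two proportions sum to 1, which is a quick check that the conditioning is on the cholesterol group and not on CHD status.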

The risk of CHD appears to be essentially constant for the groups with cholesterol levels between 0–199 and 200–219. Although the estimated probability drops from .038 to .031, this drop is not statistically significant. We can test this by doing a test for the difference in proportions or by doing a chi-square test of independence for the relevant 2 × 2 sub-table:

          0-199             200-219
CHD       12/319 = .038     8/254 = .031
no CHD    307/319 = .962    246/254 = .969

The test yields \(X^2 = 0.157\) with df = 1 and a p-value of .69. For the other two groups, however, the risk of CHD is substantially higher. You can carry out similar tests for other sets of cells. In fact, any two levels of cholesterol may be compared and tested for association between CHD and cholesterol level.
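
The 2 × 2 test above can be sketched in plain Python as follows. This is an illustrative calculation, not the course code: it computes the Pearson statistic without a continuity correction, and it uses the fact that for df = 1 the chi-square upper-tail probability equals erfc(sqrt(x/2)), so no statistics library is assumed.

```python
import math

# 2 x 2 sub-table: rows are CHD / no CHD, columns are 0-199 / 200-219.
table = [[12, 8],
         [307, 246]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = (row total * column total) / n under independence.
x2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        x2 += (observed - expected) ** 2 / expected

# Upper-tail probability for a chi-square variable with df = 1 only.
p_value = math.erfc(math.sqrt(x2 / 2))

print(round(x2, 3), round(p_value, 2))
```

Changing the two columns in `table` to any other pair of cholesterol groups gives the corresponding pairwise test.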

Describing associations in I × J tables

In a 2 × 2 table, the relationship between the two binary variables could be summarized by a single number (e.g., odds ratio).

For an I × J table, the usual \(X^2\) or \(G^2\) test for independence has (IJ − 1) − (I − 1) − (J − 1) = (I − 1)(J − 1) degrees of freedom. This means that, with I > 2 or J > 2, there are multiple dimensions to the manner in which the data can depart from independence. The direction and magnitude of the departure from the null hypothesis can no longer be summarized by a single number, but must be summarized by (I − 1)(J − 1) numbers, which may be (i) differences in proportions, (ii) relative risks, and/or (iii) odds ratios.

In the Coronary Heart Disease study, for example, we could summarize the relationship between CHD and cholesterol level by a set of three relative risks:

  • 200–219 versus 0–199,
  • 220–259 versus 0–199, and
  • 260+ versus 0–199.

That is, we could estimate the risk of CHD at each cholesterol level relative to a common baseline.

Or, we could use

  • 200–219 versus 0–199,
  • 220–259 versus 200–219, and
  • 260+ versus 220–259,

estimating the risk of each category relative to the category immediately below. Other comparisons are also possible, but they may not make sense when interpreting the data.

You can do this as an exercise by modifying the HeartDisease.sas or HeartDisease.R code.
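
For comparison with your own SAS or R results, here is a rough Python sketch of the two sets of relative risks listed above. The names `rr_baseline` and `rr_adjacent` are hypothetical and chosen only for this illustration; the counts come from the CHD table.

```python
# CHD cases and group sizes for the four cholesterol groups.
groups = ["0-199", "200-219", "220-259", "260+"]
chd = [12, 8, 31, 41]
totals = [319, 254, 470, 286]

# Estimated risk of CHD within each cholesterol group.
risk = [c / t for c, t in zip(chd, totals)]

# Set 1: each group versus the common baseline 0-199.
rr_baseline = {groups[j]: risk[j] / risk[0] for j in range(1, len(groups))}

# Set 2: each group versus the group immediately below it.
rr_adjacent = {groups[j]: risk[j] / risk[j - 1] for j in range(1, len(groups))}

for name, rr in (("vs 0-199", rr_baseline), ("vs group below", rr_adjacent)):
    print(name, {g: round(v, 2) for g, v in rr.items()})
```

Either set contains three numbers, matching (I − 1)(J − 1) = 3 for this 2 × 4 table, and each set can be recovered from the other by multiplying or dividing the ratios.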

Now let's look at another example.

______________________________________________________________

Example - Smoking Behaviors

The table below classifies 5375 high school students according to the smoking behavior of the student (Z) and the smoking behavior of the student's parents (Y). We are interested in whether there is a relationship between the smoking behavior of the students and that of their parents. You can run smokeindep.sas (output, smokeindep.lst) or smokeindep.R (output, smokeindep.out).

How many parents smoke?    Student smokes?
                           Yes (Z = 1)    No (Z = 2)
Both (Y = 1)               400            1380
One (Y = 2)                416            1823
Neither (Y = 3)            188            1168
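
The test of independence for this table can also be worked out by hand. The following is a minimal Python sketch (not smokeindep.sas or smokeindep.R) that computes the Pearson and likelihood-ratio statistics. The p-value line assumes df = 2, for which the chi-square upper-tail probability has the closed form exp(−x/2); for other df a statistics library would be needed.

```python
import math

# Rows: both / one / neither parent smokes; columns: student smokes yes / no.
table = [[400, 1380],
         [416, 1823],
         [188, 1168]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

x2 = 0.0  # Pearson statistic
g2 = 0.0  # likelihood-ratio (deviance) statistic
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        x2 += (observed - expected) ** 2 / expected
        g2 += 2 * observed * math.log(observed / expected)

df = (len(table) - 1) * (len(table[0]) - 1)

# Closed-form upper tail, valid only because df == 2 here.
p_x2 = math.exp(-x2 / 2)
p_g2 = math.exp(-g2 / 2)

print(round(x2, 1), round(g2, 1), df)
```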

The test for independence yields \(X^2 = 37.6\) and \(G^2 = 38.4\) with 2 df (p-values are essentially zero), so we conclude that Y and Z are related. It is natural to think of Z in this example as a response and Y as a predictor, so we will discuss the conditional distribution of Z given Y. Let

\(\pi_1 = P(Z = 1|Y = 1)\),
\(\pi_2 = P(Z = 1|Y = 2)\),
\(\pi_3 = P(Z = 1|Y = 3)\).

The estimates of these probabilities are

\(\hat{\pi}_1=400/1780=0.225\)

\(\hat{\pi}_2=416/2239=0.186\)

\(\hat{\pi}_3=188/1356=0.139\)

These estimates can be compared as the estimated risks of smoking for students in the three parental groups. The effect of Y on Z can be summarized with two differences. For example, we can calculate the increase in the probability of Z = 1 as Y goes from 3 to 2, and as Y goes from 2 to 1:

\(\hat{d}_{23}=\hat{\pi}_2-\hat{\pi}_3=0.047\)

\(\hat{d}_{12}=\hat{\pi}_1-\hat{\pi}_2=0.039\)

Alternatively, we may treat Y = 3 as a baseline and calculate the increase in probability as we go from Y = 3 to Y = 2 and from Y = 3 to Y = 1: 

\(\hat{d}_{23}=\hat{\pi}_2-\hat{\pi}_3=0.047\)

\(\hat{d}_{13}=\hat{\pi}_1-\hat{\pi}_3=0.086\)
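
The estimates and differences above can be checked with a short Python sketch. The dictionary keys 1, 2, 3 stand for the levels of Y, and the names are illustrative assumptions for this example.

```python
# Students who smoke (Z = 1) and row totals, indexed by Y = 1, 2, 3.
smokers = {1: 400, 2: 416, 3: 188}
totals = {1: 1780, 2: 2239, 3: 1356}

# Estimated conditional probabilities pi_hat_y = P(Z = 1 | Y = y).
pi_hat = {y: smokers[y] / totals[y] for y in smokers}

# Successive differences: Y = 3 to Y = 2, then Y = 2 to Y = 1.
d_23 = pi_hat[2] - pi_hat[3]
d_12 = pi_hat[1] - pi_hat[2]

# Baseline difference: Y = 3 to Y = 1.
d_13 = pi_hat[1] - pi_hat[3]

print({y: round(p, 3) for y, p in pi_hat.items()})
print(round(d_23, 3), round(d_12, 3), round(d_13, 3))
```

Note that d_13 = d_12 + d_23, so either pair of differences carries the same information about the three proportions.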

We may also express the effects as sample odds ratios (e.g., look at any 2 × 2 table within this larger 3 × 2 table):

\(\hat{\theta}_{23}=\dfrac{416\times 1168}{188\times 1823}=1.42\)

\(\hat{\theta}_{13}=\dfrac{400\times 1168}{188\times 1380}=1.80\)

The estimated value of 1.42 means that the odds of smoking for students with one smoking parent are estimated to be 42% higher than the odds for students whose parents do not smoke (comparing the last two rows of the table). The value of 1.80 means that the odds of smoking for students with two smoking parents are estimated to be 80% higher than the odds for students whose parents do not smoke (comparing the first and the last rows of the table).
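
A small Python sketch of these odds ratios is given below. The helper `odds_ratio` is a hypothetical function written for this illustration, not part of any package mentioned in this lesson.

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# (student smokes, student does not smoke) for each parental group.
both = (400, 1380)
one = (416, 1823)
neither = (188, 1168)

theta_23 = odds_ratio(*one, *neither)   # one parent vs. neither
theta_13 = odds_ratio(*both, *neither)  # both parents vs. neither

# The remaining comparison (both vs. one) is determined by the other two.
theta_12 = odds_ratio(*both, *one)

print(round(theta_23, 2), round(theta_13, 2), round(theta_12, 2))
```

Here theta_12 equals theta_13 / theta_23, which illustrates why two odds ratios are enough to describe the association in a 3 × 2 table.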

In a 3 × 2 table, the relationship between the two variables must be summarized with two differences in proportions, two relative risks, or two odds ratios. More generally, describing the relationship between the two variables in an I × J table requires (I − 1)(J − 1) numbers. Depending on the size of the table, you can specify a large number of different odds ratios, yet a minimal set that suffices to describe the association consists of (I − 1)(J − 1) odds ratios. This number is the same as the df for testing independence. Which odds ratios are most meaningful to the researcher depends on the research question at hand.

Besides the point estimates, we can also test hypotheses about the odds ratios or compute confidence intervals. You could do the same for the relative risks or differences in proportions, as we discussed in previous sections. To do this computationally in SAS and/or R, you need to analyze each 2 × 2 sub-table separately; that is, treat each 2 × 2 table as a "new" data set.
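
As an illustration of treating each 2 × 2 sub-table as its own data set, the sketch below computes an approximate 95% confidence interval for each odds ratio. It assumes the usual large-sample Wald interval on the log scale, with standard error sqrt(1/a + 1/b + 1/c + 1/d); the function name `odds_ratio_ci` is hypothetical, and SAS or R may report slightly different intervals if they use another method.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald interval for the 2 x 2 table [[a, b], [c, d]].

    The interval is built on the log scale and then exponentiated;
    z = 1.96 gives an approximate 95% interval.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# Sub-table 1: one parent smokes vs. neither.
theta_23, lo_23, hi_23 = odds_ratio_ci(416, 1823, 188, 1168)

# Sub-table 2: both parents smoke vs. neither.
theta_13, lo_13, hi_13 = odds_ratio_ci(400, 1380, 188, 1168)

print(round(theta_23, 2), round(lo_23, 2), round(hi_23, 2))
print(round(theta_13, 2), round(lo_13, 2), round(hi_13, 2))
```

An interval that excludes 1 indicates that the corresponding sub-table shows an association at roughly the 5% level.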

Certain options/functions in SAS and R will give you all possible association measures for the table of interest, and there are many! However, we will limit our discussion to only a few in this class. For additional information on, for example, Gamma, Kendall's tau, Cramer's V, etc., you can read the textbook or other online sources, and post any questions regarding these measures on the Discussion board.

In SAS, the ALL option should give you all possible measures; see smokeindep.sas (output, smokeindep.lst). Depending on which SAS version you are using, the available options may differ (e.g., RELRISK, RRC1, RRC2), and some of them work only for 2 × 2 tables. For the current list see: https://support.sas.com/documentation/cdl/en/procstat/66703/HTML/default/viewer.htm#procstat_freq_syntax07.htm

In R, see smokeindep.R (output, smokeindep.out). The {vcd} package has a number of useful functions, e.g., oddsratio() and assocstats(); the latter will give you \(X^2\), \(G^2\), and some other measures of association such as Cramer's V. Again, you can also explore the {epitools} package.