14.3 - Measures of Association for Binary Variables

In the Woodyard Hammock example, the observer recorded how many individuals belong to each species at each site. However, other research methods might only record whether or not the species was present at a site. In sociological studies, we might look at traits that some people have and others do not. Typically, 1 signifies that the trait of interest is present and 0 signifies that it is absent.

For sample units i and j, consider the following contingency table of frequencies of 1-1, 1-0, 0-1, and 0-0 matches across the variables:

                  Unit j
Unit i      1        0        Total
1           a        b        a + b
0           c        d        c + d
Total       a + c    b + d    p = a + b + c + d

If we are comparing two subjects, subject i and subject j, then a is the number of variables present for both subjects. In the Woodyard Hammock example, this is the number of species found at both sites. Similarly, b is the number (of species) found in subject i but not subject j, c is the number found in subject j but not subject i, and d is the number found in neither subject.
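
The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name `match_counts` and the two presence/absence records are hypothetical, not taken from the Woodyard Hammock data.

```python
def match_counts(x_i, x_j):
    """Return (a, b, c, d) for two equal-length sequences of 0/1 values."""
    if len(x_i) != len(x_j):
        raise ValueError("both units must be scored on the same variables")
    a = b = c = d = 0
    for u, v in zip(x_i, x_j):
        if u == 1 and v == 1:
            a += 1          # 1-1 match: present in both units
        elif u == 1 and v == 0:
            b += 1          # present in unit i only
        elif u == 0 and v == 1:
            c += 1          # present in unit j only
        elif u == 0 and v == 0:
            d += 1          # 0-0 match: absent from both units
        else:
            raise ValueError("entries must be 0 or 1")
    return a, b, c, d


# Hypothetical presence/absence records for six species at two sites.
site_i = [1, 1, 0, 1, 0, 0]
site_j = [1, 0, 0, 1, 1, 0]

a, b, c, d = match_counts(site_i, site_j)
p = a + b + c + d   # grand total: number of variables (species)
print(a, b, c, d, p)
```

For these made-up records, two species occur at both sites, one at each site alone, and two at neither, so the grand total p equals the six species scored.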

From here we can calculate row totals, column totals, and a grand total.

Johnson and Wichern list the following Similarity Coefficients used for binary data:

Coefficient Rationale
\( \dfrac { a + d } { p }\) Equal weights for 1-1, 0-0 matches
\( \dfrac { 2 ( a + d ) } { 2 ( a + d ) + b + c }\) Double weights for 1-1, 0-0 matches
\( \dfrac { a + d } { a + d + 2 ( b + c ) }\) Double weights for unmatched pairs
\( \dfrac { a } { p }\) Proportion of 1-1 matches
\( \dfrac { a } { a + b + c }\) 0-0 matches are irrelevant
\( \dfrac { 2 a } { 2 a + b + c }\) 0-0 matches are irrelevant; double weights for 1-1 matches
\( \dfrac { a } { a + 2 ( b + c ) }\) 0-0 matches are irrelevant; double weights for unmatched pairs
\( \dfrac { a } { b + c }\) Ratio of 1-1 matches to mismatches
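
As a rough sketch, the eight coefficients can be computed directly from the four cell counts. The function name `similarity_coefficients`, the integer keys 1 through 8 (which simply follow the order of the list above), and the choice to return NaN when a denominator is zero are all assumptions made for this illustration, as are the example counts.

```python
def similarity_coefficients(a, b, c, d):
    """Return the eight coefficients, keyed 1..8 in the order listed above."""
    p = a + b + c + d

    def ratio(num, den):
        # Assumption: report NaN rather than raise when a ratio is undefined,
        # e.g. a / (b + c) for two identical lists, where b + c = 0.
        return num / den if den != 0 else float("nan")

    return {
        1: ratio(a + d, p),                        # equal weights for matches
        2: ratio(2 * (a + d), 2 * (a + d) + b + c),  # double weight, matches
        3: ratio(a + d, a + d + 2 * (b + c)),      # double weight, mismatches
        4: ratio(a, p),                            # proportion of 1-1 matches
        5: ratio(a, a + b + c),                    # 0-0 matches irrelevant
        6: ratio(2 * a, 2 * a + b + c),            # ... double weight for 1-1
        7: ratio(a, a + 2 * (b + c)),              # ... double weight, mismatches
        8: ratio(a, b + c),                        # 1-1 matches to mismatches
    }


# Hypothetical counts: a = 2, b = 1, c = 1, d = 2 (so p = 6).
for key, value in similarity_coefficients(2, 1, 1, 2).items():
    print(key, round(value, 3))
```

Note that the last coefficient is a ratio rather than a proportion, so unlike the others it is not bounded above by one.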

The first coefficient looks at the number of matches (1-1 or 0-0) and divides it by the total number of variables. If two sites have identical species lists, then this coefficient is equal to one because c = b = 0. The more species that are found at one and only one of the two sites, the smaller the value for this coefficient. If every species is found at exactly one of the two sites, so that no species is shared and none is absent from both, then this coefficient takes a value of zero because a = d = 0.

The remaining coefficients give different weights to matched (1-1 or 0-0) or mismatched (1-0 or 0-1) pairs. For example, the second coefficient gives matched pairs double the weight and thus emphasizes agreements in the species lists. In contrast, the third coefficient gives mismatched pairs double the weight, more strongly penalizing disagreements between the species lists. The fourth coefficient counts only 1-1 matches in the numerator, and the last four coefficients ignore species not found in either site altogether.
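
To see the effect of the weights, consider a hypothetical pair of sites (numbers chosen only for illustration) with a = 4, b = 2, c = 1, and d = 3, so that p = 10. The first three coefficients are then

\[ \dfrac{a + d}{p} = \dfrac{7}{10} = 0.7, \qquad \dfrac{2(a + d)}{2(a + d) + b + c} = \dfrac{14}{17} \approx 0.824, \qquad \dfrac{a + d}{a + d + 2(b + c)} = \dfrac{7}{13} \approx 0.538 \]

Doubling the weight on matches raises the similarity above the equal-weight value, while doubling the weight on mismatches lowers it. Dropping the 0-0 matches entirely, as in \( a / (a + b + c) \), gives \( 4/7 \approx 0.571 \) for the same counts.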

The choice of coefficient will have an impact on the results of the analysis. Coefficients may be selected based on theoretical considerations specific to the problem at hand, or so as to yield the most parsimonious description of the data. For the latter, the analysis may be repeated using several of these coefficients. The coefficient that yields the most easily interpreted results is selected.
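
One way to repeat the analysis under several coefficients is to compute the pairwise similarities for each candidate and compare them side by side. The sketch below does this for two of the coefficients listed earlier. The site names, the presence/absence records, and the helper names `counts` and `pairwise_similarity` are hypothetical and serve only to illustrate the idea.

```python
from itertools import combinations


def counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 sequences."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


# Two candidate coefficients, written as functions of the cell counts.
# Assumption: return NaN when a denominator is zero.
COEFFICIENTS = {
    "(a + d) / p": lambda a, b, c, d: (a + d) / (a + b + c + d),
    "a / (a + b + c)": lambda a, b, c, d: (
        a / (a + b + c) if (a + b + c) else float("nan")
    ),
}


def pairwise_similarity(sites, coefficient):
    """Map each unordered pair of site names to its similarity."""
    return {
        (s, t): coefficient(*counts(sites[s], sites[t]))
        for s, t in combinations(sorted(sites), 2)
    }


# Hypothetical presence/absence records for six species at three sites.
sites = {
    "A": [1, 1, 0, 1, 0, 0],
    "B": [1, 0, 0, 1, 1, 0],
    "C": [0, 0, 1, 0, 1, 1],
}

results = {
    label: pairwise_similarity(sites, f) for label, f in COEFFICIENTS.items()
}
for label, table in results.items():
    print(label)
    for pair, value in table.items():
        print("  ", pair, round(value, 3))
```

In this small example both coefficients rank the pairs in the same order, but the values differ, and with real data the ordering itself can change from one coefficient to another.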

The main thing is that you need some measure of association between your subjects before the analysis can proceed.  We will look next at methods of measuring distances between clusters.