14.2 - Measures of Association for Continuous Variables

We use the standard notation that we have been using all along:

  • \(X_{ik}\) = Response for variable k in sample unit i (the number of individuals of species k at site i)
  • \(n\) = Number of sample units
  • \(p\) = Number of variables

Johnson and Wichern list four different measures of association (similarity) that are frequently used with continuous variables in cluster analysis; several other distances are based on similar concepts.

  • Euclidean Distance

    This is the most commonly used distance. In two dimensions, for instance, we can plot the observations in a scatter plot and simply measure the distance between each pair of points. More generally, we use the following equation:

    \(d(\mathbf{X_i, X_j}) = \sqrt{\sum\limits_{k=1}^{p}(X_{ik} - X_{jk})^2}\)

    This is the square root of the sum of the squared differences between the measurements for each variable. (This is the only distance available in SAS; Minitab offers others as well, such as Pearson and squared Euclidean distance.)

  • Minkowski Distance

    \(d(\mathbf{X_i, X_j}) = \left[\sum\limits_{k=1}^{p}|X_{ik}-X_{jk}|^m\right]^{1/m}\)

    Here the squared difference is replaced by the absolute difference raised to the power m, and the mth root is taken instead of the square root. Euclidean distance is the special case m = 2; taking m = 1 gives the city-block (Manhattan) distance.
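
As a quick sanity check, here is a minimal Python sketch of these distances (the site counts are made up for illustration); note that Minkowski distance with m = 2 reduces to Euclidean distance:

```python
def minkowski(x, y, m=2):
    """Minkowski distance between two observation vectors.

    m = 2 gives Euclidean distance; m = 1 gives the city-block distance.
    """
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1 / m)

# Hypothetical counts of p = 3 species at two sites (illustration only)
x_i = [1.0, 4.0, 2.0]
x_j = [3.0, 1.0, 2.0]

euclid = minkowski(x_i, x_j)        # sqrt(4 + 9 + 0) = sqrt(13)
city   = minkowski(x_i, x_j, m=1)   # 2 + 3 + 0 = 5
```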

Here are two other methods for measuring association:

  • Canberra Metric

    \(d(\mathbf{X_i, X_j}) = \sum\limits_{k=1}^{p}\frac{|X_{ik}-X_{jk}|}{X_{ik}+X_{jk}}\)

  • Czekanowski Coefficient

    \(d(\mathbf{X_i, X_j}) = 1- \frac{2\sum\limits_{k=1}^{p}\text{min}(X_{ik},X_{jk})}{\sum\limits_{k=1}^{p}(X_{ik}+X_{jk})}\)
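
A small Python sketch of both measures, again with made-up counts. (Skipping terms where \(X_{ik}+X_{jk}=0\) in the Canberra metric is a common convention for count data, not part of the formula above.)

```python
def canberra(x, y):
    # Skip terms where both counts are zero to avoid dividing by zero
    # (a common convention for count data).
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0)

def czekanowski(x, y):
    # One minus twice the sum of elementwise minima, over the sum of all counts.
    return 1 - 2 * sum(min(a, b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

x_i = [1.0, 4.0, 2.0]
x_j = [3.0, 1.0, 2.0]

d_can = canberra(x_i, x_j)      # 2/4 + 3/5 + 0/4 = 1.1
d_cze = czekanowski(x_i, x_j)   # 1 - 2(1 + 1 + 2)/13 = 5/13
```

Both measures give 0 when the two sites have identical counts, consistent with the identity property discussed later in this section.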

For each distance measure, similar subjects have smaller distances than dissimilar subjects; that is, more similar subjects are more strongly associated.

Or, if you like, you can invent your own measure! However, whatever you invent, the measure of association must satisfy the following properties:

  1. Symmetry

    \(d(\mathbf{X_i, X_j}) = d(\mathbf{X_j, X_i})\)

    i.e., the distance between subject one and subject two must be the same as the distance between subject two and subject one.
  2. Positivity

    \(d(\mathbf{X_i, X_j}) > 0\) if \(\mathbf{X_i} \ne \mathbf{X_j}\)

    ...the distances must be positive; negative distances are not allowed!
  3. Identity

    \(d(\mathbf{X_i, X_j}) = 0\) if \(\mathbf{X_i} = \mathbf{X_j}\)

    ...the distance between the subject and itself should be zero.
  4. Triangle inequality

    \(d(\mathbf{X_i, X_k}) \le d(\mathbf{X_i, X_j}) +d(\mathbf{X_j, X_k}) \)

    This follows from the geometric fact that the length of any one side of a triangle cannot exceed the sum of the lengths of the other two sides.
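
The four properties above can be verified numerically for any candidate measure. Here is a minimal Python sketch checking them for Euclidean distance over a few made-up sites:

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Three distinct hypothetical sites (p = 3 species counts each)
sites = [[1.0, 4.0, 2.0], [3.0, 1.0, 2.0], [0.0, 2.0, 5.0]]

for x_i, x_j, x_k in itertools.permutations(sites, 3):
    assert euclidean(x_i, x_j) == euclidean(x_j, x_i)   # 1. symmetry
    assert euclidean(x_i, x_j) > 0                      # 2. positivity (distinct sites)
    assert euclidean(x_i, x_i) == 0                     # 3. identity
    # 4. triangle inequality (small tolerance for floating-point error)
    assert euclidean(x_i, x_k) <= euclidean(x_i, x_j) + euclidean(x_j, x_k) + 1e-12
```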