Example 14-1: Woodyard Hammock Data
We illustrate the various methods of cluster analysis using ecological data from Woodyard Hammock, a beech-magnolia forest in northern Florida. The data are counts of the number of trees of each species at n = 72 sites. A total of 31 species were identified and counted; however, only the p = 13 most common species were retained. They are listed below:
| Code | Scientific name | Common name |
|---|---|---|
| ostvir | Ostrya virginiana | Blue Beech |
| pingla | Pinus glabra | Spruce Pine |
| quenig | Quercus nigra | Water Oak |
| quemic | Quercus michauxii | Swamp Chestnut Oak |
| symtin | Symplocos tinctoria | Horse Sugar |
The first column gives the 6-letter code identifying the species, the second column gives its scientific name (Latin binomial), and the third column gives the common name for each species. The most commonly found of these species were the beech and magnolia.
What is our objective with these data?
We hope to group the sample sites into clusters that share similar species compositions, as determined by some measure of association. Two kinds of association measure are needed:
- Measure of Association between Sample Units: We need some way to measure how similar two subjects or objects are to one another. Almost any measure of association could be used, so there is a lot of room for creativity here. However, SAS only allows Euclidean distance (defined later).
- Measure of Association between Clusters: How similar are two clusters? There are dozens of techniques that can be used here.
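To make the two kinds of measure concrete, here is a minimal Python sketch. It computes the Euclidean distance between two sites and then uses the smallest pairwise distance as one possible between-cluster measure. The function names, the choice of the minimum-distance rule, and the counts are illustrative assumptions, not the Woodyard Hammock data or any SAS routine.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two sites' species-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def min_pair_distance(cluster_a, cluster_b):
    """One possible between-cluster measure (an assumption for illustration):
    the smallest distance between any site in cluster_a and any site in cluster_b."""
    return min(euclidean(x, y) for x in cluster_a for y in cluster_b)

# Hypothetical counts of three species at three sites (made-up numbers).
site_1 = [3, 0, 4]
site_2 = [0, 0, 0]
site_3 = [3, 0, 5]

print(euclidean(site_1, site_2))                       # 5.0
print(min_pair_distance([site_1, site_3], [site_2]))   # 5.0
```

Other between-cluster rules, such as the largest or the average pairwise distance, would change only the body of `min_pair_distance`.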
Many different approaches to the cluster analysis problem have been proposed. The approaches generally fall into three broad categories:
Hierarchical methods:
- In agglomerative hierarchical algorithms, we start by defining each data point as a cluster. Then, the two closest clusters are combined into a new cluster. In each subsequent step, two existing clusters are merged into a single cluster.
- In divisive hierarchical algorithms, we start by putting all data points into a single cluster. Then we divide this cluster into two clusters. At each subsequent step, we divide an existing cluster into two clusters.
Note 1: Agglomerative methods are used much more often than divisive methods.
Note 2: Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.
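The agglomerative procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses Euclidean distance between points, takes the smallest pairwise distance as the between-cluster measure, and stops when a requested number of clusters remains. The function names and the toy coordinates are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_distance(points, a, b):
    # Assumed between-cluster measure: smallest pairwise distance.
    return min(euclidean(points[i], points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Start with one cluster per point; repeatedly merge the two closest."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(points, clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge q into p
        del clusters[q]
    return clusters

# Toy two-dimensional "sites" (made-up numbers).
sites = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]]
result = agglomerate(sites, 3)
print(result)  # [[0, 1], [2, 3], [4]]
```

Each cluster is reported as a list of row indices into `sites`. Recording the merge distance at every step, rather than stopping at a fixed count, would give the information needed to draw a tree diagram of the merges.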
Non-hierarchical methods:
- In a non-hierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first “good” guess at seed points, which form the initial centers of the clusters. Data points are then iteratively moved between clusters until no further reassignment improves the partition. The number of clusters (K) may be specified by the user or chosen by the software.
- The most commonly used non-hierarchical method is MacQueen’s K-means method.
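The iterative reassignment idea can be sketched as follows. Note the assumptions: this is the common batch form of K-means (assign every point to its nearest center, then recompute all centers), which is not necessarily identical to MacQueen's original procedure; the seed points are supplied by the caller; and the function names and toy data are hypothetical.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def k_means(points, seeds, max_iter=100):
    """Batch K-means sketch: returns (labels, centers)."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest current center.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: stop
            break
        labels = new_labels
        # Recompute each center as the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean(members)
    return labels, centers

# Toy two-dimensional "sites" (made-up numbers), with deliberately poor seeds.
sites = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = k_means(sites, seeds=[sites[0], sites[1]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

In this run the second site starts as its own seed but is moved into the first cluster once the centers are updated, which is the reassignment step the text describes.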
Model-based methods:
- A model-based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function but possibly different parameter values, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm provides a partition of the dataset that maximizes the likelihood function defined by the mixture model. We won’t cover this method any further in this course unit.
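For reference, a mixture density of the kind described above is usually written as follows. The notation is an assumption introduced here, not taken from the course text: K subpopulations, mixing proportions \pi_k, component parameters \theta_k, and observations x_1, ..., x_n.

```latex
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, f_k(\mathbf{x}; \boldsymbol{\theta}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

\ell(\pi_1, \dots, \pi_K, \boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_K)
  = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, f_k(\mathbf{x}_i; \boldsymbol{\theta}_k) \right)
```

In the multivariate normal case, each f_k is a normal density with its own mean vector \mu_k and covariance matrix \Sigma_k, and \ell is the log-likelihood that the fitting algorithm seeks to maximize.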