Example 14-1: Woodyard Hammock Data
We illustrate the various methods of cluster analysis using ecological data from Woodyard Hammock, a beech-magnolia forest in northern Florida. The data are counts of the number of trees of each species at n = 72 sites. A total of 31 species were identified and counted; however, only the p = 13 most common species were retained. They are listed below:
carcar  Carpinus caroliniana  Ironwood 
corflo  Cornus florida  Dogwood 
faggra  Fagus grandifolia  Beech 
ileopa  Ilex opaca  Holly 
liqsty  Liquidambar styraciflua  Sweetgum 
maggra  Magnolia grandiflora  Magnolia 
nyssyl  Nyssa sylvatica  Blackgum 
ostvir  Ostrya virginiana  Hop Hornbeam 
oxyarb  Oxydendrum arboreum  Sourwood 
pingla  Pinus glabra  Spruce Pine 
quenig  Quercus nigra  Water Oak 
quemic  Quercus michauxii  Swamp Chestnut Oak 
symtin  Symplocos tinctoria  Horse Sugar 
The first column gives the six-letter code identifying the species, the second column gives its scientific name (Latin binomial), and the third column gives its common name. The most commonly found of these species were the beech and magnolia.
What is our objective with these data?
We hope to group sample sites into clusters that share similar species compositions, as determined by some measure of association. Two kinds of association measure are needed:

Measure of Association between Sample Units: We need some way to measure how similar two subjects or objects are to one another. This could be just about any type of measure of association. There is a lot of room for creativity here. However, SAS only allows Euclidean distance (defined later).

Measure of Association between Clusters: How similar are two clusters? There are dozens of techniques that can be used here.
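To make the two kinds of measure concrete, here is a minimal sketch in Python. It uses Euclidean distance between sites and, as one of the many possible between-cluster measures, the smallest pairwise distance (often called single linkage). The function names and the small count vectors are hypothetical and are not taken from the Woodyard Hammock data.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(cluster_a, cluster_b):
    """Between-cluster distance: smallest distance over all site pairs."""
    return min(euclidean(x, y) for x in cluster_a for y in cluster_b)

# Hypothetical counts of three species at four sites (illustration only).
site1 = [3, 0, 4]
site2 = [0, 0, 0]
site3 = [6, 8, 0]
site4 = [3, 0, 0]

d12 = euclidean(site1, site2)                                # between two sites
d_clusters = single_linkage([site1, site2], [site3, site4])  # between two clusters
```

Here d12 measures association between two sample units, while d_clusters measures association between two groups of sites.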
Many different approaches to the cluster analysis problem have been proposed. The approaches generally fall into three broad categories:

Hierarchical methods:
 In agglomerative hierarchical algorithms, we start by defining each data point as a cluster. Then, the two closest clusters are combined into a new cluster. In each subsequent step, two existing clusters are merged into a single cluster.
 In divisive hierarchical algorithms, we start by putting all data points into a single cluster. Then we divide this cluster into two clusters. At each subsequent step, we divide an existing cluster into two clusters.
Note 1: Agglomerative methods are used much more often than divisive methods.
Note 2: Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.
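The agglomerative procedure described above can be sketched in a few lines of Python. This is a naive illustration, assuming Euclidean distance between points and the smallest pairwise distance as the between-cluster measure; the function name agglomerate and the four example points are hypothetical.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(points):
    """Naive agglomerative clustering (single linkage, an assumed choice).

    Returns the merges in order as (members_a, members_b, distance),
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters.
        for a, b in combinations(range(len(clusters)), 2):
            d = min(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Replace the two clusters with their union.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical two-dimensional points (illustration only).
points = [[0, 0], [0, 1], [5, 5], [5, 7]]
merges = agglomerate(points)
```

With n points there are always n - 1 merges, and the recorded merge distances are what a tree diagram of the clustering would display.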

Nonhierarchical methods:
 In a nonhierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first “good” guess at seed points which form the initial centers of the clusters. Then data points are iteratively moved into different clusters until there is no sensible reassignment possible. The initial number of clusters (K) may be specified by the user or by the software algorithm.
 The most commonly used nonhierarchical method is MacQueen’s K-means method.
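The iterative reassignment idea can be sketched as follows. Note that this is a simple batch variant (assign every point to its nearest center, then recompute the centers), which is not necessarily identical to MacQueen's original updating scheme. The function name kmeans, the seed points, and the data are hypothetical.

```python
def kmeans(points, seeds, max_iter=100):
    """Batch K-means sketch: returns (cluster labels, cluster centers).

    `seeds` are the user-supplied initial centers, so K = len(seeds).
    """
    centers = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(len(centers)),
                key=lambda k: sum((a - c) ** 2 for a, c in zip(p, centers[k])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so stop
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical data and deliberately poor seeds (illustration only).
points = [[0, 0], [0, 2], [10, 10], [10, 12]]
labels, centers = kmeans(points, seeds=[[0, 0], [0, 2]])
```

Even though both seeds start in the same corner of the data, the reassignment step moves the points until the two natural groups are separated.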

Model-based methods:
 A model-based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function and possibly different values for parameters, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm provides a partition of the dataset that maximizes the likelihood function as defined by the mixture model. We won’t cover this method any further in this course unit.
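For reference, a mixture density of this kind can be written as below. The notation is an assumption introduced here for illustration: K subpopulations, mixing proportions \pi_k, and component densities f_k with parameters \theta_k (for a normal mixture, a mean vector \mu_k and covariance matrix \Sigma_k).

```latex
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, f_k(\mathbf{x};\, \boldsymbol{\theta}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\qquad
L = \prod_{i=1}^{n} f(\mathbf{x}_i)
```

Here L is the likelihood of the n observations under the mixture model.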