14.1 - Example: Woodyard Hammock Data

Example 14-1: Woodyard Hammock Data

We illustrate the various methods of cluster analysis using ecological data from Woodyard Hammock, a beech-magnolia forest in northern Florida. The data are counts of the number of trees of each species at n = 72 sites. A total of 31 species were identified and counted; however, only the p = 13 most common species were retained. They are listed below:

Variable	Scientific Name	Common Name
carcar	Carpinus caroliniana	Ironwood
corflo	Cornus florida	Dogwood
faggra	Fagus grandifolia	Beech
ileopa	Ilex opaca	Holly
liqsty	Liquidambar styraciflua	Sweetgum
maggra	Magnolia grandiflora	Magnolia
nyssyl	Nyssa sylvatica	Blackgum
ostvir	Ostrya virginiana	Blue Beech
oxyarb	Oxydendrum arboreum	Sourwood
pingla	Pinus glabra	Spruce Pine
quenig	Quercus nigra	Water Oak
quemic	Quercus michauxii	Swamp Chestnut Oak
symtin	Symplocos tinctoria	Horse Sugar

The first column gives the six-letter code identifying each species, the second column gives its scientific name (Latin binomial), and the third column gives its common name. Beech and magnolia were the most commonly found of these species.

What is our objective with these data?

We hope to group sample sites into clusters that share similar species compositions, as determined by some measure of association. Two kinds of association must be measured:

- The measure of association between sample units: We need some way to measure how similar two subjects or objects are to one another. Just about any measure of association could be used, so there is a lot of room for creativity here. However, SAS only allows Euclidean distance (defined later).
- The measure of association between clusters: How similar are two clusters? There are dozens of techniques that can be used here.
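
As a preview of the Euclidean distance mentioned above, the following minimal Python sketch computes the distance between two sites from their species counts. The function name euclidean_distance and the counts are hypothetical and chosen only for illustration; they are not taken from the Woodyard Hammock data.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors of species counts."""
    if len(x) != len(y):
        raise ValueError("both sites must be measured on the same species")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical counts of three species at two sites (illustration only).
site_a = [3, 0, 4]
site_b = [0, 0, 0]
distance_ab = euclidean_distance(site_a, site_b)  # sqrt(3^2 + 0^2 + 4^2)
```

Smaller distances indicate sites with more similar species compositions; a site is at distance zero from itself.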

Many different approaches to the cluster analysis problem have been proposed. The approaches generally fall into three broad categories:

Hierarchical methods:
- In agglomerative hierarchical algorithms, we start by defining each data point as a cluster. Then, the two closest clusters are combined into a new cluster. In each subsequent step, two existing clusters are merged into a single cluster.
- In divisive hierarchical algorithms, we start by putting all data points into a single cluster. Then we divide this cluster into two clusters. At each subsequent step, we divide an existing cluster into two clusters.
Note 1: Agglomerative methods are used much more often than divisive methods.
Note 2: Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.
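
The agglomerative steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it measures the distance between two clusters by the smallest Euclidean distance between any pair of their members (one of the many possible choices), and the function names and toy points are hypothetical rather than part of the example data.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def closest_member_distance(cluster_a, cluster_b, points):
    """Assumed cluster distance: smallest distance between any two members."""
    return min(euclidean(points[i], points[j])
               for i in cluster_a for j in cluster_b)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []                                   # record of (cluster, cluster, distance)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = closest_member_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical two-dimensional points (illustration only).
toy_sites = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
clusters, merges = agglomerate(toy_sites, n_clusters=3)
```

Each pass through the loop performs exactly one merge, so starting from five single-point clusters and stopping at three requires two merges. Continuing until one cluster remains would trace out the full hierarchy.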
Non-hierarchical methods:
- In a non-hierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first “good” guess at seed points, which form the initial centers of the clusters. The data points are then iteratively moved between clusters until no further reassignment improves the partition. The number of clusters (K) may be specified by the user or chosen by the software algorithm.
- The most commonly used non-hierarchical method is MacQueen’s K-means method.
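
To make the iterative reassignment concrete, here is a minimal Python sketch of a K-means style procedure. It is a simplified batch version (assign every point to its nearest center, then recompute all centers) and is not claimed to reproduce MacQueen's exact formulation; the function names, seed choice, and toy points are assumptions made for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def k_means(points, seeds, max_iter=100):
    """Batch K-means sketch: one cluster per seed point."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest current center.
        new_labels = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so stop
        labels = new_labels
        # Recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean_vector(members)
    return labels, centers

# Hypothetical two-dimensional points (illustration only).
toy_sites = [[0, 0], [0, 2], [10, 10], [10, 12]]
labels, centers = k_means(toy_sites, seeds=[toy_sites[0], toy_sites[2]])
```

Here K is fixed by the number of seed points supplied. Different seeds can lead to different final partitions, which is why the initial guess matters.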
Model-based methods:
- A model-based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function and possibly different values for parameters, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm provides a partition of the dataset that maximizes the likelihood function as defined by the mixture model. We won’t cover this method any further in this course unit.
