Lesson 14: Cluster Analysis

Overview

Cluster analysis is a data exploration (mining) tool for dividing a multivariate dataset into “natural” clusters (groups). We use these methods to explore whether previously undefined clusters (groups) exist in the dataset. For instance, a marketing department may wish to use survey results to sort its customers into categories (perhaps those most likely to be receptive to buying a product, those most likely to be against buying a product, and so forth).

Cluster analysis is used when we believe that the sample units come from an unknown number of distinct populations or sub-populations, and there is no a priori definition of those populations. Our objective is to describe those populations using the observed data.

Until relatively recently, cluster analysis attracted very little interest. This has changed because of the interest in bioinformatics and genome research. We will use an ecological example in this lesson.

Objectives

Upon completion of this lesson, you should be able to:

  • Carry out cluster analysis using SAS or Minitab;
  • Use a dendrogram to partition the data into clusters of known composition;
  • Carry out post hoc analyses to describe differences among clusters.
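
To make these objectives concrete, here is a minimal sketch of the same workflow in Python rather than SAS or Minitab. It merges sample units step by step (the merge history is what a dendrogram displays), stops at a chosen number of clusters, and then summarizes each cluster by its variable means. The toy data, the function name `agglomerate`, and the choice of complete linkage with Euclidean distance are all assumptions made for illustration; they are not taken from the lesson.

```python
from itertools import combinations
from math import dist
from statistics import mean


def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Closeness is measured by complete linkage: the largest Euclidean
    distance between any member of one cluster and any member of the other.
    """
    clusters = [[i] for i in range(len(points))]  # start: one unit per cluster
    merges = []  # (members_a, members_b, distance): what a dendrogram draws

    def linkage(pair):
        a, b = pair
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters, merges


# Toy data invented for illustration: two variables on six sample units.
units = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]

clusters, merges = agglomerate(units, n_clusters=2)
clusters = sorted(sorted(c) for c in clusters)

# Post hoc description: the mean of each variable within each cluster.
centroids = [tuple(mean(units[i][k] for i in c) for k in range(2)) for c in clusters]

for members, centre in zip(clusters, centroids):
    print(members, [round(v, 2) for v in centre])
```

In practice the number of clusters is not known in advance; one would inspect the merge distances in `merges` (the heights in the dendrogram) and cut where they jump sharply.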