# 14.9 - Defining Initial Clusters

Now that you have a good idea of what is going to happen, we need to return to our original question for this method: how should we define the initial clusters? Again, there are two main approaches.

## 1st Approach: Random assignment

The first approach is to assign the clusters randomly. This does not seem like it would be a very efficient approach. The main reason to take this approach would be to avoid any bias in this process.

## 2nd Approach: Leader Algorithm

The second approach is to use the Leader Algorithm (Hartigan, J. A., 1975, *Clustering Algorithms*), which involves the following procedure:

• Step 1: Select the first item from the list. This item forms the centroid of the initial cluster.
• Step 2: Search through the subsequent items until an item is found that is at least distance δ away from every previously defined cluster centroid. This item forms the centroid of the next cluster.
• Step 3: Repeat Step 2 until all K cluster centroids are obtained or no further items qualify.
• Step 4: Form the initial clusters by assigning each item to the nearest cluster centroid.
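The steps above can be sketched in Python. This is an illustrative sketch, not the SAS implementation; the toy data, k, and δ values are made up:

```python
import math

def leader_seeds(items, k, delta):
    """Select up to k cluster seeds: the first item, then each subsequent
    item at least `delta` away from every seed chosen so far
    (Hartigan's Leader Algorithm, Steps 1-3)."""
    seeds = [items[0]]                        # Step 1: first item is the first seed
    for item in items[1:]:                    # Step 2: scan the remaining items
        if len(seeds) == k:                   # Step 3: stop once k seeds are found
            break
        if all(math.dist(item, s) >= delta for s in seeds):
            seeds.append(item)                # far enough from every seed so far
    return seeds

def assign(items, seeds):
    """Step 4: assign each item to its nearest seed (the initial clusters)."""
    return [min(range(len(seeds)), key=lambda j: math.dist(x, seeds[j]))
            for x in items]

# Toy data: p = 2 variables, k = 2 clusters, delta = 3 (all hypothetical)
pts = [(0.0, 0.0), (0.5, 0.5), (4.0, 4.0), (4.5, 3.5)]
seeds = leader_seeds(pts, k=2, delta=3.0)
labels = assign(pts, seeds)
```

Here the second point is within δ of the first seed and is skipped, so the third point becomes the second seed; the remaining points are then assigned to their nearest seed.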

The following video illustrates this procedure for k = 4 clusters and p = 2 variables plotted in a scatter plot:

## Example 14-5: Woodyard Hammock Data (Initial Clusters)

Now, let's take a look at each of these options, in turn, using our Woodyard Hammock dataset.

We first must determine:

• The number of clusters K
• The radius $$δ$$ for the leader algorithm.

In some applications, the theory specific to the discipline may suggest reasonable values for K.  In general, however, there is no prior knowledge that can be applied to find K.  Our approach is to apply the following procedure for various values of K.  For each K, we obtain a description of the resulting clusters. The value of K is then selected to yield the most meaningful description. We wish to select K large enough so that the composition of the individual clusters is uniform, but not so large as to yield too complex a description for the resulting clusters.

Here, we shall take K = 4 and use the random assignment approach to find a reasonable value for $$δ$$.

This random approach is implemented in SAS with the program below.

```sas
options ls=78;
title "Cluster Analysis - Woodyard Hammock - K-Means";

/* After reading in the data, an ident variable is created as the
 * row number for each observation. This is needed for the clustering
 * algorithm. The drop statement removes several variables not used
 * for this analysis.
 */

data wood;
  infile 'D:\Statistics\STAT 505\data\wood.csv' firstobs=2 delimiter=',';
  input x y acerub carcar carcor cargla cercan corflo faggra frapen
        ileopa liqsty lirtul maggra magvir morrub nyssyl osmame ostvir
        oxyarb pingla pintae pruser quealb quehem quenig quemic queshu quevir
        symtin ulmala araspi cyrrac;
  ident=_n_;
  drop acerub carcor cargla cercan frapen lirtul magvir morrub osmame pintae
       pruser quealb quehem queshu quevir ulmala araspi cyrrac;
run;

/* The observations are sorted by their ident value. */

proc sort data=wood;
  by ident;
run;

/* PROC FASTCLUS is a non-hierarchical (k-means) procedure.
 * The maxclusters option sets the number of clusters used
 * throughout the algorithm. The replace option specifies how
 * seeds are replaced; replace=random selects the initial
 * seeds at random.
 */

proc fastclus data=wood maxclusters=4 replace=random;
  var carcar corflo faggra ileopa liqsty maggra nyssyl ostvir oxyarb
      pingla quenig quemic symtin;
  id ident;
run;
```

We use the fastclus procedure, which stands for fast cluster analysis. It is designed to produce results quickly, especially with very large datasets. Remember that, unlike the previous cluster analysis methods, this procedure does not produce a tree diagram.

We first specify the number of clusters to include; in this case, we ask for four. Then, setting replace=random indicates that the initial cluster centroids will be randomly selected from the study subjects (sites).
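As a sketch of what this procedure does with random seeds, here is a minimal k-means (Lloyd's algorithm) in Python. This is illustrative only, not PROC FASTCLUS itself, and the two-group toy data are made up:

```python
import math
import random

def kmeans(items, k, rng, max_iter=100):
    """k-means with random initial seeds: pick k items at random as the
    starting centroids (cf. replace=random), then alternate assignment
    and centroid updates until no reassignments occur."""
    centroids = rng.sample(items, k)          # random initial centroids
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: math.dist(x, centroids[j]))
               for x in items]
        if new == labels:                     # no reassignments: converged
            break
        labels = new
        for j in range(k):                    # recompute each centroid as
            members = [x for x, g in zip(items, labels) if g == j]
            if members:                       # the mean of its members
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

# Two well-separated toy groups; the recovered partition separates them
pts = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 11), (10, 11)]
labels, cents = kmeans(pts, k=2, rng=random.Random(1))
```

Because the seeds are random, different runs can start from different centroids, which is why the SAS output differs from run to run.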

## Perform a cluster analysis (k-means)

### To perform k-means cluster analysis:

1. Open the ‘wood’ data set in a new worksheet.
2. Stat > Multivariate > Cluster K-means
   1. Highlight and select the variables to use in the clustering. For this example, the following variables are selected (13 total): carcar corflo faggra ileopa liqsty maggra nyssyl ostvir oxyarb pingla quenig quemic symtin
   2. Choose 4 as the number of clusters.
   3. In Storage, enter c35 (or the name of any blank column in the worksheet) in Cluster membership column.
3. Choose OK, and OK again. In addition to the session window results, the cluster assignments are stored in the c35 column, which can be used for any subsequent ANOVA analysis.

When you run this program, you will generally get different results each time because a different random set of sites is selected as the initial seeds.

The first part of the output gives the initial cluster centers.  SAS picks four sites at random and lists how many species of each tree are at each site.

The procedure works iteratively until no reassignments can be obtained. The following table was copied from the SAS output for discussion purposes.

| Cluster | Maximum Point to Centroid Distance | Nearest Cluster | Distance to Closest Cluster |
|---------|-----------------------------------|-----------------|-----------------------------|
| 1       | 21.1973                           | 3               | 16.5910                     |
| 2       | 20.2998                           | 3               | 13.0501                     |
| 3       | 22.1861                           | 2               | 13.0501                     |
| 4       | 23.1866                           | 3               | 15.8186                     |

In this case, we see that cluster 3 is the nearest neighboring cluster to cluster 1 and the distance between those two clusters is 16.591.

To set the delta for the leader algorithm, we want to pay attention to the maximum distances between the cluster centroids and the furthest site in that cluster. We can see that all of the maximum distances exceed 20.  Based on these results, we set the radius $$δ = 20$$.
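The quantity used here, each cluster's maximum point-to-centroid distance, can be sketched in Python as follows (toy data with hypothetical cluster assignments, not the Woodyard Hammock values):

```python
import math

def max_point_to_centroid(items, labels, centroids):
    """For each cluster, the largest distance from its centroid to any
    of its member points (the 'Maximum Point to Centroid Distance')."""
    out = {}
    for j, c in enumerate(centroids):
        members = [x for x, g in zip(items, labels) if g == j]
        out[j] = max(math.dist(x, c) for x in members)
    return out

# Hypothetical two-cluster result
pts    = [(0, 0), (0, 3), (10, 10), (10, 14)]
labels = [0, 0, 1, 1]
cents  = [(0, 1.5), (10, 12)]
radii = max_point_to_centroid(pts, labels, cents)
# In the text, all four such maxima exceed 20, so delta = 20 is chosen.
```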

Now we can turn to the leader algorithm, using this radius $$δ$$ value in the SAS program below, modified to accommodate these changes:

```sas
options ls=78;
title "Cluster Analysis - Woodyard Hammock - K-Means";

/* After reading in the data, an ident variable is created as the
 * row number for each observation. This is needed for the clustering
 * algorithm. The drop statement removes several variables not used
 * for this analysis.
 */

data wood;
  infile 'D:\Statistics\STAT 505\data\wood.csv' firstobs=2 delimiter=',';
  input x y acerub carcar carcor cargla cercan corflo faggra frapen
        ileopa liqsty lirtul maggra magvir morrub nyssyl osmame ostvir
        oxyarb pingla pintae pruser quealb quehem quenig quemic queshu quevir
        symtin ulmala araspi cyrrac;
  ident=_n_;
  drop acerub carcor cargla cercan frapen lirtul magvir morrub osmame pintae
       pruser quealb quehem queshu quevir ulmala araspi cyrrac;
run;

/* The observations are sorted by their ident value. */

proc sort data=wood;
  by ident;
run;

/* PROC FASTCLUS is a non-hierarchical (k-means) procedure.
 * The maxclusters option sets the number of clusters. The radius
 * option specifies the minimum distance between initial seeds
 * (the leader algorithm's delta). The maxiter option caps the
 * number of iterations, and out= saves the cluster assignments
 * to the data set clust.
 */

proc fastclus data=wood maxclusters=4 radius=20 maxiter=100 out=clust;
  var carcar corflo faggra ileopa liqsty maggra nyssyl ostvir oxyarb
      pingla quenig quemic symtin;
  id ident;
run;
```

The fastclus procedure is used again, only this time with the leader algorithm options specified.

We set the maximum number of clusters to four and also set the radius to equal 20, the delta value that we decided on earlier.

Again, the output produces the initial cluster of centroids. Given the first site, it will go down the list of the sites until it finds another site that is at least 20 away from this first point. The first one it finds forms the second cluster centroid. Then it goes down the list until it finds another site that is at least 20 away from the first two to form the third cluster centroid. Finally, the fourth cluster is formed by searching until it finds a site that is at least 20 away from the first three.

SAS also provides an iteration history showing how the cluster centroid locations change during each iteration of the algorithm. Here, convergence was achieved after five iterations.

Next, the SAS output provides a cluster summary that gives the number of sites in each cluster. It also tells you which cluster is closest. From this, it seems that Cluster 1 is in the middle because three of the clusters (2,3, and 4) are closest to Cluster 1 and not the other clusters.  The distances between the cluster centroids and their nearest neighboring clusters are reported, i.e., Cluster 1 is 14.3 away from Cluster 4.  The SAS output from all four clusters is in the table below:

| Cluster | Size | Nearest Cluster | Distance to Nearest Cluster |
|---------|------|-----------------|-----------------------------|
| 1       | 28   | 4               | 14.3126                     |
| 2       | 9    | 1               | 17.6003                     |
| 3       | 18   | 1               | 19.3971                     |
| 4       | 17   | 1               | 14.3126                     |

In comparing these spacings with the spacing we found earlier, you will notice that these clusters are more widely spaced than the previously defined clusters.

The output of fastclus also gives the results of individual ANOVAs for each species. However, only the $$r^{2}$$ value for each ANOVA is presented. The $$r^{2}$$ values are computed, as usual, by dividing the model sum of squares by the total sum of squares. These are summarized in the following table:

| Code   | Species            | $$\boldsymbol{r^{2}}$$ | $$\boldsymbol{r^{2}/(1-r^{2})}$$ | F         |
|--------|--------------------|------------------------|----------------------------------|-----------|
| carcar | Ironwood           | 0.785                  | 3.685                            | **82.93** |
| corflo | Dogwood            | 0.073                  | 0.079                            | 1.79      |
| faggra | Beech              | 0.299                  | 0.427                            | **9.67**  |
| ileopa | Holly              | 0.367                  | 0.579                            | **13.14** |
| liqsty | Sweetgum           | 0.110                  | 0.123                            | 2.80      |
| maggra | Magnolia           | 0.199                  | 0.249                            | **5.64**  |
| nyssyl | Blackgum           | 0.124                  | 0.142                            | 3.21      |
| ostvir | Blue Beech         | 0.581                  | 1.387                            | **31.44** |
| oxyarb | Sourwood           | 0.110                  | 0.124                            | 2.81      |
| pingla | Spruce Pine        | 0.033                  | 0.034                            | 0.76      |
| quenig | Water Oak          | 0.119                  | 0.135                            | 3.07      |
| quemic | Swamp Chestnut Oak | 0.166                  | 0.199                            | 4.50      |
| symtin | Horse Sugar        | 0.674                  | 2.063                            | **46.76** |

Given $$r^{2}$$ , the F-statistic is:

$$F = \dfrac{r^2/(K-1)}{(1-r^2)/(n-K)}$$

where K-1 is the degrees of freedom between clusters and n-K is the degrees of freedom within clusters.

In our example, n = 72 and K = 4.  If we take the ratio of $$r^{2}$$ divided by 1-$$r^{2}$$, multiply the result by 68, and divide by 3, we arrive at the F-values in the table.

Each of these F-values is tested at K - 1 = 3 and n - K = 68 degrees of freedom.  Using the Bonferroni correction, the critical value for an $$α = 0.05$$ level test is $$F_{3,68,0.05/13} = 4.90$$. Therefore, anything above 4.90 will be significant here.  In this case, the species in boldface in the table above are the species where the F-value is above 4.90.
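This arithmetic can be checked with a short Python sketch. The $$r^{2}$$ values are copied from the table above; small differences from the reported F column come from rounding in the printed $$r^{2}$$:

```python
def f_from_r2(r2, n, K):
    """F statistic for a one-way ANOVA recovered from r-squared:
    F = (r2 / (K - 1)) / ((1 - r2) / (n - K))."""
    return (r2 / (K - 1)) / ((1 - r2) / (n - K))

n, K = 72, 4  # 72 sites, 4 clusters -> df = 3 and 68

r2 = {"carcar": 0.785, "corflo": 0.073, "faggra": 0.299, "ileopa": 0.367,
      "liqsty": 0.110, "maggra": 0.199, "nyssyl": 0.124, "ostvir": 0.581,
      "oxyarb": 0.110, "pingla": 0.033, "quenig": 0.119, "quemic": 0.166,
      "symtin": 0.674}

F = {sp: f_from_r2(v, n, K) for sp, v in r2.items()}

# Species exceeding the Bonferroni critical value of 4.90
significant = sorted(sp for sp, f in F.items() if f > 4.90)
```

Running this recovers the six boldfaced species, with quemic (F ≈ 4.51) falling just short of the 4.90 cutoff.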

Let's look at the cluster means for the significant species identified above.  The species and the species' means are listed in the table below.  As before, the larger numbers within each row are boldfaced.  As a result, you can see that ironwood is most abundant in Cluster 3, Beech is most abundant in Cluster 1, and so forth...

| Species     | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|-------------|-----------|-----------|-----------|-----------|
| Ironwood    | 4.1       | 7.2       | **21.2**  | 2.1       |
| Beech       | **11.1**  | 6.1       | 5.7       | 6.2       |
| Holly       | 5.5       | 5.9       | 4.4       | **13.2**  |
| Magnolia    | **5.3**   | 3.3       | 2.8       | 3.0       |
| Blue Beech  | 4.5       | 5.3       | 2.4       | **14.6**  |
| Horse Sugar | 0.9       | **16.1**  | 0.6       | 2.2       |

Looking down the columns of the table, we can characterize the individual clusters:

• Cluster 1: Primarily Beech and Magnolia: These are the large canopy species typical of old-growth forests.
• Cluster 2: Primarily Horse Sugar: These are small understory species typical of small-scale disturbances (light gaps) in the forest.
• Cluster 3: Primarily Ironwood: This is an understory species typical of wet habitats.
• Cluster 4: Primarily Holly and Blue Beech: These are understory species typical of dry habitats.
