12.8 - R Scripts (Agglomerative Clustering)

1. Acquire Data

Diabetes data

The diabetes data set is the Pima Indians Diabetes Database, which originates from the UCI machine learning repository and is available on Kaggle.

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes: with or without signs of diabetes

Load data into R as follows:

# Set the working directory (adjust the path to where the data file is stored)
setwd("C:/STAT 897D data mining")
# The file is comma-delimited and has no header row
RawData = read.table("diabetes.data", sep = ",", header = FALSE)

In RawData, the last column is the response variable and the remaining columns are the predictor variables.

# Last column: class label
responseY = RawData[, ncol(RawData)]
# All other columns: predictors
predictorX = RawData[, 1:(ncol(RawData) - 1)]
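The same read-and-split pattern can be tried without the data file. The sketch below is only an illustration: the three rows in toy_csv are made-up values, not rows from the diabetes data, and the names toy, toyY and toyX are hypothetical.

```r
# Made-up comma-delimited text with no header: two predictors, then a 0/1 label
toy_csv <- "1.0,10,0
2.0,20,1
3.0,30,0"

# read.table accepts the text directly through its 'text' argument
toy <- read.table(text = toy_csv, sep = ",", header = FALSE)

# Last column is the response, the rest are predictors
toyY <- toy[, ncol(toy)]
toyX <- toy[, 1:(ncol(toy) - 1)]

print(dim(toyX))    # number of rows and number of predictor columns
print(names(toyX))  # read.table assigns default names when header = FALSE
```

Because header = FALSE, R names the columns V1, V2, and so on, which is why the columns are selected by position rather than by name.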

2. Agglomerative Clustering

In R, the cluster package implements agglomerative hierarchical clustering through the agglomerative nesting function agnes. Its first argument, x, is either a data matrix or a dissimilarity matrix, depending on the value of the diss argument. If diss = TRUE, x is treated as a dissimilarity matrix. If diss = FALSE, x is treated as a matrix of observations, with one row per observation. Setting stand = TRUE standardizes the data matrix before the dissimilarities are calculated.

Each variable (a column of the data matrix) is standardized by subtracting its mean and then dividing the result by its mean absolute deviation. If x is already a dissimilarity matrix, the stand argument is ignored.
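The standardization can be reproduced by hand as a check. The sketch below uses a small made-up matrix (toy) and a hypothetical helper mean_abs_dev; it assumes that standardizing the columns manually and calling agnes with stand = FALSE should give the same merge heights as calling agnes with stand = TRUE on the raw values.

```r
library(cluster)

# Made-up matrix: 6 observations, 2 variables on very different scales
toy <- matrix(c(1, 2, 3, 4, 5, 9,
                10, 40, 20, 80, 50, 100), ncol = 2)

# Mean absolute deviation of a vector around its mean
mean_abs_dev <- function(v) mean(abs(v - mean(v)))

# Standardize each column: subtract the mean, divide by the mean absolute deviation
toy_std <- apply(toy, 2, function(v) (v - mean(v)) / mean_abs_dev(v))

# After standardization each column has mean 0 and mean absolute deviation 1
print(colMeans(toy_std))
print(apply(toy_std, 2, mean_abs_dev))

# Let agnes standardize, or pass in the hand-standardized matrix
fit_auto   <- agnes(toy, diss = FALSE, stand = TRUE, method = "average")
fit_manual <- agnes(toy_std, diss = FALSE, stand = FALSE, method = "average")

# Compare the merge heights of the two fits
same_heights <- isTRUE(all.equal(fit_auto$height, fit_manual$height))
print(same_heights)
```

Standardizing matters here because the second variable has a much larger range than the first and would otherwise dominate the Euclidean distances.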

When two clusters are merged into a new cluster, the method argument determines how the distance between clusters is measured. method = "single" gives single linkage clustering, method = "complete" gives complete linkage clustering, and method = "average" gives average linkage clustering. The default is method = "average".
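To make the three between-cluster distances concrete, the sketch below computes them by hand for two tiny made-up clusters, A and B, using only base R. The point coordinates are hypothetical and chosen so that the arithmetic is easy to follow.

```r
# Two made-up clusters of points on a line (second coordinate is 0)
A <- matrix(c(0, 0,
              1, 0), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 0,
              6, 0), ncol = 2, byrow = TRUE)

# Euclidean distances between every point in A (rows) and every point in B (columns)
D <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

single_link   <- min(D)   # closest pair:  |1 - 4| = 3
complete_link <- max(D)   # farthest pair: |0 - 6| = 6
average_link  <- mean(D)  # mean of all pairs: (4 + 6 + 3 + 5) / 4 = 4.5

print(c(single = single_link, complete = complete_link, average = average_link))
```

Single linkage uses the smallest pairwise distance, complete linkage the largest, and average linkage the mean over all pairs, so the average linkage value always lies between the other two.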

For clarity of illustration, we run the agglomerative nesting algorithm (agnes) on only the first 25 observations. The function as.dendrogram takes the clustering result returned by agnes and converts it into a dendrogram that can be plotted.

library(cluster)
# Average linkage on the first 25 observations, with standardized variables
agn = agnes(x = predictorX[1:25, ], diss = FALSE, stand = TRUE,
      method = "average")
DendAgn = as.dendrogram(agn)
plot(DendAgn)

Figure 1 shows the result of average linkage clustering.

Figure 1: Result of average linkage clustering. The numbers at the bottom are the indices for the data points, ranging from 1 to 25.

Figure 2 shows the result of single linkage clustering, produced by the code below.

# Single linkage on the same 25 observations
agn = agnes(x = predictorX[1:25, ], diss = FALSE, stand = TRUE,
      method = "single")
DendAgn = as.dendrogram(agn)
plot(DendAgn)

Figure 2: Result of single linkage clustering.

Figure 3 shows the result of complete linkage clustering, produced by the code below.

# Complete linkage on the same 25 observations
agn = agnes(x = predictorX[1:25, ], diss = FALSE, stand = TRUE,
      method = "complete")
DendAgn = as.dendrogram(agn)
plot(DendAgn)

Figure 3: Result of complete linkage clustering.
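Beyond plotting, the three linkage methods can be compared numerically and turned into cluster labels. The sketch below is one possible way to do this and is not part of the original scripts: it uses made-up, well-separated synthetic data (toy) instead of the diabetes data, reads the agglomerative coefficient stored in the ac component of each agnes result, and cuts the average linkage tree into two groups with cutree after converting it with as.hclust. The names toy, fits, ac and groups are hypothetical.

```r
library(cluster)

# Made-up data: two well-separated groups of 10 points each
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))

# Fit agnes once per linkage method
methods <- c("average", "single", "complete")
fits <- lapply(methods, function(m) {
  agnes(toy, diss = FALSE, stand = TRUE, method = m)
})
names(fits) <- methods

# Agglomerative coefficient of each fit (values closer to 1 suggest stronger structure)
ac <- sapply(fits, function(f) f$ac)
print(ac)

# Cut the average linkage tree into 2 clusters and count the members of each
groups <- cutree(as.hclust(fits[["average"]]), k = 2)
print(table(groups))
```

On the diabetes data, the same cutree call could be applied to agn, and the resulting labels tabulated against responseY[1:25] to see how the clusters relate to the two classes.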
