12.8 - R Scripts (Agglomerative Clustering)

1) Acquire Data
Diabetes data
The diabetes data set is taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes: with or without signs of diabetes
Load data into R as follows:
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData <- read.table("diabetes.data",sep = ",",header=FALSE)
In RawData, the last column is the response variable and the remaining columns are the predictor variables.
# the response is the last column
responseY <- RawData[, ncol(RawData)]
# the predictors are all columns except the last
predictorX <- RawData[, 1:(ncol(RawData) - 1)]
2) Agglomerative Clustering
In R, the cluster package implements agglomerative hierarchical clustering through the function agnes (agglomerative nesting). The first argument, x, is either a data matrix or a dissimilarity matrix, depending on the value of the diss argument. If diss=TRUE, x is treated as a dissimilarity matrix. If diss=FALSE, x is treated as a matrix of observations. Setting stand = TRUE standardizes the data matrix before the dissimilarities are calculated.
Each variable (a column in the data matrix) is standardized by first subtracting the mean value of the variable and then dividing the result by the mean absolute deviation of the variable. If x is already a dissimilarity matrix, this argument will be ignored.
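The standardization described above can be reproduced by hand. The following is a minimal sketch on simulated data; the matrix sim_x and the helper standardize_mad are illustrative names, not part of the cluster package. It also passes a precomputed dissimilarity matrix to agnes with diss = TRUE, assuming Euclidean distance (the default of both dist and agnes).

```r
library(cluster)

set.seed(1)
# Simulated data standing in for a real data matrix: 10 observations, 4 variables
sim_x <- matrix(rnorm(40, mean = 5, sd = 2), nrow = 10, ncol = 4)

# Hypothetical helper: subtract the mean, then divide by the mean absolute deviation
standardize_mad <- function(v) {
  centered <- v - mean(v)
  centered / mean(abs(centered))
}
sim_std <- apply(sim_x, 2, standardize_mad)

# Let agnes standardize the raw observations itself
fit_obs <- agnes(x = sim_x, diss = FALSE, stand = TRUE, method = "average")

# Supply dissimilarities computed from the hand-standardized data
fit_diss <- agnes(x = dist(sim_std), diss = TRUE, method = "average")

# Compare the merge heights of the two fits, allowing for floating-point error
same_heights <- isTRUE(all.equal(sort(fit_obs$height), sort(fit_diss$height)))
```

After standardization each column of sim_std has mean 0 and mean absolute deviation 1. If the two fits agree, same_heights is TRUE, which is what one would expect when agnes applies the same standardization internally.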
The argument method specifies how the distance between two clusters is measured when choosing which pair of clusters to merge: method="single" gives single linkage clustering, method="complete" gives complete linkage clustering, and method="average" gives average linkage clustering. The default is method="average".
For clarity of illustration, we run agnes on only the first 25 observations. The function as.dendrogram converts the clustering result returned by agnes into a dendrogram, which can then be drawn with plot.
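To make the three between-cluster distances concrete, here is a small sketch in base R. The two clusters A and B are made-up points chosen for easy arithmetic, and Euclidean distance is assumed; the code only illustrates the definitions and is not how agnes computes them internally.

```r
# Two small hypothetical clusters of points in the plane
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(4, 0), c(6, 0))

# Euclidean distances between every point in A (rows) and every point in B (columns)
all_d <- as.matrix(dist(rbind(A, B)))
cross_d <- all_d[1:nrow(A), nrow(A) + 1:nrow(B)]

single_link   <- min(cross_d)   # distance between the closest pair: 3
complete_link <- max(cross_d)   # distance between the farthest pair: 6
average_link  <- mean(cross_d)  # mean over all four pairs: (4 + 6 + 3 + 5) / 4 = 4.5
```

Single linkage looks only at the closest pair of points, complete linkage only at the farthest pair, and average linkage at all pairs, which is why the three methods can produce quite different trees on the same data.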
library(cluster)
agn <- agnes(x = predictorX[1:25, ], diss = FALSE, stand = TRUE,
             method = "average")
DendAgn <- as.dendrogram(agn)
plot(DendAgn)
Figure 1 shows the result of average linkage clustering.
Figure 1: Result of average linkage clustering. The numbers at the bottom are the indices for the data points, ranging from 1 to 25.
Figure 2 shows the result of single linkage clustering, produced by the code below.
agn <- agnes(x = predictorX[1:25, ], diss = FALSE, stand = TRUE,
             method = "single")
DendAgn <- as.dendrogram(agn)
plot(DendAgn)
Figure 2: Result of single linkage clustering
Figure 3 shows the result of complete linkage clustering, produced by the code below.
agn <- agnes(x = predictorX[1:25, ], diss = FALSE, stand = TRUE,
             method = "complete")
DendAgn <- as.dendrogram(agn)
plot(DendAgn)
Figure 3: Result of complete linkage clustering
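The three runs above differ only in the method argument, so they can also be written as a loop. The sketch below is one way to do that, using simulated data in place of predictorX[1:25, ] so that it runs without the data file. The matrix sim_x and the choice of two clusters (k = 2) are illustrative assumptions, not part of the original analysis.

```r
library(cluster)

set.seed(42)
# Simulated stand-in for predictorX[1:25, ]: 25 observations, 8 variables
sim_x <- matrix(rnorm(25 * 8), nrow = 25, ncol = 8)

# Fit agnes once per linkage method
linkages <- c("average", "single", "complete")
fits <- lapply(linkages, function(m) {
  agnes(x = sim_x, diss = FALSE, stand = TRUE, method = m)
})
names(fits) <- linkages

# Agglomerative coefficient stored by agnes for each fit
ac <- sapply(fits, function(f) f$ac)

# Cut each tree into two clusters (k = 2 is an arbitrary illustrative choice)
labels <- lapply(fits, function(f) cutree(as.hclust(f), k = 2))

# Dendrograms can be drawn as before, for example:
# plot(as.dendrogram(fits[["single"]]))
```

Each element of labels is a vector of cluster memberships, one per observation, which makes it easy to tabulate how the three linkages partition the same points, for example with table(labels$single, labels$complete).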