1. Acquire Data
Diabetes data
The diabetes data set is the Pima Indians Diabetes Database, taken from the UCI machine learning database on Kaggle.
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes: with or without signs of diabetes
Load data into R as follows:
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData = read.table("diabetes.data",sep = ",",header=FALSE)
In RawData, the last column is the response variable and the remaining columns are the predictor variables.
responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]
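The column split above can be checked on a small stand-in table. The following is only a sketch: the data frame `toy` and its values are made up for illustration (they are not the diabetes data), and `ncol()` is used as a shorthand that returns the same value as `dim(...)[2]`.

```r
# Toy stand-in for RawData: 3 predictor columns plus a 0/1 response in the last column.
# (Hypothetical values for illustration only; not the diabetes data.)
toy = data.frame(V1 = c(1, 2, 3, 4),
                 V2 = c(10, 20, 30, 40),
                 V3 = c(0.5, 0.1, 0.9, 0.3),
                 V4 = c(0, 1, 0, 1))

p = ncol(toy)                           # same value as dim(toy)[2]
toyY = toy[, p]                         # last column: response
toyX = toy[, 1:(p - 1), drop = FALSE]   # remaining columns: predictors

dim(toyX)      # rows and columns of the predictor block
length(toyY)   # one response value per row
```

Here `drop = FALSE` keeps `toyX` a data frame even if only one predictor column were left.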
For convenience of visualization, we take the first two principal components as the new feature variables and run k-means only on these two-dimensional data.
pca = princomp(predictorX, cor=TRUE) # principal components analysis using correlation matrix
pc.comp = pca$scores
pc.comp1 = -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 = -1*pc.comp[,2] # principal component 2 scores (negated for convenience)
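To see what this step does without the diabetes file, the sketch below repeats it on R's built-in `iris` measurements. The stand-in data and the names `toyX`, `toyPca` and `scores` are assumptions made for illustration. It also checks that negating the scores, as done above, does not change the distances between observations, which is what k-means depends on.

```r
# Built-in iris measurements as stand-in data (an assumption; not the diabetes predictors)
toyX = iris[, 1:4]

toyPca = princomp(toyX, cor = TRUE)   # PCA on the correlation matrix
scores = toyPca$scores[, 1:2]         # first two principal component scores

# With cor = TRUE the component variances sum to the number of variables
sum(toyPca$sdev^2)

# Proportion of total variance kept by the first two components
sum(toyPca$sdev[1:2]^2) / sum(toyPca$sdev^2)

# Flipping the signs of the scores leaves all pairwise distances unchanged
all.equal(as.vector(dist(scores)), as.vector(dist(-scores)))
```

The sign of a principal component is arbitrary, so the negation only changes the orientation of the plot.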
2. K-Means
In R, kmeans() performs the K-means clustering analysis. In the returned object, $cluster gives the cluster assignment of each observation and $centers gives the centroid vector (i.e., the mean) of each cluster.
X = cbind(pc.comp1, pc.comp2)
cl = kmeans(X,13)
cl$cluster
plot(pc.comp1, pc.comp2,col=cl$cluster)
points(cl$centers, pch=16)
We take k = 13 (as in the lecture notes) as the number of clusters in the K-means analysis. Figure 1 shows the resulting scatter plot, with points colored by cluster. The solid black circles mark the cluster centers.
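Because kmeans() starts from randomly chosen centers, repeated runs can give different clusterings. The sketch below, on simulated points rather than the diabetes scores, fixes the random seed and confirms that each row of $centers is the mean of the points assigned to that cluster. The names `toyX`, `toyCl` and `manualCenters`, the choice of 3 clusters, and the `nstart` and `iter.max` settings are assumptions made for illustration.

```r
set.seed(1)                                       # fix the seed so the run can be reproduced
toyX = cbind(x1 = rnorm(150), x2 = rnorm(150))    # simulated two-dimensional points

# 3 clusters; keep the best of 10 random starts
toyCl = kmeans(toyX, centers = 3, nstart = 10, iter.max = 50)

table(toyCl$cluster)   # cluster sizes
toyCl$centers          # one row of coordinates per cluster

# Each center is the mean of the points assigned to that cluster:
# per-cluster column sums divided by the cluster sizes
manualCenters = rowsum(toyX, toyCl$cluster) / as.vector(table(toyCl$cluster))
all.equal(unname(manualCenters), unname(toyCl$centers))
```

Calling set.seed() before kmeans() in the analysis above would make its clustering reproducible in the same way.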
