1. Acquire Data
The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes: with or without signs of diabetes
Load data into R as follows:
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData = read.table("diabetes.data", sep = ",", header = FALSE)
In RawData, the response variable is the last column, and the remaining columns are the predictor variables.
# dim(RawData) returns c(rows, columns); the second element is the number of columns
responseY = RawData[, dim(RawData)[2]]
predictorX = RawData[, 1:(dim(RawData)[2] - 1)]
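The split can be checked without the data file. The following is a minimal sketch on a synthetic stand-in for RawData; the names toy, toyY and toyX, the random values and the class probability are illustrative assumptions, not part of the diabetes data.

```r
# Synthetic stand-in for RawData (hypothetical values):
# 8 quantitative predictors plus a 0/1 response in the last column
set.seed(1)
toy = as.data.frame(matrix(rnorm(768 * 8), nrow = 768, ncol = 8))
toy$V9 = rbinom(768, size = 1, prob = 0.35)

p = ncol(toy)              # index of the last column
toyY = toy[, p]            # response: the last column
toyX = toy[, 1:(p - 1)]    # predictors: all remaining columns

dim(toyX)                  # rows and columns of the predictor block
table(toyY)                # number of samples in each class
```

Running dim() and table() on the real responseY and predictorX in the same way is a quick way to confirm that the file was read with the expected shape.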
For convenience of visualization, we take the first two principal components as the new feature variables and conduct k-means only on these two-dimensional data.
pca = princomp(predictorX, cor=T) # principal components analysis using correlation matrix
pc.comp = pca$scores
pc.comp1 = -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 = -1*pc.comp[,2] # principal component 2 scores (negated for convenience)
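Negating the scores is harmless for clustering, because the sign of a principal component is arbitrary and k-means depends only on Euclidean distances between points. The sketch below illustrates this on synthetic data; toyX, toy.pca and the other names are hypothetical stand-ins, not the diabetes predictors. It also computes the share of total variance kept by the first two components, which indicates how much information the two-dimensional view retains.

```r
set.seed(1)
# Synthetic stand-in for predictorX (hypothetical data, 4 variables)
toyX = matrix(rnorm(200 * 4), nrow = 200, ncol = 4)
toyX[, 2] = toyX[, 1] + 0.5 * toyX[, 2]   # make two variables correlated

toy.pca = princomp(toyX, cor = TRUE)      # PCA on the correlation matrix
scores = toy.pca$scores[, 1:2]            # first two component scores
flipped = -1 * scores                     # negated scores

# Negation changes neither the variance of each component ...
var.orig = apply(scores, 2, var)
var.flip = apply(flipped, 2, var)

# ... nor the pairwise distances that k-means works with
same.dist = isTRUE(all.equal(as.vector(dist(scores)), as.vector(dist(flipped))))

# Proportion of total variance retained by the first two components
prop2 = sum(toy.pca$sdev[1:2]^2) / sum(toy.pca$sdev^2)
```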
In R, kmeans performs the K-means clustering analysis. In the returned object, $cluster provides the clustering results (the cluster assigned to each sample) and $centers provides the centroid vector (i.e., the mean) for each cluster.
X = cbind(pc.comp1, pc.comp2)
cl = kmeans(X,13)
cl$cluster
plot(pc.comp1, pc.comp2,col=cl$cluster)
points(cl$centers, pch=16)
Take k = 13 (as in the lecture notes) as the number of clusters in the K-means analysis. Figure 1 shows the resulting scatter plot with different clusters in different colors. The solid black circles are the centers of the clusters.
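Since kmeans starts from randomly chosen centers, repeated runs can give different clusterings. The sketch below, on synthetic two-dimensional data, fixes the seed and uses several random starts; it then checks that each row of $centers is the mean of the points assigned to that cluster. The names toyX, toy.cl and manual, the seed value, and the nstart and iter.max settings are illustrative assumptions, not choices made in the lecture notes.

```r
set.seed(897)   # fix the seed so the random starting centers are reproducible
# Synthetic stand-in for the two principal component scores (hypothetical data)
toyX = cbind(pc1 = rnorm(300), pc2 = rnorm(300))

# nstart = 20 runs k-means from 20 random starts and keeps the best solution
toy.cl = kmeans(toyX, centers = 13, nstart = 20, iter.max = 50)

# Recompute each centroid as the mean of the points assigned to that cluster
manual = apply(toyX, 2, function(v) tapply(v, toy.cl$cluster, mean))

max(abs(manual - toy.cl$centers))   # should be numerically zero
table(toy.cl$cluster)               # number of points in each cluster
```

The same check can be applied to cl and X from the diabetes analysis by replacing toy.cl and toyX.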