# 12.5 - R Scripts (K-means clustering)

12.5 - R Scripts (K-means clustering)

## R

### 1. Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database

• 768 samples in the dataset
• 8 quantitative variables
• 2 classes; with or without signs of diabetes

Load data into R as follows:

# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData = read.table("diabetes.data",sep = ",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]

For the convenience of visualization, we take the first two principle components as the new feature variables and conduct k-means only on these two dimensional data.

pca = princomp(predictorX, cor=T) # principal components analysis using correlation matrix
pc.comp = pca$scores pc.comp1 = -1*pc.comp[,1] # principal component 1 scores (negated for convenience) pc.comp2 = -1*pc.comp[,2] # principal component 2 scores (negated for convenience) ### 2. K-Means In R, kmeans performs the K-means clustering analysis, ()\$cluster provides the clustering results and ()\$centers provides the centroid vector (i.e., the mean) for each cluster. X = cbind(pc.comp1, pc.comp2) cl = kmeans(X,13) cl\$cluster
plot(pc.comp1, pc.comp2,col=cl\$cluster) points(cl$centers, pch=16)

Take k = 13 (as in the lecture note) as the number of clusters in K-means analysis. Figure 1 shows the resulting scatter plot with different clusters in different colors. The solid black circles are the centers of the clusters.

 [1] Link ↥ Has Tooltip/Popover Toggleable Visibility