# 12.5 - R Scripts (K-means clustering)

## R

### 1. Acquire Data

**Diabetes data**

The diabetes data set is the Pima Indians Diabetes Database, which originates from the UCI Machine Learning Repository and is available on Kaggle.

- 768 samples in the dataset
- 8 quantitative variables
- 2 classes: with or without signs of diabetes

Load data into R as follows:

```
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData = read.table("diabetes.data",sep = ",",header=FALSE)
```

In `RawData`, the last column is the response variable and the remaining columns are the predictor variables.

```
responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]
```

For ease of visualization, we take the first two principal components as the new feature variables and run k-means only on this two-dimensional data.

```
pca = princomp(predictorX, cor=TRUE) # principal components analysis using correlation matrix
pc.comp = pca$scores
pc.comp1 = -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 = -1*pc.comp[,2] # principal component 2 scores (negated for convenience)
```
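Because k-means will see only two of the original dimensions, it can help to check how much of the total variance those two components retain. The following is a minimal sketch, not part of the original script. It uses R's built-in `USArrests` data as a stand-in for `predictorX` so that it runs without the diabetes file; the names `X_demo`, `pca_demo`, `var_explained`, and `cum_two` are illustrative.

```r
# Stand-in for predictorX (assumption: any all-numeric data frame works the same way)
X_demo <- USArrests

# PCA on the correlation matrix, mirroring the call in the script above
pca_demo <- princomp(X_demo, cor = TRUE)

# Proportion of the total variance carried by each component
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# Cumulative proportion retained by the first two components
cum_two <- sum(var_explained[1:2])

print(round(var_explained, 3))
print(cum_two)
```

If the cumulative proportion is low, clusters found in the two-dimensional projection may not reflect the structure of the full data, so the plot should be read mainly as a visualization aid.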

### 2. K-Means

In R, `kmeans` performs K-means clustering. In the returned object (here `cl`), `cl$cluster` gives the cluster assignment of each observation and `cl$centers` gives the centroid (i.e., the mean vector) of each cluster.

```
X = cbind(pc.comp1, pc.comp2)
cl = kmeans(X,13)
cl$cluster
plot(pc.comp1, pc.comp2,col=cl$cluster)
points(cl$centers, pch=16)
```

We take **k** = 13 (as in the lecture notes) as the number of clusters in the K-means analysis. Figure 1 shows the resulting scatter plot, with different clusters drawn in different colors. The solid black circles mark the centers of the clusters.
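`kmeans` starts from randomly chosen centers, so the cluster labels, and sometimes the partition itself, can change from run to run. Below is a minimal sketch of how one might make the result repeatable and confirm that `$centers` holds the per-cluster means. The simulated matrix `X_demo`, the seed, `k_demo = 3`, and `nstart = 25` are illustrative assumptions, not values from the lecture notes.

```r
set.seed(1)  # arbitrary seed, chosen only so the example is repeatable

# Simulated two-dimensional data standing in for cbind(pc.comp1, pc.comp2)
X_demo <- cbind(pc1 = rnorm(150), pc2 = rnorm(150))

k_demo <- 3
# nstart = 25 reruns the algorithm from 25 random starts and keeps the best fit
cl_demo <- kmeans(X_demo, centers = k_demo, nstart = 25)

# Recompute the centroids by hand: mean of each column within each cluster
manual_centers <- apply(X_demo, 2, function(v) tapply(v, cl_demo$cluster, mean))

# Compare the hand-computed means with the centers reported by kmeans
print(all.equal(unname(manual_centers), unname(cl_demo$centers)))

# Total within-cluster sum of squares, the quantity k-means tries to minimize
print(cl_demo$tot.withinss)
```

Setting a seed fixes one particular run, while `nstart` reduces the chance of keeping a poor local optimum; the two address different sources of variability and can be used together.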