
1) Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database repository at: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes: with or without signs of diabetes

Load data into R as follows:

# set the working directory
setwd("C:/STAT 897D data mining")
# comma-delimited data with no header row
RawData <- read.table("diabetes.data", sep = ",", header = FALSE)

In RawData, the last column is the response variable, and the remaining columns are the predictor variables.

responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])
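Since the data file lives on a local path, here is a minimal sketch of the same split using simulated values standing in for RawData; it confirms that the indexing above separates the 8 predictors from the last (response) column:

```r
# Simulated stand-in for RawData: 8 quantitative predictors plus a 0/1 response column
set.seed(1)
RawDemo <- data.frame(matrix(rnorm(768 * 8), nrow = 768), class = rbinom(768, 1, 0.35))
respDemo <- as.matrix(RawDemo[, dim(RawDemo)[2]])          # last column = response
predDemo <- as.matrix(RawDemo[, 1:(dim(RawDemo)[2] - 1)])  # remaining columns = predictors
dim(predDemo)  # 768 rows, 8 columns
```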

For the convenience of visualization, we take the first two principal components as the new feature variables and conduct k-NN only on these two-dimensional data.

pca <- princomp(predictorX, cor=T) # principal components analysis using correlation matrix
pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1] # first PC scores (sign flipped; PC signs are arbitrary)
pc.comp2 <- -1*pc.comp[,2] # second PC scores (sign flipped)
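It is worth checking how much of the total variance the first two components retain before discarding the rest. A sketch using simulated predictors in place of predictorX (the real file is local):

```r
# Simulated predictors standing in for predictorX
set.seed(2)
demoX <- matrix(rnorm(768 * 8), nrow = 768, ncol = 8)
pca.demo <- princomp(demoX, cor = TRUE)
# proportion of variance explained per component; with cor = TRUE
# the component variances are the squared singular values of the
# standardized data, so the proportions sum to 1
prop.var <- pca.demo$sdev^2 / sum(pca.demo$sdev^2)
cumsum(prop.var)[2]  # cumulative share captured by the first two PCs
```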

2) K-Nearest-Neighbor

In R, the knn function in the class package performs k-NN classification. Again, we take the first two PCA components as the predictor variables.

library(class)
X.train <- cbind(pc.comp1, pc.comp2)
model.knn <- knn(train=X.train, test=X.train, cl=responseY, k=19, prob=T)

In the above example, we set k = 19 for the following reason. To assess the classification accuracy, random-split cross-validation is used to compute the error rate: the samples are split into 80% for the training dataset and 20% for the test dataset. Fig. 1 shows the misclassification rates for different values of k in k-NN.
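The random-split procedure described above can be sketched as follows. The data here are simulated stand-ins for the two PCA scores and the class labels (the real data file is local), so the resulting error curve will differ from Fig. 1:

```r
library(class)
set.seed(3)
n <- 768
X <- cbind(rnorm(n), rnorm(n))   # stand-in for cbind(pc.comp1, pc.comp2)
y <- factor(rbinom(n, 1, 0.35))  # stand-in for responseY
# random 80/20 split into training and test sets
test.idx <- sample(n, size = round(0.2 * n))
ks <- seq(1, 25, by = 2)
test.err <- sapply(ks, function(k) {
  pred <- knn(train = X[-test.idx, ], test = X[test.idx, ],
              cl = y[-test.idx], k = k)
  mean(pred != y[test.idx])      # misclassification rate on the test set
})
# ks[which.min(test.err)] gives the k with the smallest test error for this split
```

In practice the split (or an average over repeated splits) is done on the actual PCA scores, and the k minimizing the test error is selected.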


Figure 1: Training and test error rates based on cross-validation by K-NN with different k’s.

Based on Fig. 1, k = 19 yields the smallest test error rate. Therefore, we choose 19 as the best number of neighbors for k-NN in this example. The classification result based on k = 19 is shown in the scatter plot of Fig. 2. The red circles represent Class 1 (with diabetes) and the blue circles Class 0 (without diabetes).


Figure 2: Classification result based on KNN with k = 19 for the Diabetes data.

The class package also provides leave-one-out cross-validation via the knn.cv function, which in this case suggests k = 17 is optimal. Results are very similar to those for k = 19.
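A sketch of the leave-one-out approach with knn.cv, again on simulated stand-in data: each observation is classified by its k nearest neighbors among all the remaining observations, so no separate test split is needed.

```r
library(class)
set.seed(4)
n <- 200
X <- cbind(rnorm(n), rnorm(n))   # stand-in for the two PCA score columns
y <- factor(rbinom(n, 1, 0.35))  # stand-in for responseY
# knn.cv classifies each row using its neighbors among the other rows
loo.pred <- knn.cv(train = X, cl = y, k = 17)
loo.err <- mean(loo.pred != y)   # leave-one-out misclassification rate
```

Evaluating loo.err over a grid of k values (as in the random-split sketch above) is how the k = 17 choice would be found on the actual PCA scores.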

After completing the reading for this lesson, please finish the Quiz and R Lab on ANGEL (check the course schedule for due dates).