9.2.10 - R Scripts

1) Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database repository at: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.

768 samples in the dataset
8 quantitative variables
2 classes; with or without signs of diabetes

Load data into R as follows:

# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData <- read.table("diabetes.data",sep = ",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])

For the convenience of visualization, we take the first two principle components as the new feature variables and conduct k-means only on these two dimensional data.

pca <- princomp(predictorX, cor=T) # principal components analysis using correlation matrix

pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 <- -1*pc.comp[,2] # principal component 2 scores (negated for convenience)

2) Linear Discriminant Analysis

The MASS package contains functions for performing linear and quadratic discriminant analysis. Both estimate the class prior probabilities by the proportion of data in each class, unless prior probabilities are specified. The code below performs LDA.

library(MASS)
model.lda <- lda(responseY ~ pc.comp1+pc.comp2)

predict()$class predicts a class label for a new sample point using an obtained model, e.g., LDA model.

predict(model.lda)$class

The classification result by LDA is shown in Figure 1. The red circles correspond to Class 1 (with diabetes), the blue circles to Class 0 (non-diabetes).

R output