R Scripts

1) Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database repository at: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes; with or without signs of diabetes

Load data into R as follows:

# set the working directory
> setwd("C:/STAT 897D data mining");
# comma delimited data and no header for each variable
> RawData <- read.table("diabetes.data",sep = ",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

> responseY <- as.matrix(RawData[,dim(RawData)[2]]);
> predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)]);

For the convenience of visualization, we take the first two principle components as the new feature variables and conduct k-means only on these two dimensional data.

> pc.comp <- princomp(scale(predictorX))$scores
> pc.comp1 <- -1*pc.comp[,1]; 
> pc.comp2 <- -1*pc.comp[,2];

2) Linear Discriminant Analysis

The MASS package contains functions for performing linear and quadratic discriminant analysis. Both estimate the class prior probabilities by the proportion of data in each class, unless prior probabilities are specified. The code below performs LDA.

> library(MASS)
> flda <- lda(responseY ~ pc.comp1+pc.comp2,data=RawData);

predict()$class predicts a class label for a new sample point using an obtained model, e.g., LDA model.

> predict(flda, RawData)$class;

The classification result by LDA is shown in Figure 1. The red circles correspond to Class 1 (with diabetes), the blue circles to Class 0 (non-diabetes).

R output

Figure 1: Classification result by LDA

3) Quadratic Discriminant Analysis

Quadratic discriminant analysis does not assume homogeneity of the covariance matrices of all the class. The following code performs the QDA.

> fqda <- qda(responseY ~ pc.comp1+pc.comp2,data=RawData);

Again, we can use predict()$class to obtain classification result based on the estimated QDA model.

> predict(fqda, RawData)$class;

The classification result by QDA is shown in Figure 2.

R output

Figure 2: Classification result by QDA

4) LDA on Expanded Basis

We consider the expanded bases for the first two principal components, e.g., X1X2, X12, X22 . Therefore, the input variables for LDA are {X1X2X1X2, X12, X22}.

The resulting scatter plot for this model is shown in Figure 3. We can see that the decision boundary is a quadratic curve.

R output

Figure 3: Classification result by LDA on expanded basis