9.1.5 - R Scripts

1) Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database repository at: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes [1].

768 samples in the dataset
8 quantitative variables
2 classes; with or without signs of diabetes

Load data into R as follows:

# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData <- read.table("diabetes.data",sep = ",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])

For the convenience of visualization, we take the first two principal components as the new feature variables and conduct a logistic regression analysis only on these two dimensional data.

pca <- princomp(predictorX, cor=T) # principal components analysis using correlation matrix

pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 <- -1*pc.comp[,2] # principal component 2 scores (negated for convenience)

2) Logistic Regression

In R, glm performs the logistic regression analysis, and ()$fitted.values stores the fitted values for the response variable.

model.logistic <- glm(responseY~pc.comp1+pc.comp2, family=binomial("logit"))

In the case of binary classification, based on the fitted values of the response variable, we classify the sample point as Class 1 if its fitted value is greater than 0.5, otherwise Class 0. The scatter plot in Figure 1 shows the classification by the Logistic regression analysis. The red circles are Class 1 (with diabetes), and the blue circles are Class 0 (non diabetes).

R output

Figure 1: The classification result by logistic regression