9.1.5 - R Scripts

1) Acquire Data
Diabetes data
The diabetes data set is taken from the UCI machine learning database repository at: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes; with or without signs of diabetes
Load data into R as follows:
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData <- read.table("diabetes.data",sep = ",",header=FALSE)
In RawData, the response variable is its last column; and the remaining columns are the predictor variables.
responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])
For the convenience of visualization, we take the first two principal components as the new feature variables and conduct a logistic regression analysis only on these two dimensional data.
pc.comp1 <- -1*pc.comp[,1] # principal component 1 scores (negated for convenience)
pc.comp2 <- -1*pc.comp[,2] # principal component 2 scores (negated for convenience)
2) Logistic Regression
In R, glm performs the logistic regression analysis, and ()$fitted.values stores the fitted values for the response variable.
model.logistic <- glm(responseY~pc.comp1+pc.comp2, family=binomial("logit"))
In the case of binary classification, based on the fitted values of the response variable, we classify the sample point as Class 1 if its fitted value is greater than 0.5, otherwise Class 0. The scatter plot in Figure 1 shows the classification by the Logistic regression analysis. The red circles are Class 1 (with diabetes), and the blue circles are Class 0 (non diabetes).
Figure 1: The classification result by logistic regression