3.5 - R Scripts

1) Acquire Data
Diabetes data
The diabetes data set is taken from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes: with or without signs of diabetes
Save the data into your working directory for this course as "diabetes.data". Then load the data into R as follows:
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData <- read.table("diabetes.data",sep = ",",header=FALSE)
In RawData, the last column is the response variable and the remaining columns are the predictor variables.
responseY <- RawData[,ncol(RawData)]
predictorX <- RawData[,1:(ncol(RawData)-1)]
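As a quick sanity check of the loading and splitting steps, the following sketch writes a tiny comma-delimited file with no header to a temporary location and applies the same calls to it. The toy values and the names toyFile, toyData, toyY, and toyX are made up for illustration and are not part of the diabetes data.

```r
# Toy stand-in for "diabetes.data": 3 rows, 2 predictors, 1 class label (made-up values)
toyFile <- tempfile(fileext = ".data")
writeLines(c("6,148,1", "1,85,0", "8,183,1"), toyFile)

# Same kind of call as for the real file: comma delimited, no header row
toyData <- read.table(toyFile, sep = ",", header = FALSE)

# With header = FALSE, R names the columns V1, V2, ...
names(toyData)

# Last column is the response; the remaining columns are the predictors
toyY <- toyData[, ncol(toyData)]
toyX <- toyData[, 1:(ncol(toyData) - 1)]

dim(toyX)   # rows and columns of the predictor data frame
toyY        # response as a plain vector
```

For the real file, dim(RawData) should report 768 rows and 9 columns (8 predictors plus the class label), matching the description of the data set above.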
2) Fitting a Linear Model
To fit linear regression models in R, use lm, which takes a model specified symbolically as a formula. A typical model has the form response ~ predictors, where response is the (numeric) response vector and predictors is a series of predictor variables.
Take the full model and the base model (no predictors used) as examples:
fullModel <- lm(responseY~predictorX[,1]+predictorX[,2]
+predictorX[,3]+predictorX[,4]+predictorX[,5]
+predictorX[,6]+predictorX[,7]+predictorX[,8])
baseModel <- lm(responseY~1)
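The full model can also be written more compactly by passing a data frame to lm through its data argument and using . to stand for all columns other than the response. The sketch below uses simulated data rather than the diabetes data (simData, explicitFit, compactFit, and baseFit are illustrative names) to show that the two specifications give the same coefficients, and that the base model simply estimates the mean of the response.

```r
set.seed(1)
# Simulated stand-in: 3 predictors and a 0/1 response (not the diabetes data)
simData <- data.frame(V1 = rnorm(50), V2 = rnorm(50), V3 = rnorm(50))
simData$V4 <- as.numeric(simData$V1 + rnorm(50) > 0)

# Explicit form: list every predictor
explicitFit <- lm(V4 ~ V1 + V2 + V3, data = simData)

# Compact form: "." means every column of simData other than the response
compactFit <- lm(V4 ~ ., data = simData)

# Intercept-only (base) model for comparison
baseFit <- lm(V4 ~ 1, data = simData)

# Both specifications yield the same coefficient estimates
all.equal(coef(explicitFit), coef(compactFit))

# The base model's only coefficient is the sample mean of the response
all.equal(unname(coef(baseFit)), mean(simData$V4))
```

With the data loaded as above, the analogous call would be lm(V9 ~ ., data = RawData), assuming the default column names V1 through V9 that read.table assigns when header=FALSE.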
For the full model, $coefficients gives the least squares estimate \(\hat{\beta}\), and $fitted.values gives the fitted values of the response variable.
fullModel$coefficients
fullModel$fitted.values
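To see what these two components contain, the least squares estimate can be reproduced by hand from the normal equations, \(\hat{\beta} = (X^TX)^{-1}X^Ty\), and the fitted values as \(X\hat{\beta}\). The following is a minimal sketch on simulated data, not the diabetes data, and all object names in it are illustrative.

```r
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) beta = X'y
betaHat <- solve(t(X) %*% X, t(X) %*% y)

# Both comparisons should agree up to numerical tolerance
all.equal(as.numeric(betaHat), unname(fit$coefficients))
all.equal(as.numeric(X %*% betaHat), unname(fit$fitted.values))
```

Internally, lm uses a QR decomposition rather than solving the normal equations directly, so the two results match only up to rounding error, which is why all.equal is used instead of an exact comparison.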