3.5 - R Scripts
1. Acquire Data
Diabetes data
The diabetes data set is the Pima Indians Diabetes Database, taken from the UCI Machine Learning collection on Kaggle.
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes: with or without signs of diabetes
Save the data into your working directory for this course as "diabetes.data". Then load the data into R as follows:
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData = read.table("diabetes.data",sep =",",header=FALSE)
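As a quick sanity check on the read.table arguments, the sketch below writes three made-up comma-delimited rows (not real records from the data set) to a temporary file and reads them back the same way. The file name toyFile, the object ToyData, and the values are assumptions for illustration only. With header=FALSE, R assigns the default column names V1, V2, and so on.

```r
# Toy file with made-up values: 8 predictors plus a 0/1 class label per row
toyFile = tempfile(fileext = ".data")
writeLines(c("1,100,70,20,80,25.5,0.3,30,0",
             "2,150,80,30,120,33.1,0.6,45,1",
             "0,90,60,15,60,22.0,0.2,25,0"), toyFile)

# Same call as for diabetes.data: comma delimited, no header line
ToyData = read.table(toyFile, sep = ",", header = FALSE)

dim(ToyData)    # number of rows and columns that were read
names(ToyData)  # default names V1 ... V9 when header=FALSE
```

For the real file, dim(RawData) should report 768 rows and 9 columns (8 predictors plus the class label) if the data loaded as described above.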
In RawData, the response variable is the last column, and the remaining columns are the predictor variables.
responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]
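The same split can also be written with ncol(), which may be easier to read than dim(...)[2]. The sketch below uses a small made-up data frame (ToyData, toyY, and toyX are hypothetical names, not part of the course data) to show what each piece looks like after the split.

```r
# Made-up data frame: two predictors and a 0/1 response in the last column
ToyData = data.frame(V1 = c(1, 2, 0, 3),
                     V2 = c(100, 150, 90, 130),
                     V3 = c(0, 1, 0, 1))

p = ncol(ToyData)             # same value as dim(ToyData)[2]
toyY = ToyData[, p]           # last column: a plain numeric vector
toyX = ToyData[, 1:(p - 1)]   # remaining columns: still a data frame

is.vector(toyY)
is.data.frame(toyX)
```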
2. Fitting a Linear Model
To fit linear regression models in R, use lm. Models are specified symbolically: a typical model has the form response~predictors, where response is the (numeric) response vector and predictors is a series of predictor variables.
Take the full model and the base model (no predictors used) as examples:
fullModel = lm(responseY~predictorX[,1]+predictorX[,2]
+predictorX[,3]+predictorX[,4]+predictorX[,5]
+predictorX[,6]+predictorX[,7]+predictorX[,8])
baseModel = lm(responseY~1)
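A more compact way to write a full model is to pass a data frame through the data argument of lm and use . to stand for every column other than the response. The sketch below illustrates this on simulated data, not the diabetes data; the names simData, longForm, shortForm, and simBase are hypothetical. It also shows that an intercept-only base model estimates the mean of the response.

```r
set.seed(1)
# Simulated stand-in data: three predictors and a numeric response
simData = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
simData$y = 1 + 2 * simData$x1 - simData$x3 + rnorm(50)

# Spelling out each predictor, as in the text
longForm = lm(y ~ x1 + x2 + x3, data = simData)
# "." means every column of simData other than the response
shortForm = lm(y ~ ., data = simData)
# Base model: intercept only
simBase = lm(y ~ 1, data = simData)

all.equal(coef(longForm), coef(shortForm))          # same estimates
all.equal(unname(coef(simBase)), mean(simData$y))   # intercept equals the mean
```

For the diabetes data, the analogous call would be lm(V9~., data=RawData), assuming read.table assigned the default column names V1 to V9.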
For the full model, $coefficients holds the least squares estimate \(\hat{\beta}\) and $fitted.values holds the fitted values of the response variable.
fullModel$coefficients
fullModel$fitted.values
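To connect these two components to the least squares formula \(\hat{\beta} = (X^TX)^{-1}X^Ty\), the sketch below recomputes the coefficients and fitted values by hand and compares them with what lm returns. It runs on simulated data rather than the diabetes data, and the names fit, X, betaHat, and yHat are assumptions made for illustration.

```r
set.seed(2)
n = 40
x1 = rnorm(n)
x2 = rnorm(n)
y = 0.5 + 1.5 * x1 - 2 * x2 + rnorm(n)

fit = lm(y ~ x1 + x2)

# Design matrix with a leading column of ones for the intercept
X = cbind(1, x1, x2)
# Solve the normal equations (X'X) beta = X'y
betaHat = solve(t(X) %*% X, t(X) %*% y)
# Fitted values are the design matrix times the estimated coefficients
yHat = X %*% betaHat

all.equal(as.vector(betaHat), unname(fit$coefficients))
all.equal(as.vector(yHat), unname(fit$fitted.values))
```

Both comparisons should agree up to numerical tolerance. lm itself uses a QR decomposition rather than forming \(X^TX\) explicitly, which is numerically more stable; the normal equations are used here only because they mirror the formula.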
The results for the coefficients should be as follows:
  (Intercept) predictorX[, 1] predictorX[, 2] predictorX[, 3] predictorX[, 4] predictorX[, 5]
-0.8538942665    0.0205918715    0.0059202729   -0.0023318790    0.0001545198   -0.0001805345
predictorX[, 6] predictorX[, 7] predictorX[, 8]
   0.0132440315    0.1472374386    0.0026213938
The fitted values should start with 0.6517572852.