3.5 - R Scripts

1. Acquire Data

Diabetes data

The diabetes data set is the Pima Indians Diabetes Database, taken from the UCI Machine Learning collection on Kaggle.

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes: with or without signs of diabetes

Save the data into your working directory for this course as "diabetes.data". Then load the data into R as follows:

setwd("C:/STAT 897D data mining")
# comma-delimited data with no header row
RawData = read.table("diabetes.data", sep = ",", header = FALSE)
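
As a quick sanity check on the loading step, it can help to look at the dimensions and column types right after reading. The sketch below does not touch diabetes.data; it writes three made-up rows (not real observations) to a temporary file so that it runs anywhere. The names tmpFile and toyData are illustrative only. With the real file, dim(RawData) would be expected to report 768 rows and 9 columns (8 predictors plus the class label).

```r
# Toy stand-in for diabetes.data: 3 made-up rows, 8 predictors + 1 class label
tmpFile = tempfile(fileext = ".data")
writeLines(c("2,100,70,20,80,25.5,0.30,30,0",
             "5,150,80,30,0,32.1,0.55,45,1",
             "1,90,60,25,60,22.0,0.20,24,0"), tmpFile)

# Same call pattern as above: comma-delimited, no header row
toyData = read.table(tmpFile, sep = ",", header = FALSE)

dim(toyData)    # number of rows, then number of columns
str(toyData)    # column types; without a header the columns are named V1, V2, ...
```

If a column shows up as character rather than numeric in str(), the separator or header setting is usually the first thing to check.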

In RawData, the last column is the response variable and the remaining columns are the predictor variables.

responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]
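
The same split can also be written with ncol(), which some readers find easier to scan than dim(...)[2]. The following is a minimal sketch on a small made-up data frame; toyData, toyY and toyX are illustrative names, not part of the course data.

```r
# Made-up data frame: 3 predictor columns followed by a 0/1 response column
toyData = data.frame(V1 = c(1, 2, 3, 4),
                     V2 = c(10, 20, 30, 40),
                     V3 = c(0.5, 0.1, 0.9, 0.3),
                     V4 = c(0, 1, 0, 1))

p = ncol(toyData)              # same value as dim(toyData)[2]
toyY = toyData[, p]            # last column: response (a plain vector)
toyX = toyData[, 1:(p - 1)]    # remaining columns: predictors (still a data frame)

# Both forms select exactly the same columns
identical(toyY, toyData[, dim(toyData)[2]])
identical(toyX, toyData[, 1:(dim(toyData)[2] - 1)])
```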

2. Fitting a Linear Model

Linear regression models are fitted in R with lm, which takes a symbolic model specification. A typical model has the form response ~ predictors, where response is the (numeric) response vector and predictors is a series of predictor variables joined by +.

Take the full model and the base model (no predictors used) as examples:

fullModel = lm(responseY~predictorX[,1]+predictorX[,2]+predictorX[,3]+
               predictorX[,4]+predictorX[,5]+predictorX[,6]+
               predictorX[,7]+predictorX[,8])
baseModel = lm(responseY~1)
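
As an alternative way of writing a full model and a base model, lm also accepts a data argument, and the formula shorthand y ~ . stands for "all other columns of the data frame as predictors". The following is a minimal sketch on simulated data, not the course's own script; toyData, fullToy and baseToy are made-up names and the numbers are unrelated to the diabetes results.

```r
set.seed(1)                                  # reproducible simulated data
n = 50
toyData = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toyData$y = 1 + 2 * toyData$x1 - toyData$x2 + rnorm(n)

fullToy = lm(y ~ ., data = toyData)          # every other column is a predictor
baseToy = lm(y ~ 1, data = toyData)          # intercept only, no predictors

names(coef(fullToy))                         # "(Intercept)" "x1" "x2"
coef(baseToy)                                # the single intercept equals mean(toyData$y)
```

Writing the model this way keeps the coefficient names short (x1, x2) instead of repeating the indexing expression for each column.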

For the full model, $coefficients gives the least squares estimate \(\hat{\beta}\), and $fitted.values gives the fitted values of the response variable.
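
To make the link between these two components concrete: the least squares estimate solves the normal equations \(X^T X \hat{\beta} = X^T y\), and the fitted values are \(X \hat{\beta}\), where \(X\) includes a column of ones for the intercept. The sketch below checks both relations on simulated data. The names toyX, toyY, toyFit, X1 and betaManual are made up for illustration, and solving the normal equations directly is used here only as a check (lm itself does not compute the estimate this way).

```r
set.seed(2)                                   # reproducible simulated data
toyX = matrix(rnorm(40 * 3), ncol = 3)        # 40 observations, 3 predictors
toyY = as.vector(0.5 + toyX %*% c(1, -2, 0.3) + rnorm(40))

toyFit = lm(toyY ~ toyX)
betaHat = toyFit$coefficients                 # intercept first, then one slope per column

# Solve the normal equations by hand, adding a column of ones for the intercept
X1 = cbind(1, toyX)
betaManual = solve(t(X1) %*% X1, t(X1) %*% toyY)

# Compare lm's estimate with the manual solution, and the fitted values with X1 %*% betaHat
all.equal(unname(betaHat), as.vector(betaManual))
all.equal(unname(toyFit$fitted.values), as.vector(X1 %*% betaHat))
```

Both comparisons are expected to agree up to floating-point tolerance.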


The results for the coefficients should be as follows:

(Intercept)       -0.8538942665
predictorX[, 1]    0.0205918715
predictorX[, 2]    0.0059202729
predictorX[, 3]   -0.0023318790
predictorX[, 4]    0.0001545198
predictorX[, 5]   -0.0001805345
predictorX[, 6]    0.0132440315
predictorX[, 7]    0.1472374386
predictorX[, 8]    0.0026213938

The fitted values should start with 0.6517572852.