CD.1: Exploratory Data Analysis (EDA) and Data Pre-processing


There are 40 predictors in the data. Univariate descriptive statistics and the box-plots are shown below.

R code for Data Preparation and Exploratory Data Analysis


Descriptive statistics

[Figure: descriptive statistics for classification data]

Boxplots 

[Figures: boxplots]

From the univariate summaries above, it is clear that several of the 40 attributes contain outliers, in varying proportions. To retain as many rows as possible while eliminating the extreme outliers, a row was kept only if none of its 40 predictor values is an outlier, where an outlier is defined as a value outside the limits [Q1 - 3 IQR, Q3 + 3 IQR]. This eliminates 128 rows. A stricter rule defines an outlier as a point outside the limits [Q1 - 1.5 IQR, Q3 + 1.5 IQR]; in that case, 1060 rows would have been removed.
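The row-filtering rule described above can be sketched in a few lines. This is a minimal Python illustration, not the course's R code. It assumes the limits are computed once per predictor from the full data and that values exactly on a limit are kept; the source states neither detail. The function names and the toy data are hypothetical.

```python
from statistics import quantiles

def iqr_limits(values, k=3.0):
    """Return (Q1 - k*IQR, Q3 + k*IQR) for one predictor column."""
    # method="inclusive" uses the same interpolation as R's default quantile() (type 7).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outlier_rows(rows, k=3.0):
    """Keep only rows whose value lies inside the limits in every column."""
    columns = list(zip(*rows))                        # one tuple per predictor
    limits = [iqr_limits(col, k) for col in columns]  # limits from the full data
    return [
        row for row in rows
        if all(lo <= x <= hi for x, (lo, hi) in zip(row, limits))
    ]

# Hypothetical toy data with 2 predictors and 9 rows (not the course data set).
toy = [
    (1.0, 10.0), (2.0, 11.0), (3.0, 12.0), (4.0, 13.0), (5.0, 14.0),
    (6.0, 15.0), (7.0, 16.0), (8.0, 24.0), (9.0, 60.0),
]
kept_3iqr = drop_outlier_rows(toy, k=3.0)    # drops only the extreme row
kept_15iqr = drop_outlier_rows(toy, k=1.5)   # also drops the milder outlier
print(len(toy) - len(kept_3iqr), len(toy) - len(kept_15iqr))
```

As in the text, the 1.5 IQR rule removes more rows than the 3 IQR rule, because a single out-of-range value in any predictor is enough to discard the whole row.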

The correlation matrix shows a very high degree of dependency among the predictor variables. The cells highlighted in red all indicate very strong correlations, and all of them are positive.

[Figure: correlation matrix]
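One way to check for the pairwise dependency noted above is to compute the Pearson correlation for every pair of predictors and list the pairs above a cutoff. The sketch below is a minimal Python illustration, not the course's R code. The cutoff of 0.9 is an assumption, since the source does not say what threshold the red highlighting uses. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_correlations(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    columns maps a predictor name to its list of values.
    threshold=0.9 is an assumed cutoff, not one given in the source.
    """
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical toy data (not the course data set).
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8],   # roughly 2 * x1
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to x1
}
flagged = high_correlations(toy)
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

Checking the sign of each flagged coefficient shows whether the strong dependencies are all positive, as reported for the 40 predictors.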