WQD.1 - Exploratory Data Analysis (EDA) and Data Pre-processing

All variables are summarized below, and the univariate analysis of each is shown with plots.

  Sample R code for EDA

library(MASS)       # provides truehist()
attach(WhiteWine)   # assumes the data frame is named WhiteWine, as in the later code
# Bar plot of quality and histograms of three predictors in a 2 x 2 grid
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot(table(quality), col=c("slateblue4", "slategray", "slategray1", "slategray2", "slategray3", "skyblue4"))
mtext("Quality", side=1, outer=FALSE, line=2, cex=0.8)
truehist(fixed.acidity, h = 0.5, col="slategray3")
mtext("Fixed Acidity", side=1, outer=FALSE, line=2, cex=0.8)
truehist(volatile.acidity, h = 0.05, col="slategray3")
mtext("Volatile Acidity", side=1, outer=FALSE, line=2, cex=0.8)
truehist(citric.acid, h = 0.1, col="slategray3")
mtext("Citric Acid", side=1, outer=FALSE, line=2, cex=0.8)
# Boxplots of five predictors side by side
par(mfrow=c(1,5), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(fixed.acidity, col="slategray2", pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(volatile.acidity, col="slategray2", pch=19)
mtext("Volatile Acidity", cex=0.8, side=1, line=2)
boxplot(citric.acid, col="slategray2", pch=19)
mtext("Citric Acid", cex=0.8, side=1, line=2)
boxplot(residual.sugar, col="slategray2", pch=19)
mtext("Residual Sugar", cex=0.8, side=1, line=2)
boxplot(chlorides, col="slategray2", pch=19)
mtext("Chlorides", cex=0.8, side=1, line=2)

Histograms to show the distribution of the variable values:

[Figures: wine quality variable summary plots]

Boxplots for each of the variables as another indicator of spread.

[Figures: wine quality variable summary plots]

Observations regarding the variables: all variables have outliers.

  • Quality has most values concentrated in categories 5, 6 and 7. Only a small proportion falls in categories [3, 4] and [8, 9], and none in categories [1, 2] and 10.
  • Fixed acidity, volatile acidity and citric acid have outliers. If those outliers are eliminated, the distributions of these variables may be taken to be symmetric.
  • Residual sugar has a positively skewed distribution; even after eliminating the outliers, the distribution will remain skewed.
  • Some of the variables, e.g. free sulphur dioxide and density, have only a few outliers, but these lie far from the rest of the values.
  • Most outliers are on the high side.
  • Alcohol has an irregularly shaped distribution, but it does not have pronounced outliers.
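The claim that outliers sit mostly on the high side can be checked by counting, for each variable, the values beyond the usual 1.5 × IQR fences on either side. The following is a minimal sketch only: count_outliers is a hypothetical helper, the fences are computed from quantile() (boxplot() uses hinges, which can differ slightly), and the toy data frame is made up to stand in for the wine data.

```r
# Count values beyond the 1.5 * IQR fences on each side of a numeric vector.
# count_outliers is a hypothetical helper, not part of the original analysis.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  c(low  = sum(x < q[1] - 1.5 * iqr, na.rm = TRUE),
    high = sum(x > q[2] + 1.5 * iqr, na.rm = TRUE))
}

# Toy data standing in for the wine data (values are made up for illustration)
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50),
                  b = c(5, 5, 6, 6, 7, 7, 8, 8, 9, 9))

# One column per variable, rows "low" and "high"
outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Applied to the predictor columns of the wine data, a table dominated by the "high" row would be consistent with the observation above.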

  Sample R code for Summary Statistics & Correlations

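library(psych)   # loaded from the default library path
describe(WhiteWine)   # summary statistics; assumed use of psych, this call is not in the original listing
cor(WhiteWine[,-12], method="pearson")    # column 12 (quality) is excluded
cor(WhiteWine[,-12], method="spearman")
pairs(WhiteWine[,-12], gap=0, pch=19, cex=0.4, col="darkblue")
title(sub="Scatterplot of Chemical Attributes", cex.sub=0.8)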

The summary statistics in the following table also support these observations:

table of summary statistics

For most variables the range is much larger than the IQR, and the mean is usually greater than the median. Both observations indicate that the data set contains outliers, which must be dealt with before any analysis is performed.
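These two diagnostics are straightforward to compute per variable. The sketch below is illustrative only: spread_summary is a hypothetical helper and the toy values are made up; for the wine data it would be applied to the predictor columns.

```r
# Per-variable mean, median, range width and IQR.
# spread_summary is a hypothetical helper for illustration.
spread_summary <- function(df) {
  t(sapply(df, function(x) c(
    mean   = mean(x),
    median = median(x),
    range  = diff(range(x)),
    IQR    = IQR(x)
  )))
}

# Toy data (made-up values): "sugar" has one extreme value, "ph" has none
toy <- data.frame(sugar = c(1, 2, 2, 3, 3, 4, 5, 20),
                  ph    = c(3.0, 3.1, 3.1, 3.2, 3.2, 3.3, 3.3, 3.4))

s <- spread_summary(toy)
print(round(s, 2))

# Ratio of range to IQR; large values hint at extreme observations
range_to_iqr <- s[, "range"] / s[, "IQR"]
print(round(range_to_iqr, 2))
```

A mean well above the median together with a large range-to-IQR ratio points to a long upper tail.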

Next we look at the bivariate analysis, including all pairwise scatterplots and correlation coefficients. Since the variables have non-normal distributions, we have considered both Pearson and Spearman rank correlations.

Table: Pearson’s Correlation

Pearson's correlation table

Table: Spearman Rank Correlation

Spearman Rank Correlation table

Pearson’s correlations and the rank correlations are very close, hence only the former are considered. High correlations (at least 0.40 in absolute value) are identified and marked in red. Pairwise scatterplots are also shown below.
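The screening for high correlations can also be done programmatically rather than by reading the table. The following is a rough sketch under stated assumptions: high_cor_pairs is a hypothetical helper, the 0.40 cutoff is the threshold quoted in the text, and the toy data are made up. It also reports the largest gap between the Pearson and Spearman matrices as one way to judge how close the two are.

```r
# List variable pairs whose correlation is at least `cutoff` in absolute value.
# high_cor_pairs is a hypothetical helper for illustration.
high_cor_pairs <- function(df, cutoff = 0.40, method = "pearson") {
  r <- cor(df, method = method)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = r[idx],
             stringsAsFactors = FALSE)
}

# Toy data (made-up values): y is nearly a linear function of x
toy <- data.frame(x = 1:10,
                  y = 2 * (1:10) + rep(c(1, -1), 5),
                  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))

hc <- high_cor_pairs(toy)
print(hc)

# Largest absolute difference between Pearson and Spearman coefficients
max_gap <- max(abs(cor(toy) - cor(toy, method = "spearman")))
print(max_gap)
```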

Scatterplot of Predictors

scatterplot of predictors

  Sample R code for Preparing Data

# Upper outlier limit for each of the 11 predictors: Q3 + 1.5 * IQR
limout <- rep(0,11)
for (i in 1:11){
  t1 <- quantile(WhiteWine[,i], 0.75)
  t2 <- IQR(WhiteWine[,i])
  limout[i] <- t1 + 1.5*t2
}
# Flag every predictor value that exceeds its limit
WhiteWineIndex <- matrix(0, nrow(WhiteWine), 11)
for (i in 1:nrow(WhiteWine))
  for (j in 1:11){
    if (WhiteWine[i,j] > limout[j]) WhiteWineIndex[i,j] <- 1
  }
WWInd <- apply(WhiteWineIndex, 1, sum)   # number of outlying values per row
WhiteWineTemp <- cbind(WWInd, WhiteWine)
Indexes <- which(WWInd > 0)              # rows with at least one outlying value
WhiteWineLib <- WhiteWine[WWInd == 0,]   # Inside of Q3+1.5IQR
# Random 50/50 split into training and test data
indexes <- sample(1:nrow(WhiteWineLib), size=0.5*nrow(WhiteWineLib))
WWTrain50 <- WhiteWineLib[indexes,]
WWTest50 <- WhiteWineLib[-indexes,]

Data Preparation

Possibly the most important step in data preparation is to identify outliers. Since the data are multivariate, we retain only those observations for which no predictor value lies outside the limits constructed from the boxplots. The following rule is applied:

  • A predictor value is considered to be an outlier only if it is greater than Q3 + 1.5 × IQR.

The rationale behind this rule is that the extreme outliers are all at the higher end of the values and the distributions are all positively skewed. Application of this rule reduces the data size from 4898 to 4074.
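The same rule can be written without explicit loops. The following is a minimal sketch, not the code used for the reported results: remove_upper_outliers is a hypothetical helper and the toy data frame is made up. For the wine data, the predictors argument would be the first 11 columns of WhiteWine.

```r
# Keep only rows in which no predictor exceeds its Q3 + 1.5 * IQR limit.
# remove_upper_outliers is a hypothetical helper for illustration.
remove_upper_outliers <- function(df, predictors) {
  x <- as.matrix(df[predictors])
  # One upper limit per predictor column
  limits <- apply(x, 2, function(v) unname(quantile(v, 0.75)) + 1.5 * IQR(v))
  # Logical matrix: TRUE where a value is above its column's limit
  is_out <- sweep(x, 2, limits, FUN = ">")
  df[rowSums(is_out) == 0, , drop = FALSE]
}

# Toy data (made-up values): only the last row has an outlying value
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50),
                  b = c(5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
                  quality = c(5, 6, 5, 6, 7, 5, 6, 7, 6, 5))

clean <- remove_upper_outliers(toy, predictors = c("a", "b"))
print(nrow(clean))
```

Because the rule is one-sided, unusually low values are never removed; this matches the rationale given above but would need revisiting for variables with long lower tails.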

The remaining data are randomly divided into training data and test data of equal size (50% each).
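A random split is only repeatable if the random number generator is seeded first. The sketch below shows one way to do this; split_half is a hypothetical helper, the seed value is arbitrary, and nothing in the original text states that a seed was used.

```r
# Split a data frame into two random halves, reproducibly.
# split_half and the seed value are assumptions for illustration.
split_half <- function(df, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(0.5 * nrow(df)))
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

# Toy data (made-up values) with an odd number of rows
toy <- data.frame(id = 1:11, y = (1:11)^2)
parts <- split_half(toy, seed = 42)
print(nrow(parts$train))
print(nrow(parts$test))
```

With an odd number of rows the extra observation goes to the test set, so the two halves can differ in size by one.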