Analysis of Wine Quality Data

In the second example of data mining for knowledge discovery, we consider a set of observations on a number of red and white wine varieties, involving their chemical properties and rankings assigned by tasters. The wine industry has seen recent growth as social drinking rises. The price of wine depends on the rather abstract concept of wine appreciation by tasters, whose opinions can vary considerably, so pricing rests in part on this volatile factor. The other key input to wine certification and quality assessment is laboratory-based physicochemical testing, which measures properties such as acidity, pH, residual sugar, and other chemical characteristics. For the wine market, it would therefore be of interest to relate human taste ratings to the chemical properties of wine, so that the certification and quality assessment and assurance process becomes more controlled.

Two datasets are available: one on red wine, with 1599 different varieties, and one on white wine, with 4898 varieties. Only the white wine data are analyzed here. All wines were produced in a particular region of Portugal. Data were collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties, including density, acidity, and alcohol content. All chemical properties are continuous variables. Quality is an ordinal variable with a possible ranking from 1 (worst) to 10 (best). Each variety of wine was tasted by three independent tasters, and the final rank assigned is the median of their ranks.

 

Objective of the Analysis

Prediction of Quality ranking from the chemical properties of the wines

A predictive model developed on these data is expected to give vineyards guidance on the quality, and hence the price, to expect for their produce, without heavy reliance on volatile taster judgments.

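The data can be loaded directly in R. Below is a minimal sketch, assuming the copy of this dataset hosted in the UCI Machine Learning Repository (the file is semicolon-separated):

# Read the white wine data directly from the UCI repository (assumed URL)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
WhiteWine <- read.csv(url, sep = ";")
str(WhiteWine)   # 4898 observations on 12 variables; 'quality' is the response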

The following analytical approaches are taken:

  • Multiple regression: the response Quality is treated as a continuous variable and is predicted from the chemical predictors, all of which are continuous
  • Tree-based methods: classification trees and random forests
  • Classification of wines based on their chemical properties: nearest-neighbor classification

WQD.1 - Exploratory Data Analysis (EDA) and Data Pre-processing

All variables are summarized, and univariate analyses with plots are shown below.

  Sample R code for EDA

library(MASS)      # truehist()
attach(WhiteWine)  # so that column names can be used directly
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot((table(quality)), col=c("slateblue4", "slategray", "slategray1", "slategray2", "slategray3", "skyblue4"))
mtext("Quality", side=1, outer=F, line=2, cex=0.8)
truehist(fixed.acidity, h = 0.5, col="slategray3")
mtext("Fixed Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(volatile.acidity, h = 0.05, col="slategray3")
mtext("Volatile Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(citric.acid, h = 0.1, col="slategray3")
mtext("Citric Acid", side=1, outer=F, line=2, cex=0.8)
par(mfrow=c(1,5), oma = c(1,1,0,0) + 0.1,  mar = c(3,3,1,1) + 0.1)
boxplot(fixed.acidity, col="slategray2", pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(volatile.acidity, col="slategray2", pch=19)
mtext("Volatile Acidity", cex=0.8, side=1, line=2)
boxplot(citric.acid, col="slategray2", pch=19)
mtext("Citric Acid", cex=0.8, side=1, line=2)
boxplot(residual.sugar, col="slategray2", pch=19)
mtext("Residual Sugar", cex=0.8, side=1, line=2)
boxplot(chlorides, col="slategray2", pch=19)
mtext("Chlorides", cex=0.8, side=1, line=2)

Histograms to show the distribution of the variable values:

wine quality variable summary plots (histograms)

Boxplots for each of the variables as another indicator of spread.

wine quality variable summary plots (boxplots)

Observations regarding the variables (most have outliers):

  • Quality has most values concentrated in categories 5, 6, and 7. Only a small proportion falls in categories 3–4 and 8–9, and none in categories 1–2 or 10.
  • Fixed acidity, volatile acidity, and citric acid have outliers; if those outliers are eliminated, the distributions of these variables may be taken to be roughly symmetric.
  • Residual sugar has a positively skewed distribution; even after eliminating the outliers, the distribution will remain skewed.
  • Some variables, e.g., free sulfur dioxide and density, have only a few outliers, but these are far removed from the rest of the values.
  • Most outliers are on the high end.
  • Alcohol has an irregularly shaped distribution but no pronounced outliers.

  Sample R code for Summary Statistics & Correlations

summary(WhiteWine)                        # five-number summaries and means
library(psych)                            # describe()
describe(WhiteWine)                       # n, mean, sd, skew, kurtosis, etc.
cor(WhiteWine[,-12])                      # Pearson correlations (quality excluded)
cor(WhiteWine[,-12], method="spearman")   # Spearman rank correlations
pairs(WhiteWine[,-12], gap=0, pch=19, cex=0.4, col="darkblue")
title(sub="Scatterplot of Chemical Attributes", cex=0.8)

These observations are also supported by the summary statistics, shown in the following table:

table of summary statistics

The range is much larger than the IQR, and the mean is usually greater than the median. Both observations indicate that there are outliers in the data set, which must be dealt with before any analysis is performed.
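These comparisons can be computed directly. A minimal sketch, assuming WhiteWine is loaded as above:

# Compare mean vs. median and range vs. IQR for each chemical variable;
# mean >> median and range >> IQR both point to high-end outliers
spread <- sapply(WhiteWine[, 1:11], function(x)
  c(mean = mean(x), median = median(x), IQR = IQR(x), range = diff(range(x))))
round(t(spread), 2)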

Next we look at the bivariate analysis, including all pairwise scatterplots and correlation coefficients. Since the variables have non-normal distributions, we have considered both Pearson and Spearman rank correlations.

Table: Pearson’s Correlation

Pearson's correlation table

Table: Spearman Rank Correlation

Spearman Rank Correlation table

Pearson’s correlations and the rank correlations are very close, hence only the former are considered. High correlations (absolute value ≥ 0.4) are identified and marked in red. Pairwise scatterplots are also shown below.
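A short sketch of how such pairs can be flagged programmatically, using the correlation matrix computed above:

# List predictor pairs with |r| >= 0.4, mirroring the cells marked in red
r <- cor(WhiteWine[, -12])
high <- which(abs(r) >= 0.4 & upper.tri(r), arr.ind = TRUE)
data.frame(var1 = rownames(r)[high[, 1]],
           var2 = colnames(r)[high[, 2]],
           r    = round(r[high], 2))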

Scatterplot of Predictors

scatterplot of predictors

  Sample R code for Preparing Data

# Upper outlier cutoff Q3 + 1.5*IQR for each of the 11 predictors
limout <- rep(0, 11)
for (i in 1:11){
  t1 <- quantile(WhiteWine[,i], 0.75)
  t2 <- IQR(WhiteWine[,i])
  limout[i] <- t1 + 1.5*t2
}
# Flag each cell that exceeds its column's cutoff
WhiteWineIndex <- matrix(0, 4898, 11)
for (i in 1:4898)
for (j in 1:11){
if (WhiteWine[i,j] > limout[j]) WhiteWineIndex[i,j] <- 1
}
WWInd <- apply(WhiteWineIndex, 1, sum)  # number of outlying predictors per wine
Indexes <- which(WWInd > 0)             # rows with at least one outlier
WhiteWineLib <- WhiteWine[-Indexes,]    # inside Q3 + 1.5*IQR on every predictor
# 50/50 random split into training and test sets (use set.seed() for reproducibility)
indexes <- sample(1:nrow(WhiteWineLib), size=0.5*nrow(WhiteWineLib))
WWTrain50 <- WhiteWineLib[indexes,]
WWTest50 <- WhiteWineLib[-indexes,]

Data Preparation

Possibly the most important step in data preparation is to identify outliers. Since the data are multivariate, we retain only those points for which no predictor variable falls outside the limits constructed from the boxplots. The following rule is applied:

  • A predictor value is considered to be an outlier only if it is greater than Q3 + 1.5IQR

The rationale behind this rule is that the extreme outliers are all on the higher end of the values and the distributions are all positively skewed. Applying this rule reduces the data size from 4898 to 4074.
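The loop-based sample code above implements this rule; an equivalent, more compact sketch:

# Upper cutoffs Q3 + 1.5*IQR, one per predictor
upper <- sapply(WhiteWine[, 1:11], function(x) quantile(x, 0.75) + 1.5 * IQR(x))
# Keep only wines with no predictor above its cutoff
keep <- !apply(mapply(`>`, WhiteWine[, 1:11], upper), 1, any)
WhiteWineLib <- WhiteWine[keep, ]
nrow(WhiteWineLib)   # 4074 wines remain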

The data are randomly divided into Training and Test sets of equal size (50% each).


WQD.2 - Multiple Regression

  Sample R code for Multiple Regression

library(car)    # vif(), qqPlot(), residualPlots(), influencePlot()
library(MASS)   # truehist(), boxcox(), studres()
Qfit1 <- lm(quality ~ ., data=WWTrain50)   # Model I: all predictors
summary(Qfit1)
vif(Qfit1)
Qfit3 <- step(lm(quality ~ 1, WWTrain50), scope=list(lower=~1,  upper = ~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol), direction="forward")
Qfit4 <- lm(quality ~ alcohol + volatile.acidity + residual.sugar + free.sulfur.dioxide +
sulphates + chlorides + pH , data=WWTrain50)
summary(Qfit4)
vif(Qfit4)
# rt.sulfur.dioxide is assumed to be a transformed (e.g., square-root) version of
# total.sulfur.dioxide created beforehand
Qfit5 <- lm(quality ~ alcohol + volatile.acidity + residual.sugar + rt.sulfur.dioxide +
sulphates + chlorides + pH , data=WWTrain50)
summary(Qfit5)
vif(Qfit5)
par(mfrow=c(1,2), oma = c(3,2,3,2) + 0.1, mar = c(1,1,1,1) + 0.1)
truehist(residuals(Qfit4), h = 0.25, col="slategray3")
qqPlot(residuals(Qfit4), pch=19, col="darkblue", cex=0.6)
mtext("Distribution of Residuals", outer=T, side=1, line = 2)
par(mfrow=c(1,1))
pred.val <- round(fitted(Qfit4))
plot(pred.val, residuals(Qfit4))
ts.plot(residuals(Qfit4))
residualPlots(Qfit4, pch=19, col="blue", cex=0.6)
influencePlot(Qfit4,  id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance" )
boxcox(lm(quality ~ alcohol, data=WWTrain50), lambda=seq(-0.2, 1.0, len=20), plotit=TRUE)
std.del.res<-studres(Qfit4)
truehist(std.del.res, h = 0.25, col="slategray3")
mtext("Histigram of Studentized Deleted Residuals", side=1, line=2, cex=0.8)
d.fit <- dffits(Qfit4)
truehist(std.del.res, h = 0.25, col="slategray3")
truehist(d.fit, h = 0.25, col="slategray3")
mtext("Histigram of Studentized Deleted Residuals", side=1, line=2, cex=0.8)
cook.d <- cooks.distance(Qfit4)
ts.plot(cook.d)
par(mfrow=c(1,2), oma = c(3,2,3,2) + 0.1, mar = c(1,1,1,1) + 0.1)
truehist(std.del.res, h = 0.55, col="slategray3")
mtext("Studentized Deleted Residuals", side=1, line=2, cex=0.8)
truehist(d.fit, h = 0.05, col="slategray3")
mtext("DFITS", side=1, line=2, cex=0.8)
par(mfrow=c(1,1), oma = c(3,2,3,2) + 0.1, mar = c(1,1,1,1) + 0.1)
ts.plot(cook.d, col="dark blue")

Linear regression is fitted to the Training data.

Model I: All predictors in the model

regression coefficients table

Because of its extremely high VIF, density was removed from the model. Other predictors also have high VIFs, but they were not removed at this step.

Model II: After removal of density, the VIFs improved

regression coefficients table

Not all predictors are significant. A forward selection method is employed to build a working model. The sample R output follows:

R output (forward selection steps)

Model III: Working model

regression coefficients table

Sample R output:

R output

Note that the multiple R² is only 25%. Regression diagnostics are examined for possible improvement of the model.

distribution of residuals

The residuals have an approximately symmetric distribution, but there appear to be outliers at both ends. Partial residual plots are given below. Note the pattern in the plot against fitted values: it arises because the response takes only integer values but has been treated as continuous.

fitted values plot

Outliers and leverage points are identified through the following:

  • Studentized deleted residuals (a point is an outlier if its residual falls outside the [-3, 3] limits)
  • DFITS (a point is an outlier if its DFITS value falls outside the [-1, 1] limits)
  • Cook’s distance

All three plots are given below. Note that no point is identified as an outlier by its DFITS value.

studentized residuals and DFITS plots

Cook's distance plot

Only 26 points are identified as outliers according to the above criteria. A final model is fitted after eliminating these points, and a slight improvement in the R² value is noted.
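A minimal sketch of applying the first two cutoffs and removing the flagged points (studres() is from MASS, dffits() from base R; WWTrain50In is the cleaned training set used in the final-model code below):

# Flag points with |studentized deleted residual| > 3 or |DFFITS| > 1
out <- abs(studres(Qfit4)) > 3 | abs(dffits(Qfit4)) > 1
sum(out)                          # number of flagged points (26 per the text)
WWTrain50In <- WWTrain50[!out, ]  # training data with these points removed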

Model IV: Final model

 regression coefficients table

Sample R output:

R output

Applying this model to the test data gives a sum of squared differences between the actual and predicted responses of 1196.205, whereas the sum of squared deviations of the actual response (about its mean) is 1554.754. The ratio of these two may be taken as the ratio of the error sum of squares to the total sum of squares, so a measure similar to R² may be computed as 1 - 1196.205/1554.754 = 0.2306.
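This test-set measure can be computed as follows; a sketch assuming the fitted model and the WWTest50 split from above (the model object name is illustrative):

# Pseudo-R^2 on the test data: 1 - SSE/SST
pred <- predict(Qfit4, newdata = WWTest50)
SSE  <- sum((WWTest50$quality - pred)^2)                    # error sum of squares
SST  <- sum((WWTest50$quality - mean(WWTest50$quality))^2)  # total sum of squares
1 - SSE / SST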

  Sample R code for Final Model

# WWTrain50In: the training data after removing the 26 flagged points (see above)
Qfit6 <- lm(quality ~ poly(alcohol,2) + poly(volatile.acidity,2) + residual.sugar + poly(free.sulfur.dioxide,2) + chlorides + sulphates + poly(pH,2), data=WWTrain50In)
summary(Qfit6)
residualPlots(Qfit6, pch=19, col="blue", cex=0.6)


WQD.3 - Application of Polynomial Regression

To investigate whether a polynomial relationship fits the data better, an alternative model with squared terms of the significant variables is tried; this improves the R² value to 31%.

regression coefficients table

Sample R output:

R output

Applying this model to the test data gives a sum of squared differences between the actual and predicted responses of 1139.41, whereas the sum of squared deviations of the actual response is 1554.754. The ratio of these two may be taken as the ratio of the error sum of squares to the total sum of squares, so a measure similar to R² may be computed as 1 - 1139.41/1554.754 = 0.2671.


WQD.4 - Applying Tree-Based Methods

  Sample R code for Tree-based Models and Random Forest

library(tree)          # tree()
library(car)           # recode()
library(randomForest)  # randomForest()
FactQ <- as.factor(WhiteWineLib$quality)   # quality from the outlier-free data
WhiteWineLib <- cbind(WhiteWineLib, FactQ)
# Collapse quality into three levels, {3,4,5}, {6}, {7,8,9}, via intermediate
# codes (10/20/40) so the new labels do not clash with existing ones
temp <- recode(WhiteWineLib$FactQ, "c('3','4','5')='10'; c('6')='20'; else='40'")
Ptemp <- recode(temp, "c('10')='5'; ('20')='6'; else='7'")
WhiteWineLib$FactQ <- Ptemp
prop.table(table(WhiteWineLib$FactQ))
# Classification tree (FactQ is a factor, so tree() fits a classification tree)
WhiteWineTree <- tree(FactQ ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol+density, data=WhiteWineLib)
plot(WhiteWineTree)
text(WhiteWineTree, pretty=0, cex=0.6)
misclass.tree(WhiteWineTree, detail=T)
Treefit1 <- predict(WhiteWineTree, WhiteWineLib, type="class")
table(Treefit1, WhiteWineLib$FactQ)
# WWTrain50T: training split with the three-level factor appended (assumed built
# beforehand); column 12, the raw quality score, is dropped
WWrf50_super <- randomForest(FactQ ~ . , data=WWTrain50T[,-12], ntree=150, importance=T, proximity=T)
WWTest50_rf_pred_super <- predict(WWrf50_super, WWTest50, type="class")
table(WWTest50_rf_pred_super, WWTest50$FactQ1)   # FactQ1: test set's three-level factor
plot(WWrf50_super, main="")
varImpPlot(WWrf50_super,  main="", cex=0.8)

The response variable Quality is treated as an ordinal variable, not a continuous one. As noted earlier, the proportions in the very low (4 or less) and very high (8 or above) categories are small.

category classification table

Hence the wines are classified into three categories: 3, 4, and 5 are combined into one category (Low), 6 forms its own category (Medium), and 7, 8, and 9 are combined into a third (High).
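The recode() calls in the sample code above implement this mapping; an equivalent, more direct sketch using cut() (the labels here are illustrative):

# (2,5] -> Low, (5,6] -> Medium, (6,9] -> High
WhiteWineLib$FactQ <- cut(WhiteWineLib$quality,
                          breaks = c(2, 5, 6, 9),
                          labels = c("Low", "Medium", "High"))
prop.table(table(WhiteWineLib$FactQ))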

The following classification tree is obtained:

R output

tree-based analysis plot

 

Applying the procedure to the Test data, the following misclassification table is obtained (rows are predicted classes, columns are actual classes, following table(predicted, actual)):

  Quality Classification
              Low   Medium   High
  Low         371      277     38
  Medium      214      495    251
  High         19      167    205

  Accuracy: (371 + 495 + 205) / 2037 = 52.6%

WQD.5 - Random Forest

A random forest classifier grown on the Training data with ntree = 150 leads to the following error plot:

error plot

The importance of the predictors is shown in the following dotplot:

dotplots of results

Accuracy improves from 52.6% to 67.7%.

  Quality Classification
              Low   Medium   High
  Low         378      277     17
  Medium      219      726    201
  High          7       83    276

  Accuracy: (378 + 726 + 276) / 2037 = 67.7%

WQD.6 - Classification

  Sample R code for Classification

library(class)   # knn()
# WWTrain50S / WWTest50S: scaled predictor matrices; WWTrain50SC is assumed to
# hold the three-level class labels for the training rows
nn.result <- knn(WWTrain50S, WWTest50S, cl=WWTrain50SC$FactQ, k=5)

A nearest-neighbor classifier is used with the three quality levels (Low, Medium, High). With all predictors included, k = 5 turned out to give the lowest test-data misclassification rate; a sketch of this search is given below, followed by the resulting misclassification table.
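A sketch of how k might be chosen, assuming the scaled predictor matrices from the sample code and illustrative label vectors trainLab and testLab:

library(class)
# Misclassification rate on the test set for a range of k
err <- sapply(1:15, function(k) {
  pred <- knn(WWTrain50S, WWTest50S, cl = trainLab, k = k)
  mean(pred != testLab)
})
which.min(err)   # k = 5 gave the lowest test error here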

  Quality Classification
              Low   Medium   High
  Low         293      185     46
  Medium      267      595    215
  High         44      159    233

  Accuracy: (293 + 595 + 233) / 2037 = 55%

WQD.7 - Conclusion

Wine quality does not appear to be well explained by the chemical properties of the wines: at each quality level the variability of the predictors is high, and the quality groups are not well separated.

Food for thought!

Think of some visualization techniques that may help bring out these features of the data.
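One possible starting point (a sketch, not a definitive answer): a parallel-coordinate plot of the scaled predictors coloured by the three quality levels, assuming WhiteWineLib and the three-level FactQ from above.

library(MASS)   # parcoord()
# Heavy overlap among the three colours would echo the poor group separation
cols <- c("steelblue", "grey60", "tomato")[as.integer(WhiteWineLib$FactQ)]
parcoord(scale(WhiteWineLib[, 1:11]), col = cols, lwd = 0.3)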

