# Analysis of Wine Quality Data

In the second example of data mining for knowledge discovery, we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their rankings by tasters. The wine industry has shown recent growth as social drinking is on the rise, yet the price of wine depends on a rather abstract concept of wine appreciation by tasters, whose opinions may vary considerably. Pricing of wine therefore depends, to some extent, on this volatile factor. The other key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, the presence of sugar and other chemical properties. It would be of interest to the wine market if human taste ratings could be related to the chemical properties of wine, so that the certification, quality assessment and assurance process could be more controlled.

Two datasets are available: one on red wine, with 1599 different varieties, and the other on white wine, with 4898 varieties. Only the white wine data is analyzed here. All wines were produced in a particular area of Portugal. Data were collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties, including density, acidity, alcohol content etc. All chemical properties are continuous variables. Quality is an ordinal variable with a possible ranking from 1 (worst) to 10 (best). Each variety of wine was tasted by three independent tasters, and the final rank assigned is the median of the ranks given by the tasters.

### Objective of the Analysis

Prediction of Quality ranking from the chemical properties of the wines

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on the volatility of wine tasters.

**Data Files** for this case (*right-click and "save as"*):

- Wine data - Wine_data.csv
- Training dataset - Training50_winedata.csv
- Test dataset - Test50_winedata.csv

The following analytical approaches are taken:

- Multiple regression: The response Quality is assumed to be a continuous variable and is predicted by the independent predictors, all of which are continuous
- Regression Tree
- Classification of wines based on the chemical properties: Unsupervised analysis

# WQD.1 - Exploratory Data Analysis (EDA) and Data Pre-processing

All variables are summarized, and univariate analyses with plots are shown below.

Sample R code for EDA

```
library(MASS)   # truehist() comes from the MASS package
attach(WhiteWine)
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot(table(quality), col=c("slateblue4", "slategray", "slategray1", "slategray2", "slategray3", "skyblue4"))
mtext("Quality", side=1, outer=F, line=2, cex=0.8)
truehist(fixed.acidity, h = 0.5, col="slategray3")
mtext("Fixed Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(volatile.acidity, h = 0.05, col="slategray3")
mtext("Volatile Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(citric.acid, h = 0.1, col="slategray3")
mtext("Citric Acid", side=1, outer=F, line=2, cex=0.8)
par(mfrow=c(1,5), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(fixed.acidity, col="slategray2", pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(volatile.acidity, col="slategray2", pch=19)
mtext("Volatile Acidity", cex=0.8, side=1, line=2)
boxplot(citric.acid, col="slategray2", pch=19)
mtext("Citric Acid", cex=0.8, side=1, line=2)
boxplot(residual.sugar, col="slategray2", pch=19)
mtext("Residual Sugar", cex=0.8, side=1, line=2)
boxplot(chlorides, col="slategray2", pch=19)
mtext("Chlorides", cex=0.8, side=1, line=2)
```

Histograms to show the distribution of the variable values:

Boxplots for each of the variables as another indicator of spread.

Observations regarding the variables (all variables have outliers):

- Quality has most of its values concentrated in the categories 5, 6 and 7. Only a small proportion falls in the categories [3, 4] and [8, 9], and none in the categories [1, 2] and 10.
- Fixed acidity, volatile acidity and citric acid have outliers. If those outliers are eliminated, the distributions of these variables may be taken to be symmetric.
- Residual sugar has a positively skewed distribution; even after eliminating outliers, the distribution will remain skewed.
- Some of the variables, e.g. free sulfur dioxide and density, have a few outliers, but these are very different from the rest of the values.
- Most outliers are on the larger side.
- Alcohol has an irregularly shaped distribution, but it does not have pronounced outliers.

Sample R code for Summary Statistics & Correlations

```
summary(WhiteWine)
library(psych)   # describe()
describe(WhiteWine)
cor(WhiteWine[,-12])
cor(WhiteWine[,-12], method="spearman")
pairs(WhiteWine[,-12], gap=0, pch=19, cex=0.4, col="darkblue")
title(sub="Scatterplot of Chemical Attributes", cex=0.8)
```

These observations are also supported by the summary statistics, as shown in the following table:

The range is much larger than the IQR, and the mean is usually greater than the median. These observations indicate that there are outliers in the data set, and they must be dealt with before any analysis is performed.

Next we look at the bivariate analysis, including all pairwise scatterplots and correlation coefficients. Since the variables have non-normal distributions, we consider both Pearson and Spearman rank correlations.

#### Table: Pearson’s Correlation

#### Table: Spearman Rank Correlation

Pearson’s correlations and the rank correlations are very close; hence only the former are considered. High correlations (≥ 0.40 in absolute value) are identified and marked in red. Pairwise scatterplots are also shown below.
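Spearman's rank correlation is simply Pearson's correlation computed on the ranks of the data, which is why the two agree closely when the relationships are roughly monotone. A minimal illustration in Python (pure standard library; tied ranks are ignored for simplicity, so this is a sketch rather than a production implementation):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

def ranks(x):
    """Rank values 1..n (ties broken by position; fine for illustration)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For a perfectly monotone but nonlinear relationship (e.g. y = x³), Spearman gives exactly 1 while Pearson falls below 1.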

#### Scatterplot of Predictors

Sample R code for Preparing Data

```
limout <- rep(0, 11)
for (i in 1:11) {
  t1 <- quantile(WhiteWine[, i], 0.75)   # third quartile, Q3
  t2 <- IQR(WhiteWine[, i])              # inter-quartile range
  limout[i] <- t1 + 1.5 * t2             # upper fence: Q3 + 1.5*IQR
}
# Flag every observation that exceeds the upper fence on any predictor
WhiteWineIndex <- matrix(0, 4898, 11)
for (i in 1:4898)
  for (j in 1:11) {
    if (WhiteWine[i, j] > limout[j]) WhiteWineIndex[i, j] <- 1
  }
WWInd <- apply(WhiteWineIndex, 1, sum)
Indexes <- which(WWInd > 0)              # rows with at least one outlying value
WhiteWineLib <- WhiteWine[-Indexes, ]    # keep rows inside Q3 + 1.5*IQR
# Random 50/50 split into training and test sets
indexes <- sample(1:nrow(WhiteWineLib), size = 0.5 * nrow(WhiteWineLib))
WWTrain50 <- WhiteWineLib[indexes, ]
WWTest50 <- WhiteWineLib[-indexes, ]
```

### Data Preparation

Possibly the most important step in data preparation is identifying outliers. Since this is multivariate data, we retain only those points for which no predictor value falls outside the limits constructed from the boxplots. The following rule is applied:

- A predictor value is considered an outlier only if it is greater than Q3 + 1.5 IQR

The rationale behind this rule is that the extreme outliers are all on the higher end of the values and the distributions are all positively skewed. Applying this rule reduces the data size from 4898 to 4074.

Data is randomly divided into Training data and Test Data of equal sizes (50% each).

# WQD.2 - Multiple Regression

Sample R code for Multiple Regression

```
library(car)    # vif(), qqPlot(), residualPlots(), influencePlot()
library(MASS)   # truehist(), boxcox(), studres()
Qfit1 <- lm(quality ~ ., data=WWTrain50)
summary(Qfit1)
vif(Qfit1)
# Forward selection from the intercept-only model (density excluded)
Qfit3 <- step(lm(quality ~ 1, WWTrain50), scope=list(lower=~1, upper = ~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol), direction="forward")
Qfit4 <- lm(quality ~ alcohol + volatile.acidity + residual.sugar + free.sulfur.dioxide +
    sulphates + chlorides + pH, data=WWTrain50)
summary(Qfit4)
vif(Qfit4)
# Qfit5 swaps in rt.sulfur.dioxide (a transformed sulfur dioxide variable assumed defined earlier)
Qfit5 <- lm(quality ~ alcohol + volatile.acidity + residual.sugar + rt.sulfur.dioxide +
    sulphates + chlorides + pH, data=WWTrain50)
summary(Qfit5)
vif(Qfit5)
# Residual diagnostics for the working model
par(mfrow=c(1,2), oma = c(3,2,3,2) + 0.1, mar = c(1,1,1,1) + 0.1)
truehist(residuals(Qfit4), h = 0.25, col="slategray3")
qqPlot(residuals(Qfit4), pch=19, col="darkblue", cex=0.6)
mtext("Distribution of Residuals", outer=T, side=1, line = 2)
par(mfrow=c(1,1))
pred.val <- round(fitted(Qfit4))
plot(pred.val, residuals(Qfit4))
ts.plot(residuals(Qfit4))
residualPlots(Qfit4, pch=19, col="blue", cex=0.6)
influencePlot(Qfit4, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
boxcox(lm(quality ~ alcohol, data=WWTrain50), lambda=seq(-0.2, 1.0, len=20), plotit=T)
# Outlier and influence measures
std.del.res <- studres(Qfit4)    # studentized deleted residuals
d.fit <- dffits(Qfit4)           # DFITS
cook.d <- cooks.distance(Qfit4)  # Cook's distance
par(mfrow=c(1,2), oma = c(3,2,3,2) + 0.1, mar = c(1,1,1,1) + 0.1)
truehist(std.del.res, h = 0.25, col="slategray3")
mtext("Studentized Deleted Residuals", side=1, line=2, cex=0.8)
truehist(d.fit, h = 0.05, col="slategray3")
mtext("DFITS", side=1, line=2, cex=0.8)
par(mfrow=c(1,1), oma = c(3,2,3,2) + 0.1, mar = c(1,1,1,1) + 0.1)
ts.plot(cook.d, col="darkblue")
```

Linear regression is fitted to the Training data.

### Model I: All predictors in the model

Because of its extremely high VIF, density was removed from the model. There are other predictors with high VIFs, but they were not removed at this step.

### Model II: After removal of density (VIFs improved)

Not all predictors are significant. A forward selection method is employed to build a working model. The sample R output follows:

### Model III: Working model

Sample R output:

Note that the multiple *R*^{2} is 25%. Regression diagnostics are examined for possible improvement of the model.

Residuals have an approximately symmetric distribution, but there seem to be outliers at both ends. Partial residual plots are given below. Note the pattern in the fitted-value plot: since the response actually takes only integer values but has been assumed continuous, such a pattern arises.

Outliers and leverage points are identified through the following:

- Studentized deleted residuals (a point is an outlier if its residual falls outside the [-3, 3] limits)
- DFITS (a point is an outlier if its value falls outside the [-1, 1] limits)
- Cook’s distance

All three plots are given below. Note that no point is identified as an outlier by its DFITS value.

Only 26 points are identified as outliers according to the above criteria. A final model is fit after eliminating these points, and a slight improvement in the *R*^{2} value is noted.
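The combined screening rule flags any observation that violates either threshold. A small sketch of the logic (the thresholds are those stated above; the function name and data are purely illustrative, not from the wine data):

```python
def flag_outliers(stud_del_res, dffits_vals, res_limit=3.0, dffits_limit=1.0):
    """Return indices of observations flagged by either rule:
    |studentized deleted residual| > res_limit, or |DFITS| > dffits_limit."""
    flagged = set()
    for i, r in enumerate(stud_del_res):
        if abs(r) > res_limit:
            flagged.add(i)
    for i, d in enumerate(dffits_vals):
        if abs(d) > dffits_limit:
            flagged.add(i)
    return sorted(flagged)
```

Using a union of the two criteria matches the text: a point is dropped if *any* rule flags it.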

### Model IV: Final model

Sample R output:

Applying this model to the test data gives a sum of squared differences between the actual and predicted responses of 1196.205, whereas the sum of squared deviations of the actual response about its mean is 1554.754. The ratio of these two may be taken as the ratio of the error sum of squares to the total sum of squares. Hence a measure similar to *R*^{2} may be computed as 1 – 1196.205/1554.754 = 0.2306.
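This out-of-sample measure is just one minus the ratio of the error sum of squares to the total sum of squares; a minimal sketch in Python (function name illustrative):

```python
from statistics import mean

def pseudo_r2(actual, predicted):
    """R²-like measure for out-of-sample predictions: 1 - SSE/SST."""
    ybar = mean(actual)
    sse = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    sst = sum((y - ybar) ** 2 for y in actual)
    return 1 - sse / sst
```

With the figures quoted above, 1 – 1196.205/1554.754 ≈ 0.2306. A perfect predictor gives 1; predicting the mean of the actual responses gives 0.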

Sample R code for Final Model

```
Qfit6 <- lm(quality ~ poly(alcohol,2) + poly(volatile.acidity,2) + residual.sugar + poly(free.sulfur.dioxide,2) + chlorides + sulphates + poly(pH,2), data=WWTrain50In)
summary(Qfit6)
residualPlots(Qfit6, pch=19, col="blue", cex=0.6)
```

# WQD.3 - Application of Polynomial Regression

In order to investigate whether a polynomial relationship fits the data better, an alternative model with squared terms of the significant variables is tried; this improves the *R*^{2} value to 31%.

Sample R output:

Applying this model to the test data gives a sum of squared differences between the actual and predicted responses of 1139.41, whereas the sum of squared deviations of the actual response about its mean is 1554.754. Hence a measure similar to *R*^{2} may be computed as 1 – 1139.41/1554.754 = 0.2671.

# WQD.4 - Applying Tree-Based Methods

Sample R code for Tree-based Models and Random Forest

```
library(car)            # recode()
library(tree)           # tree()
library(randomForest)   # randomForest()
FactQ <- as.factor(quality)
WhiteWineLib <- cbind(WhiteWineLib, FactQ)
# Collapse quality into three classes: {3,4,5} -> Low, {6} -> Medium, {7,8,9} -> High
temp <- recode(WhiteWineLib$FactQ, "c('3','4','5')='10'; c('6')='20'; else='40'")
Ptemp <- recode(temp, "c('10')='5'; ('20')='6'; else='7'")
WhiteWineLib$FactQ <- Ptemp
prop.table(table(WhiteWineLib$FactQ))
# FactQ is a factor, so tree() grows a classification tree
WhiteWineTree <- tree(FactQ ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol+density, data=WhiteWineLib)
plot(WhiteWineTree)
text(WhiteWineTree, pretty=0, cex=0.6)
misclass.tree(WhiteWineTree, detail=T)
Treefit1 <- predict(WhiteWineTree, WhiteWineLib, type="class")
table(Treefit1, WhiteWineLib$FactQ)
WWrf50_super <- randomForest(FactQ ~ . , data=WWTrain50T[,-12], ntree=150, importance=T, proximity=T)
WWTest50_rf_pred_super <- predict(WWrf50_super, WWTest50, type="class")
table(WWTest50_rf_pred_super, WWTest50$FactQ1)
plot(WWrf50_super, main="")
varImpPlot(WWrf50_super, main="", cex=0.8)
```

The response variable quality is taken to be an ordinal variable rather than a continuous one. It has been noted before that the proportions in the very low (4 or less) and very high (8 or above) categories are small.

Hence wines are classified into three categories: 3, 4 and 5 are combined into one category (Low), 6 forms its own category (Medium), and 7, 8 and 9 are combined into another (High).
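The recoding rule can be expressed directly. A minimal sketch in Python (the function name is illustrative; the R code above achieves the same with `recode()` from the car package):

```python
def quality_group(q):
    """Collapse the observed quality scores (3-9) into three classes."""
    if q <= 5:
        return "Low"     # scores 3, 4, 5
    if q == 6:
        return "Medium"  # score 6
    return "High"        # scores 7, 8, 9
```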

The following classification tree is obtained:

Applying the procedure on Test data, the following mis-classification table is obtained:

**Quality Classification**

| Test Data | Low | Medium | High |
|-----------|-----|--------|------|
| Low | 371 | 277 | 38 |
| Medium | 214 | 495 | 251 |
| High | 19 | 167 | 205 |
| **Accuracy** | (371 + 495 + 205) / 2037 ≈ 53% | | |
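Overall accuracy is the sum of the diagonal entries of the misclassification table divided by the grand total; a small Python check (variable names illustrative):

```python
def accuracy(confusion):
    """Overall accuracy = trace / grand total of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# The tree's misclassification table: rows and columns are Low/Medium/High
tree_table = [[371, 277, 38],
              [214, 495, 251],
              [19, 167, 205]]
```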

# WQD.5 - Random Forest

A random forest fit to the Training data with *ntree* = 150 leads to the following error plot:

The importance of the predictors is shown in the following dotplot:

Accuracy improves from about 53% for the single tree to 67.7%.

**Quality Classification**

| Test Data | Low | Medium | High |
|-----------|-----|--------|------|
| Low | 378 | 277 | 17 |
| Medium | 219 | 726 | 201 |
| High | 7 | 83 | 276 |
| **Accuracy** | (378 + 726 + 276) / 2037 = 67.7% | | |

# WQD.6 - Classification

Sample R code for Classification

```
library(class)   # knn()
# Predict classes for the test set from the scaled training predictors
# (WWTrain50S/WWTest50S: scaled predictor matrices; WWTrain50SC holds the class labels)
nn.result <- knn(train=WWTrain50S, test=WWTest50S, cl=WWTrain50SC$FactQ, k=5)
```

A nearest-neighbor classifier is used with the three quality levels (Low, Medium, High). It turns out that *k* = 5 gives the lowest test-data misclassification rate when all predictors are used.
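Conceptually, k-nearest-neighbour classification assigns each test point the majority class among its *k* closest training points. A self-contained sketch in Python (Euclidean distance, no tie handling, purely illustrative data):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Classify x by majority vote among the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Because votes are counted over raw distances, predictors on large scales dominate; this is why the wine predictors are standardized before running `knn` in the R code above.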

**Quality Classification**

| Test Data | Low | Medium | High |
|-----------|-----|--------|------|
| Low | 293 | 185 | 46 |
| Medium | 267 | 595 | 215 |
| High | 44 | 159 | 233 |
| **Accuracy** | (293 + 595 + 233) / 2037 = 55% | | |

# WQD.7 - Conclusion

It does not appear that wine quality is well explained by the chemical properties: at each quality level the variability of the predictors is high, and the groups are not well separated.

**Food for thought!**

Think of some visualization techniques that may help in bringing out these features in the data.