WQD.4 - Applying Tree-Based Methods

Sample R code for Tree-based Models and Random Forest

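library(car)  # provides recode()

# Treat quality as a factor (assumes quality is in scope, e.g. via attach()) and append it
FactQ <- as.factor(quality)

WhiteWineLib <- cbind(WhiteWineLib, FactQ)

# Collapse the scores in two steps: 3-5 -> '10', 6 -> '20', everything else (7-9) -> '40'
temp <- recode(WhiteWineLib$FactQ, "c('3','4','5')='10'; c('6')='20'; else='40'")

# Relabel the three groups as '5' (Low), '6' (Medium) and '7' (High)
Ptemp <- recode(temp, "c('10')='5'; c('20')='6'; else='7'")

WhiteWineLib$FactQ <- Ptemp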

prop.table(table(WhiteWineLib$FactQ))

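library(tree)  # provides tree() and misclass.tree()

# FactQ is a factor, so a classification tree is fitted
WhiteWineTree <- tree(FactQ ~ fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol+density, data=WhiteWineLib, method="class")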

plot(WhiteWineTree)

text(WhiteWineTree, pretty=0, cex=0.6)

misclass.tree(WhiteWineTree, detail=T)

Treefit1 <- predict(WhiteWineTree, WhiteWineLib, type="class")

table(Treefit1, WhiteWineLib$FactQ)

library(randomForest)  # provides randomForest() and varImpPlot()

WWrf50_super <- randomForest(FactQ ~ . , data=WWTrain50T[,-12], ntree=150, importance=TRUE, proximity=TRUE)

WWTest50_rf_pred_super <- predict(WWrf50_super, WWTest50, type="class")

table(WWTest50_rf_pred_super, WWTest50$FactQ1)

plot(WWrf50_super, main="")

varImpPlot(WWrf50_super,  main="", cex=0.8)
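The random forest lines above refer to WWTrain50T and WWTest50 without showing how they were built. The sketch below shows one common way to make a random 50/50 split in base R. It assumes that the "50" in those names means a 50% split, and it uses a small hypothetical stand-in data frame (toy) rather than the wine data, so all names and values in it are illustrative only.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical stand-in for the wine data (not the real observations)
toy <- data.frame(alcohol = rnorm(20, mean = 10.5),
                  FactQ   = factor(sample(c("5", "6", "7"), 20, replace = TRUE)))

# Draw half of the row indices at random for training;
# the remaining rows form the test set
train_idx <- sample(nrow(toy), size = floor(0.5 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```

Because the test rows are selected with a negative index, no observation can appear in both sets.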

The response variable quality is treated as an ordinal variable rather than a continuous one. As noted earlier, the proportions of wines in the very low (4 or less) and very high (8 or above) quality categories are small.

category classification table

Hence the wines are grouped into three categories: scores 3, 4 and 5 are combined into Low, score 6 forms Medium, and scores 7, 8 and 9 are combined into High.
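As an illustration of this grouping, here is a minimal base-R sketch that uses cut() instead of the two-step recode() in the sample code. The quality vector is hypothetical, and the labels Low, Medium and High stand in for the codes 5, 6 and 7 used in the sample code.

```r
# Hypothetical quality scores; the real data are not reproduced here
quality <- c(3, 4, 5, 6, 6, 7, 8, 9, 5, 6)

# Intervals are closed on the right by default:
# scores up to 5 -> Low, exactly 6 -> Medium, 7 and above -> High
QualityClass <- cut(quality,
                    breaks = c(-Inf, 5, 6, Inf),
                    labels = c("Low", "Medium", "High"),
                    ordered_result = TRUE)

table(QualityClass)
prop.table(table(QualityClass))
```

Setting ordered_result = TRUE keeps the ordering Low < Medium < High, which matches treating quality as ordinal.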

The following classification tree is obtained:

R output

tree-based analysis plot

Applying the procedure to the test data gives the following misclassification table:

	Quality Classification
Test Data	Low	Medium	High
Low	371	277	38
Medium	214	495	251
High	19	167	205
Accuracy	(371 + 495 + 205) / 2037 = 1071 / 2037 ≈ 52.6%
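The accuracy calculation can be checked directly in R. The sketch below enters the counts from the table by hand and divides the diagonal total by the grand total. The object names (conf, row_agreement) are illustrative, and since the table does not state whether rows hold predicted or observed classes, the per-row figures are described only as row-wise agreement.

```r
# Counts copied from the table above, entered row by row
conf <- matrix(c(371, 277,  38,
                 214, 495, 251,
                  19, 167, 205),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("Low", "Medium", "High"),
                               c("Low", "Medium", "High")))

n_total   <- sum(conf)        # total number of test wines
n_correct <- sum(diag(conf))  # wines on the main diagonal
accuracy  <- n_correct / n_total

# Share of each row that falls on the diagonal
row_agreement <- diag(conf) / rowSums(conf)

round(accuracy, 3)
round(row_agreement, 3)
```

With these counts, n_correct is 1071 and n_total is 2037, so the overall accuracy is about 0.526.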