WQD.1 - Exploratory Data Analysis (EDA) and Data Pre-processing

All variables are summarized and univariate analysis with plots are shown below.
Histograms to show the distribution of the variable values:
Boxplots for each of the variables as another indicator of spread.
Observations regarding variables: All variables have outliers
- Quality has most values concentrated in the categories 5, 6 and 7. Only a small proportion is in the categories [3, 4] and [8, 9] and none in the categories [1, 2] and 10.
- Fixed acidity, volatile acidity and citric acid have outliers. If those outliers are eliminated distribution of the variables may be taken to be symmetric.
- Residual sugar has a positively skewed distribution; even after eliminating the outliers distribution will remain skewed.
- Some of the variables, e.g . free sulphur dioxide, density, have a few outliers but these are very different from the rest.
- Mostly outliers are on the larger side.
- Alcohol has an irregular shaped distribution but it does not have pronounced outliers.
Sample R code for
Summary Statistics & Correlations
These observations are supported by the summary statistics also, as shown in the following table:
Range is much larger compared to the IQR. Mean is usually greater than the median. These observations indicate that there are outliers in the data set and before any analysis is performed outliers must be taken care of.
Next we look at the bivariate analysis, including all pairwise scatterplot and correlation coefficients. Since the variables have non-normal distribution, we have considered both person and spearman rank correlations.
Table: Pearson’s Correlation
Table: Spearman Rank Correlation
Pearson’s correlation and rank correlations are very close, hence only the former is considered. High correlations (≥ 40% in absolute value) are identified and marked in red. Pairwise scatterplots are also shown below.
Scatterplot of Predictors
Sample R code for
Preparing Data
Data Preparation
Possibly the most important step in data preparation is to identify outliers. Since this is a multivariate data, we consider only those points which do not have any predictor variable value to be outside of limits constructed by boxplots. The following rule is applied:
- A predictor value is considered to be an outlier only if it is greater than Q3 + 1.5IQR
The rationale behind this rule is that the extreme outliers are all on the higher end of the values and the distributions are all positively skewed. Application of this rule reduces the data size from 4899 to 4074.
Data is randomly divided into Training data and Test Data of equal sizes (50% each).