2  Exploratory Data Analysis (EDA)


2.1 Overview

Exploratory Data Analysis (EDA) may also be described as data-driven hypothesis generation. Given a complex set of observations, often EDA provides the initial pointers towards various learning techniques. The data is examined for structures that may indicate deeper relationships among cases or variables.

In this lesson, we will focus on both aspects of EDA:

  • Numerical summarization
  • Data Visualization

This course is based on R software. There are several attractive features of R that make it a software of choice both in academia as well as in industry.

  • R is an open-source software and is free to download.
  • R is supported by 3,000+ packages to deal with large volumes of data in a wide variety of applications. For instance, the svd() function performs the singular value decomposition in a single line of code, whereas C, Java or Python need an external numerical library or considerably more code for the same task.
  • R is quite versatile. After an algorithm is developed in R, the program may be sped up by transforming the R codes into other languages.
  • R is a mainstream analytical tool.
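As a rough illustration of the single-line svd() call mentioned in the list above, the following sketch decomposes a small matrix and rebuilds it from its factors. The matrix A is made-up data chosen only for this example.

```r
# Hypothetical 3 x 2 matrix, used only for illustration
A <- matrix(c(2, 0, 1,
              1, 3, 1), nrow = 3, ncol = 2)

# Singular value decomposition in one call: A = U D V^T
s <- svd(A)

# Rebuild A from the factors to confirm the decomposition
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
max_error <- max(abs(A - A_rebuilt))
max_error
```

The reconstruction error should be at the level of floating-point rounding.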

References:

  • The Popularity of Data Analysis Software by R.A. Muenchen
  • R You Ready for R? by Ashlee Vance
  • R Programming for Data Science by Roger Peng

The following diagram shows that in recent times R is gaining popularity as monthly programming discussion traffic shows explosive growth of discussions regarding R.

The rise of R as a data science programming language from 1990 to 2010

R has a vibrant user community. As a result, R has the most website links pointing to it.

R can be installed from the CRAN website R-Project following the instructions. Downloading R-Studio is strongly recommended. To develop familiarity with R it is suggested to follow through the material in Introduction to R. For further information refer to the Course Syllabus. Other useful websites on R are Stack Overflow R Questions and R Seek.

One of the objectives of this course is to strengthen the basics in R. The R-Labs given in the textbook are followed closely. Along with the material in the text, two other features in R are introduced.

  • R Markdown: This allows users to knit R code and output directly into the document.
  • R library ggplot2: A very useful and sophisticated set of plotting functions to produce high-quality graphs.

Objectives

Upon successful completion of this lesson, you should be able to:


  • Demonstrate familiarity with R software.
  • Apply numerical and visual summarization of data.
  • Illustrate the importance of EDA before embarking on sophisticated model building.

2.2 What is Data

Introduction

Anything that is observed or conceptualized falls under the purview of data. In a somewhat restricted view, data is something that can be measured. Data represent facts or something that has actually taken place, been observed, and been measured. Data may come out of passive observation or active collection. Each data point must be rooted in a physical, demographical or behavioral phenomenon and must be unambiguous and measurable. Data is observed in each unit under study and stored in an electronic device.

Definition 2.1 (Data) denotes a collection of objects and their attributes

Definition 2.2 (Attribute) (feature, variable, or field) is a property or characteristic of an object

Definition 2.3 (Collection of Attributes) describe an object (individual, entity, case, or record)

ID  | Sex    | Education           | Income
--- | ------ | ------------------- | --------
248 | Male   | High School         | $100,000
249 | Female | High School         | $12,000
250 | Male   | College             | $23,000
251 | Male   | Child               | $0
252 | Female | High School         | $19,798
253 | Male   | High School         | $40,100
254 | Male   | Less than 1st Grade | $2,691
255 | Male   | Child               | $0
256 | Male   | 11th Grade          | $30,000
257 | Male   | Ph.D.               | $30,686

Each Row is an Object and each Column is an Attribute
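To make the object/attribute vocabulary concrete, here is a minimal sketch that stores the table above as an R data frame. The variable name `people` is an assumption introduced for this example; the values are copied from the table.

```r
# Each row is an object (a person); each column is an attribute
people <- data.frame(
  ID        = 248:257,
  Sex       = c("Male", "Female", "Male", "Male", "Female",
                "Male", "Male", "Male", "Male", "Male"),
  Education = c("High School", "High School", "College", "Child",
                "High School", "High School", "Less than 1st Grade",
                "Child", "11th Grade", "Ph.D."),
  Income    = c(100000, 12000, 23000, 0, 19798, 40100, 2691, 0, 30000, 30686)
)

n_objects    <- nrow(people)   # number of objects (rows)
n_attributes <- ncol(people)   # number of attributes (columns), ID included
str(people)                    # shows the storage type of each attribute
```

Because more than two attributes are recorded per person, this is an example of multivariate data.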

Often these attributes are referred to as variables. Attributes contain information regarding each unit of observation. Depending on how many different types of information are collected from each unit, the data may be univariate, bivariate or multivariate.

Data can have varied forms and structures but in one criterion they are all the same – data contains information and characteristics that separate one unit or observation from the others.

Types of Attributes

Definition 2.4 (Nominal) Qualitative variables that do not have a natural order, e.g. Hair color, Religion, Residence zipcode of a student

Definition 2.5 (Ordinal) Qualitative variables that have a natural order, e.g. Grades, Rating of a service rendered on a scale of 1-5 (1 is terrible and 5 is excellent), Street numbers in New York City

Definition 2.6 (Interval) Measurements where the difference between two values is meaningful, e.g. Calendar dates, Temperature in Celsius or Fahrenheit

Definition 2.7 (Ratio) Measurements where both difference and ratio are meaningful, e.g. Temperature in Kelvin, Length, Counts

Discrete and Continuous Attributes

Definition 2.8 (Discrete Attribute) A variable or attribute is discrete if it can take a finite or a countably infinite set of values. A discrete variable is often represented as an integer-valued variable. A binary variable is a special case where the attribute can assume only two values, usually represented by 0 and 1. Examples of a discrete variable are the number of birds in a flock; the number of heads realized when a coin is flipped 10 times, etc.

Definition 2.9 (Continuous Attribute) A variable or attribute is continuous if it can take any value in a given range; possibly the range being infinite. Examples of continuous variables are weights and heights of birds, the temperature of a day, etc.

In the hierarchy of data, nominal is at the lowermost rank as it carries the least information. The highest type of data is ratio since it contains the maximum possible information. While analyzing the data, it has to be noted that procedures applicable to a lower data type can be applied to a higher one, but the reverse is not true. An analysis procedure for nominal data can be applied to interval type data, but it is not recommended since such a procedure completely ignores the amount of information interval type data carries. Procedures developed for interval or even ratio type data, however, cannot be applied to nominal or ordinal data. A prudent analyst should recognize each data type and then decide on the methods applicable.
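The hierarchy can be sketched in R with factors and numeric vectors. All values below are made up for illustration; the point is only which operations are meaningful at each level.

```r
# Nominal: categories without order; counting is the meaningful operation
hair <- factor(c("brown", "black", "brown", "red"))
hair_counts <- table(hair)

# Ordinal: ordered categories; comparisons and max/min are meaningful
rating <- factor(c("good", "poor", "excellent", "good"),
                 levels = c("poor", "good", "excellent"), ordered = TRUE)
best      <- max(rating)
is_better <- rating[1] > rating[2]   # is "good" above "poor"?

# Interval: differences are meaningful, ratios are not (Celsius has no true zero)
temp_c <- c(10, 20)
diff_c <- temp_c[2] - temp_c[1]

# Ratio: Kelvin has a true zero, so ratios are meaningful
temp_k  <- temp_c + 273.15
ratio_k <- temp_k[2] / temp_k[1]     # about 1.035, not 2
```

Note that 20 degrees Celsius is not "twice as warm" as 10 degrees Celsius; the Kelvin ratio shows the actual proportion.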


2.3 Numerical Summarization

Summary Statistics

The vast amount of numbers collected on a large number of variables needs to be properly organized to extract information from it. Broadly speaking, there are two methods to summarize data: visual summarization and numerical summarization. Both have their advantages and disadvantages, and applied jointly they extract the maximum information from raw data.

Summary statistics are numbers computed from the sample that present a summary of the attributes.

Measures of Location

They are single numbers representing a set of observations. Measures of location also include measures of central tendency. Measures of central tendency can also be taken as the most representative values of the set of observations. The most common measures of location are the Mean, the Median, the Mode, and the Quartiles.

Definition 2.10 (Mean) the arithmetic average of all the observations. The mean equals the sum of all observations divided by the sample size

Definition 2.11 (Median) the middle-most value of the ranked set of observations so that half the observations are greater than the median and the other half is less. Median is a robust measure of central tendency

2.3.1 Mode

The most frequently occurring value in the data set. This makes more sense when attributes are not continuous.

2.3.2 Quartiles

Division points that split the data into four equal parts after rank-ordering them.

The division points are called Q1 (the first quartile), Q2 (the second quartile or median), and Q3 (the third quartile).

Note!
Quartiles are not necessarily equidistant points on the range of the sample.

Similarly, Deciles and Percentiles are defined as division points that divide the rank-ordered data into 10 and 100 equal segments.

Note! The mean is very sensitive to outliers (extreme or unusual observations) whereas the median is not. The mean is affected if even a single observation is changed. The median, on the other hand, has a 50% breakdown point, which means that it cannot be pushed arbitrarily far from the bulk of the data unless at least 50% of the values in the sample are changed.
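A small sketch of this sensitivity, using a made-up sample: replacing one observation with an extreme value moves the mean a long way while the median stays put.

```r
# Hypothetical sample
x <- c(2, 4, 5, 7, 9)

# Same sample with the largest value replaced by an extreme outlier
x_out <- x
x_out[which.max(x)] <- 900

mean_before   <- mean(x)        # 5.4
mean_after    <- mean(x_out)    # 183.6
median_before <- median(x)      # 5
median_after  <- median(x_out)  # 5
```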

Measures of Spread

Measures of location are not enough to capture all aspects of the attributes. Measures of dispersion are necessary to understand the variability of the data. The most common measures of dispersion are the Variance, the Standard Deviation, the Interquartile Range, and the Range.

Definition 2.12 (Variance) measures how far data values lie from the mean. It is defined as the average of the squared differences between the mean and the individual data values; for the sample variance, the sum of squared differences is divided by n - 1 rather than n

Definition 2.13 (Standard Deviation) is the square root of the variance. It can be interpreted as a typical distance between the mean and the individual data values, expressed in the original units of the data

Definition 2.14 (Interquartile range (IQR)) is the difference between Q3 and Q1. IQR contains the middle 50% of data

Definition 2.15 (Range) is the difference between the maximum and minimum values in the sample

Measures of Skewness

In addition to the measures of location and dispersion, the arrangement of data or the shape of the data distribution is also of considerable interest. The most ‘well-behaved’ distribution is a symmetric distribution where the mean and the median are coincident. The symmetry is lost if there exists a tail in either direction. Skewness measures whether or not a distribution has a single long tail.

Skewness is measured as: \[ \dfrac{\sqrt{n} \sum_{i=1}^{n} \left(x_{i} - \bar{x} \right)^{3}}{\left(\sum_{i=1}^{n} \left(x_{i} - \bar{x} \right)^{2}\right)^{\frac{3}{2}}}\]

The figure below gives examples of symmetric and skewed distributions. Note that these diagrams are generated from theoretical distributions and in practice one is likely to see only approximations.

Figure panels: example of a symmetric distribution; example of a right-skewed distribution; example of a left-skewed distribution

Try it!

Calculate the answers to these questions before checking them against the answer below.

Suppose we have the data: 3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8. Calculate the following summary statistics:

  • Mean
  • Median
  • Mode
  • Q1 and Q3
  • Variance and Standard Deviation
  • IQR
  • Range
  • Skewness

Answer

  • Mean: (3+5+6+9+0+10+1+3+7+4+8)/11= 5.091.
  • Median: The ordered data is 0, 1, 3, 3, 4, 5, 6, 7, 8, 9, 10. Thus, 5 is the median.
  • Mode: 3.
  • Q1 and Q3: Q1 is 3 and Q3 is 8.
  • Variance and Standard Deviation: Variance is 10.491 (\(=((3-5.091)^2+\ldots+(8-5.091)^2)/10\)). Thus, the standard deviation is the square root of 10.491, i.e. 3.239.
  • IQR: Q3-Q1=8-3=5.
  • Range: max-min=10-0=10.
  • Skewness: -0.03.
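The answers above can be checked in R with the sketch below. One assumption to flag: quartile conventions differ between textbooks and software. The call uses `type = 6` in quantile() because that convention reproduces Q1 = 3 and Q3 = 8 for this sample, whereas R's default (`type = 7`) gives Q3 = 7.5. The skewness line implements the formula given in this section directly.

```r
x <- c(3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8)
n <- length(x)

x_mean   <- mean(x)
x_median <- median(x)

# Mode: the value with the highest frequency
freq   <- table(x)
x_mode <- as.numeric(names(freq)[which.max(freq)])

# Quartiles (type = 6 is an assumption chosen to match the answer above)
q  <- unname(quantile(x, c(0.25, 0.75), type = 6))
q1 <- q[1]
q3 <- q[2]

x_var   <- var(x)          # sample variance, divisor n - 1
x_sd    <- sd(x)
x_iqr   <- q3 - q1
x_range <- max(x) - min(x)

# Skewness: sqrt(n) * sum(dev^3) / (sum(dev^2))^(3/2)
dev    <- x - x_mean
x_skew <- sqrt(n) * sum(dev^3) / sum(dev^2)^(3 / 2)

round(c(mean = x_mean, median = x_median, mode = x_mode, Q1 = q1, Q3 = q3,
        var = x_var, sd = x_sd, IQR = x_iqr, range = x_range, skew = x_skew), 3)
```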

Measures of Correlation

All the above summary statistics are applicable only for univariate data where information on a single attribute is of interest. Correlation describes the degree of the linear relationship between two attributes, X and Y.

With X taking the values x(1), … , x(n) and Y taking the values y(1), … , y(n), the sample correlation coefficient is defined as: \[\rho (X,Y)=\dfrac{\sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )\left ( y(i)-\bar{y} \right )}{\left( \sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )^2\sum_{i=1}^{n}\left ( y(i)-\bar{y} \right )^2\right)^\frac{1}{2}}\]

The correlation coefficient is always between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). If the correlation coefficient is 0, then there is no linear relationship between X and Y.
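As a sketch of the definition, the code below evaluates the formula term by term on made-up data and compares the result with R's built-in cor(). The second part uses a perfectly quadratic relationship to show that a correlation of 0 rules out only a linear relationship, not every relationship.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sample correlation computed directly from the definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent (Pearson correlation is the default)
r_builtin <- cor(x, y)

# A perfect but nonlinear relationship: y2 is fully determined by x
y2 <- (x - 3)^2
r_quadratic <- cor(x, y2)
```

Here `r_manual` and `r_builtin` agree and are close to +1, while `r_quadratic` is 0 even though y2 depends on x exactly.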

In the figure below a set of representative plots are shown for various values of the population correlation coefficient ρ ranging from - 1 to + 1. At the two extreme values, the relation is a perfectly straight line. As the value of ρ approaches 0, the elliptical shape becomes round and then it moves again towards an elliptical shape with the principal axis in the opposite direction.

example correlation coefficients



2.3.3 Measures of Similarity and Dissimilarity

Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure quantifies how close two distributions are, and a dissimilarity measure quantifies how far apart they are. For multivariate data, more complex summary methods have been developed to answer this question.

Definition 2.16 (Similarity Measure) Numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity)

Definition 2.17 (Dissimilarity Measure) Numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to \(\infty\) (objects are different)

Definition 2.18 (Proximity) refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

Attribute Type | Similarity | Dissimilarity
--- | --- | ---
Nominal | \(s=\begin{cases} 1 & \text{if } p=q \\ 0 & \text{if } p\neq q \end{cases}\) | \(d=\begin{cases} 0 & \text{if } p=q \\ 1 & \text{if } p\neq q \end{cases}\)
Ordinal (values mapped to integers 0 to n-1, where n is the number of values) | \(s=1-\dfrac{\lvert p-q \rvert}{n-1}\) | \(d=\dfrac{\lvert p-q \rvert}{n-1}\)
Interval or Ratio | \(s=1-\lvert p-q \rvert\), \(s=\frac{1}{1+\lvert p-q \rvert}\) | \(d=\lvert p-q \rvert\)

Common Properties of Dissimilarity Measures

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:

  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
  2. d(p, q) = d(q, p) for all p and q,
  3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r,

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

Euclidean Distance

Assume that we have measurements \(x_{ik}\), \(i = 1 , \ldots , N\), on variables \(k = 1 , \dots , p\) (also called attributes).

The Euclidean distance between the ith and jth objects is \[d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]

for every pair (i, j) of observations.

The weighted Euclidean distance is: \[d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]

If scales of the attributes differ substantially, standardization is necessary.
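The effect of differing scales can be sketched as follows. The data are made up (height in cm, income in dollars), and the helper `weighted_euclid` is a hypothetical function written for this example. Choosing the weights \(W_k\) as the reciprocal of each attribute's variance is one common option, and it coincides with the Euclidean distance on standardized data.

```r
# Hypothetical objects measured on two attributes with very different scales
X <- rbind(a = c(170, 30000),
           b = c(180, 30500),
           c = c(171, 45000))
colnames(X) <- c("height", "income")

# Raw Euclidean distances are dominated by income
d_raw <- as.matrix(dist(X))

# Standardize each column (mean 0, standard deviation 1), then recompute
Z     <- scale(X)
d_std <- as.matrix(dist(Z))

# Weighted Euclidean distance with W_k = 1 / variance of attribute k
w <- 1 / apply(X, 2, var)
weighted_euclid <- function(xi, xj, w) sqrt(sum(w * (xi - xj)^2))
d_w_ab <- weighted_euclid(X["a", ], X["b", ], w)

d_raw
d_std
```

With these weights, `d_w_ab` equals the standardized distance `d_std["a", "b"]`.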

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With the measurement, \(x _ { i k } , i = 1 , \dots , N , k = 1 , \dots , p\), the Minkowski distance is \[d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda} \]

where \(\lambda \geq 1\). It is also called the \(L_\lambda\) metric.

  • \(\lambda = 1 : L_1\) metric, Manhattan or City-block distance.
  • \(\lambda = 2 : L_2\) metric, Euclidean distance.
  • \(\lambda \rightarrow \infty : L_\infty\) metric, Supremum distance. \[ \lim_{\lambda \to \infty} \left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} = \max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right) \]

Note that λ and p are two different parameters. Dimension of the data matrix remains finite.
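The three special cases can be sketched with a small helper. The function `minkowski` and the two points are assumptions introduced for this example; the last line illustrates numerically that a large finite \(\lambda\) already comes very close to the supremum distance.

```r
# Minkowski distance between two numeric vectors; lambda = Inf gives the supremum distance
minkowski <- function(xi, xj, lambda) {
  stopifnot(lambda >= 1)
  diffs <- abs(xi - xj)
  if (is.infinite(lambda)) return(max(diffs))
  sum(diffs^lambda)^(1 / lambda)
}

# Hypothetical points in p = 2 dimensions
xi <- c(0, 0)
xj <- c(3, 4)

d_l1   <- minkowski(xi, xj, 1)     # Manhattan: 3 + 4 = 7
d_l2   <- minkowski(xi, xj, 2)     # Euclidean: sqrt(9 + 16) = 5
d_linf <- minkowski(xi, xj, Inf)   # Supremum: max(3, 4) = 4
d_l50  <- minkowski(xi, xj, 50)    # close to the supremum distance
```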

Mahalanobis Distance

Let X be an N × p matrix. Then the \(i^{th}\) row of X is \[x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\]

The Mahalanobis distance is \[d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\]

where \(\Sigma\) is the p×p sample covariance matrix.

Try it!

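Calculate the answers to these questions by yourself before checking them against the answer below. The data consist of three objects measured on five variables (values as used in the answer):

Object | Var 1 | Var 2 | Var 3 | Var 4 | Var 5
--- | --- | --- | --- | --- | ---
1 | 1 | 3 | 1 | 2 | 4
2 | 1 | 2 | 1 | 2 | 1
3 | 2 | 2 | 2 | 2 | 2

  1. Calculate the Euclidean distances.
  2. Calculate the Minkowski distances (\(\lambda = 1\text{ and }\lambda\rightarrow\infty\) cases).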

Answer

  1. Euclidean distances are: \[d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162\]

\[d_{ E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646\]

\[d_{ E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732\]

  2. Minkowski distances (when \(\lambda = 1\) ) are:

\[d_ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4\]

\[d_ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5\]

\[d_ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3\]

Minkowski distances \(( \text { when } \lambda \rightarrow \infty )\) are:

\[d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3\]

\[d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1\]
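These hand calculations can be cross-checked with R's dist() function. The sketch below enters the three objects as they appear in the worked answer; the `method` names "euclidean", "manhattan" and "maximum" are the options dist() provides for \(\lambda = 2\), \(\lambda = 1\) and \(\lambda \rightarrow \infty\).

```r
# The three objects used in the answer above (rows = objects, columns = variables)
X <- rbind(c(1, 3, 1, 2, 4),
           c(1, 2, 1, 2, 1),
           c(2, 2, 2, 2, 2))

d_euclid    <- as.matrix(dist(X, method = "euclidean"))  # lambda = 2
d_manhattan <- as.matrix(dist(X, method = "manhattan"))  # lambda = 1
d_supremum  <- as.matrix(dist(X, method = "maximum"))    # lambda -> infinity

round(d_euclid, 3)
d_manhattan
d_supremum
```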

Try it!

Consider three objects measured on two variables: (2, 3), (10, 7), and (3, 2).

  1. Calculate the Minkowski distance (\(\lambda = 1\), \(\lambda = 2\), and \(\lambda \rightarrow \infty\) cases) between the first and second objects.
  2. Calculate the Mahalanobis distance between the first and second objects.

Answer

  1. Minkowski distance is:

\[\lambda = 1: \quad d_{M}(1,2) = | 2 - 10 | + | 3 - 7 | = 12\]

\[\lambda = 2: \quad d_{M}(1,2) = d_{E}(1,2) = \left( ( 2 - 10 )^{2} + ( 3 - 7 )^{2} \right)^{1/2} = 8.944\]

\[\lambda \rightarrow \infty: \quad d_{M}(1,2) = \max ( | 2 - 10 | , | 3 - 7 | ) = 8\]

  2. Since \(\Sigma = \left( \begin{array} { l l } { 19 } & { 11 } \\ { 11 } & { 7 } \end{array} \right)\) we have \(\Sigma ^ { - 1 } = \left( \begin{array} { c c } { 7 / 12 } & { - 11 / 12 } \\ { - 11 / 12 } & { 19 / 12 } \end{array} \right)\). With \(x_1 - x_2 = (-8, -4)^T\), the quadratic form is \(\left( 7 \cdot 64 - 2 \cdot 11 \cdot 32 + 19 \cdot 16 \right) / 12 = 48 / 12 = 4\), so the Mahalanobis distance is \(d _ { M H } ( 1,2 ) = \sqrt{4} = 2\).
  • R code for Mahalanobis distance
    # Mahalanobis distance calculation 
    d1 = c(2, 3) # each observation 
    d2 = c(10, 7) 
    d3 = c(3, 2) 
    # Get covariance matrix using "ALL" observations 
    cov_all = cov(rbind(d1, d2, d3)) 
    cov_all 
    # Inverse covariance matrix is given by: 
    solve(cov_all) 
    # Mahalanobis distance is given by: 
    Mahalanobis_dist = sqrt( matrix(d1-d2,1,2)%*%solve(cov_all)%*%matrix(d1-d2,2,1) ) 
    Mahalanobis_dist 

Common Properties of Similarity Measures

Similarities have some well-known properties:

  1. s(p, q) = 1 (or maximum similarity) only if p = q,
  2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data objects, p and q.

Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

      | q = 1       | q = 0
----- | ----------- | -----------
p = 1 | \(n_{1,1}\) | \(n_{1,0}\)
p = 0 | \(n_{0,1}\) | \(n_{0,0}\)

Simple Matching and Jaccard Coefficients

  • Simple matching coefficient \(= \left( n _ { 1,1 } + n _ { 0,0 } \right) / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } + n _ { 0,0 } \right)\).
  • Jaccard coefficient \(= n _ { 1,1 } / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } \right)\).

Try it!

Calculate the answers to the question before checking them against the answer below.

Given data:

  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1

The frequency table is:

      | q = 1 | q = 0
----- | ----- | -----
p = 1 | 0     | 1
p = 0 | 2     | 7

Calculate the Simple matching coefficient and the Jaccard coefficient.

Answer

  • Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
  • Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
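The same calculation can be sketched in R by counting the four cells of the frequency table directly from the two binary vectors. The variable names are assumptions introduced for this example.

```r
p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

# Cell counts of the 2 x 2 frequency table
n11 <- sum(p == 1 & q == 1)
n10 <- sum(p == 1 & q == 0)
n01 <- sum(p == 0 & q == 1)
n00 <- sum(p == 0 & q == 0)

# Simple matching counts joint absences (0, 0) as agreement; Jaccard ignores them
smc     <- (n11 + n00) / (n11 + n10 + n01 + n00)
jaccard <- n11 / (n11 + n10 + n01)
```

The gap between the two coefficients (0.7 versus 0) comes entirely from the seven positions where both vectors are 0.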

2.4 Visualization

To understand thousands of rows of data in a limited time, there is no alternative to visual representation. The objective of visualization is to reveal hidden information through simple charts and diagrams. Visual representation of data is the first step toward data exploration and the formulation of an analytical relationship among the variables. In a whirl of complex and voluminous data, visualization in one, two, and three dimensions helps data analysts sift through data in a logical manner and understand the data dynamics. It is instrumental in identifying patterns and relationships among groups of variables. Visualization techniques depend on the type of variables: techniques available to represent nominal variables are generally not suitable for visualizing continuous variables, and vice versa. Data often contain complex information, which is easier to internalize visually. Graphs, charts, and other visual representations provide quick and focused summarization.

Tools for Displaying Single Variables

Histogram

Histograms are the most common graphical tool to represent continuous data. On the horizontal axis, the range of the sample is plotted. On the vertical axis are plotted the frequencies or relative frequencies of each class. The class width has an impact on the shape of the histogram. The histograms in the previous section were drawn from a random sample generated from theoretical distributions. Here we consider a real example to construct histograms.

The dataset used for this purpose is the Wage data that is included in the ISLR package in R. A full description of the data is given in the package. The following R code produces the figure below which illustrates the distribution of wages for all 3000 workers.

  • Sample R code for Distribution of Wage
    library(ISLR)
    with(Wage, hist(wage, nclass=20, col="grey", border="navy", main="", xlab="Wage", cex=1.2))
    # cex.main (rather than cex) controls the size of the main title
    title(main = "Distribution of Wage", cex.main=1.2, col.main="navy", font.main=4)
Histogram showing the distribution of the wages of all 3000 workers.
Fig 2.3: Distribution of Wage

The data is mostly symmetrically distributed but there is a small bimodality in the data which is indicated by a small hump towards the right tail of the distribution.

The data set contains a number of categorical variables one of which is Race. A natural question is whether the wage distribution is the same across Race. There are several libraries in R which may be used to construct histograms across levels of categorical variables and many other sophisticated graphs and charts. One such library is ggplot2. Details of the functionalities of this library will be given in the R code below.

In the following figures, histograms are drawn for each Race separately.

  • Sample R code for Histogram of Wage by Race
    library(ggplot2)
    library(ISLR)
    p <- ggplot(data = Wage, aes(x = wage))
    p <- p + geom_histogram(binwidth = 25, aes(fill = race))
    p <- p + scale_fill_brewer(palette = "Set1")
    p <- p + facet_wrap(~ race, ncol = 2)
    p <- p + labs(x = "Wage", y = "Frequency") +
      theme(axis.title = element_text(color = "black", face = "bold"))
    p <- p + ggtitle("Histogram of Wage by Race") +
      theme(plot.title = element_text(color = "black", face = "bold", size = 16))
    p
Histogram showing the distribution of the wages of all 3000 workers grouped by race.
Fig 2.4: Distribution of Wage by Race

Because of the huge disparity among the counts of the different races, the above histograms may not be very informative. Code for an alternative visual display of the same information is shown below, followed by the plot.

  • Sample R code for Histogram of Wage by Race (Alternative)
    library(ggplot2)
    library(ISLR)
    p1 <- ggplot(Wage, aes(x = wage, fill = race)) +
      geom_histogram(binwidth = 15, position = "identity")
    p11 <- p1 + scale_fill_manual(values = alpha(c("mediumvioletred", "navy", "green", "red"), 0.5))
    p2 <- p11 + labs(x = "Wage", y = "Frequency") +
      theme(axis.title = element_text(color = "black", face = "bold"))
    p3 <- p2 + ggtitle("Histogram of Wage by Race") +
      theme(plot.title = element_text(color = "black", face = "bold", size = 16))
    p3
Histogram showing the distribution of the wages of all 3000 workers grouped by race.
Fig 2.5: Distribution of Wage by Race

The second type of histogram also may not be the best way of presenting all the information. However further clarity is seen in a small concentration at the right tail.

Boxplot

A boxplot is used to describe the shape of a data distribution and especially to identify outliers. Typically an observation is an outlier if it is either less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where IQR is the inter-quartile range defined as Q3 - Q1. This rule is conservative and often too many points are identified as outliers. Hence sometimes only those points outside of [Q1 - 3 IQR, Q3 + 3 IQR] are identified as outliers.
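The two fence rules can be sketched on a small made-up sample. The helper `flag_outliers` is a hypothetical function written for this example, and the quartiles come from quantile() with its default settings; boxplot() itself computes its box from hinges, which can differ slightly from these quartiles.

```r
# Hypothetical wages (in thousands of dollars)
wages <- c(52, 61, 70, 75, 80, 86, 90, 95, 104, 150, 310)

q   <- unname(quantile(wages, c(0.25, 0.75)))  # Q1 and Q3
iqr <- q[2] - q[1]

# Return the observations outside [Q1 - k * IQR, Q3 + k * IQR]
flag_outliers <- function(x, k) x[x < q[1] - k * iqr | x > q[2] + k * iqr]

mild    <- flag_outliers(wages, 1.5)  # the usual 1.5 IQR rule
extreme <- flag_outliers(wages, 3)    # the stricter 3 IQR rule
```

With this sample, the 1.5 IQR rule flags 150 and 310, while the 3 IQR rule flags only 310.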

  • Sample R code for Boxplot of Distribution of Wage
    library(ISLR)
    with(Wage, boxplot(wage, col="grey", border="navy", main="", xlab="Wage", pch=19, cex=0.8))
    # cex.main (rather than cex) controls the size of the main title
    title(main = "Distribution of Wage", cex.main=1.2, col.main="navy", font.main=4)
Boxplot showing the distribution of the wages of all 3000 workers.
Fig 2.6: Distribution of Wage

The boxplot of the Wage distribution clearly identifies many outliers. It is a reflection of the histogram depicting the distribution of Wage. The story is clearer from the boxplots drawn on the wage distribution for individual races. Here is the R code, followed by the boxplot that results:

  • Sample R code for Boxplot of Wage by Race

    # Boxplot: Wage data by race
    library(ggplot2)
    library(ISLR)
    p1 <- ggplot(Wage, aes(x = race, y = wage)) + geom_boxplot()
    p2 <- p1 + labs(x = "Race", y = "Wage") +
      theme(axis.title = element_text(color = "black", face = "bold", size = 12))
    p3 <- p2 + ggtitle("Boxplot of Wage by Race") +
      theme(plot.title = element_text(color = "black", face = "bold", size = 16))
    p3
Boxplot showing the distribution of the wages of all 3000 workers grouped by race.
Fig 2.7: Boxplots of Wage by Race

Tools for Displaying Relationships Between Two Variables

Scatterplot

The most standard way to visualize relationships between two variables is a scatterplot. It shows the direction and strength of association between two variables but does not quantify it. Scatterplots also help to identify unusual observations. In the previous section (Measures of Correlation), a set of scatterplots is drawn for different values of the correlation coefficient. The data there are generated from a multivariate normal distribution with various values of the correlation parameter.

The following R code produces the scatterplot of the variables Age and Wage for the Wage data:

  • Sample R Code for Relationship of Age and Wage
    library(ISLR)
    with(Wage, plot(age, wage, pch = 19, cex = 0.6))
    title(main = "Relationship between Age and Wage")
Scatterplot between Age and Wage
Fig 2.8: Relationship between Age and Wage

It is clear from the scatterplot that the Wage does not seem to depend on Age very strongly. However, a set of points towards the top are very different from the rest. A natural follow-up question is whether Race has any impact on the Age-Wage dependency or the lack of it. Here is the R code and then the new plot:

  • Sample R Code for Relationship of Age and Wage
    # Scatterplot: Wage vs. Age by race
    library(ISLR)
    # One color per level of race, reused for the points and the legend
    race.colors <- c("lightgreen", "navy", "mediumvioletred", "red")
    with(Wage, plot(age, wage, col = race.colors[race], pch = 19, cex = 0.6))
    legend(70, 310, legend = levels(Wage$race), col = race.colors, bty = "n", cex = 0.7, pch = 19)
    title(main = "Relationship between Age and Wage by Race")
Scatterplot between Age and Wage by Race
Fig 2.9: Relationship between Age and Wage by Race

We have noted before that the disproportionately high number of Whites in the data masks the effects of the other races. There does not seem to be any association between Age and Wage, controlling for Race.

Contour plot

A contour plot is useful when a continuous attribute is measured on a spatial grid. Contour plots partition the plane into regions of similar values. The contour lines that form the boundaries of these regions connect points with equal values. In spatial statistics, contour plots have a lot of applications.

For a bivariate distribution, contour lines join points of equal probability density, so the density is constant along each contour line. One may think of the contour lines as slices of a bivariate density, sliced horizontally. Contour lines are concentric; if they are perfect circles then the random variables are uncorrelated (and, for bivariate normal data, independent). The more oval-shaped and tilted they are, the farther the variables are from independence. Note the conceptual similarity to the scatterplot series shown under Measures of Correlation. In the following plot, the two disjoint shapes in the interior-most part indicate that a small part of the data is very different from the rest.

Here is the R code for the contour plot that follows:

  • Sample R Code for Contour Plot of Age and Wage
    # Contour Plot: Age and Wage
    library(ggplot2)
    library(ISLR)
    d0 <- ggplot(Wage, aes(age, wage)) + stat_density2d()
    d0 <- d0 + labs(x = "Age", y = "Wage") +
      theme(axis.title = element_text(color = "black", face = "bold"))
    d0 + ggtitle("Contour Plot of Age and Wage") +
      theme(plot.title = element_text(color = "black", face = "bold", size = 16))
Contour Plot of Age and Wage
Fig 2.10: Contour Plot of Age and Wage

Tools for Displaying More Than Two Variables

Scatterplot Matrix

Displaying more than two variables on a single scatterplot is not possible. A scatterplot matrix is one possible visualization of three or more continuous variables taken two at a time.

The data set used to display the scatterplot matrix is the College data that is included in the ISLR package. A full description of the data is given in the package. Here is the R code for the scatterplot matrix that follows:

  • Sample R Code for Scatterplot Matrix of College Attributes
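    library(ISLR)
    library(car)
    # Argument names below assume car version 3.0 or later; older releases used
    # diagonal="boxplot", reg.line=FALSE, smoother=FALSE instead.
    X <- with(College, cbind(Apps, Accept, Enroll, Room.Board, Books))
    scatterplotMatrix(X, diagonal=list(method="boxplot"), regLine=FALSE, smooth=FALSE, pch=19, cex=0.6, col="blue")
    title(main="Scatterplot Matrix of College Attributes", col.main="navy", font.main=4, line=3)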
Scatterplot Matrix of College Attributes
Fig 2.11: Scatterplot Matrix of College Attributes

Parallel Coordinates

An innovative way to present multiple dimensions in the same figure is by using parallel coordinate systems. Each dimension is presented by one coordinate and instead of plotting coordinates at the right angle to one another, each coordinate is placed side-by-side. The advantage of such an arrangement is that many different continuous and discrete variables can be handled within a parallel coordinate system, but if the number of observations is too large, the profiles do not separate out from one another and patterns may be missed.

The illustration below corresponds to the Auto data from the ISLR package. Only 35 cars are considered but all dimensions are taken into account. The cars considered are different varieties of Toyota and Ford, categorized into two groups: produced before 1975 and produced in 1975 or after. The older models are represented by dotted lines whereas the newer cars are represented by dashed lines. The Fords are represented by blue color and Toyotas are represented by pink color. Here is the R code for the profile plot of this data that follows:

  • Sample R Code for Profile Plot of Toyota and Ford Cars
    library(ISLR)
    library(MASS)
    # Using the Auto data in ISLR, string match auto names on "toyota" and "ford"
    # and work with the corresponding data subset. Also, create the variable Make.
    Comp1 = Auto[c(grep("toyota", Auto$name), grep("ford", Auto$name)), ]
    # Derive Make from the name itself instead of hard-coding the group sizes
    Comp1$Make = ifelse(grepl("toyota", Comp1$name), "Toyota", "Ford")
    Y = with(Comp1, cbind(cylinders, weight, horsepower, displacement, acceleration, mpg))
    # Colors by condition:
    car.colors = ifelse(test = Comp1$Make=="Ford", yes = "blue", no = "magenta")
    # Line type by condition:
    car.lty = ifelse(test = Comp1$year < 75, yes = "dotted", no = "longdash")
    parcoord(Y, col = car.colors, lty = car.lty, var.label=TRUE)
    mtext("Profile Plot of Toyota and Ford Cars", line = 2)
Fig 2.12: Profile Plot of Toyota and Ford Cars

The differences among the four groups are very clear from the figure. Early Ford models had 8 cylinders, were heavy, and had high horsepower and displacement. Naturally, they had low MPG and took less time to accelerate. No Toyota belonged to this category. All the Toyota cars were built after 1975, have 4 cylinders (with one exception), and their MPG falls in the upper half of the distribution. Note that only 35 cars are compared in the profile plot, so each car can be followed across all the attributes. Had the number of observations been higher, however, the distinction among the profiles would have been lost and the plot would not have been informative.
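
Variables as different as cylinders and weight can share one plot because each axis is rescaled separately: parcoord() maps every column to the range from its minimum to its maximum before drawing the profiles. The sketch below reproduces that idea by hand. The helper rescale01 and the object names are introduced here for illustration, and matplot() stands in for parcoord(), so this is a sketch of the technique rather than the code behind the figure.

```r
library(ISLR)  # Auto data

# Toyota and Ford cars, with the same six attributes as in the profile plot
toyota_ford <- Auto[grepl("toyota|ford", Auto$name), ]
Y <- as.matrix(toyota_ford[, c("cylinders", "weight", "horsepower",
                               "displacement", "acceleration", "mpg")])

# Hypothetical helper: min-max rescaling of one column to [0, 1]
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
Y01 <- apply(Y, 2, rescale01)

# One line per car: after transposing, each column of t(Y01) is a car
# and each row is an axis of the parallel coordinate system
matplot(t(Y01), type = "l", lty = 1, col = "grey40",
        xaxt = "n", xlab = "", ylab = "Rescaled value")
axis(1, at = seq_len(ncol(Y01)), labels = colnames(Y01))
```

Because every axis runs from its own minimum to its own maximum, the plot compares relative positions within each attribute, not raw magnitudes across attributes.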

Interesting Multivariate Plots

The following are some interesting visualizations of multivariate data. In a star plot, each observation is drawn as a star whose shape is determined by its attribute values. Each axis represents one attribute, and the solid lines mark each item’s value on that attribute. All attributes of the observations can be represented; however, for the sake of clarity only 10 attributes are chosen for the graph.

Again, the R code for generating the plot is given first, followed by the star plot:

  • Sample R Code for Starplot of College Data
    library(ISLR)
    library(graphics)
    library(grDevices)
    CollegeSmall = College[College$Enroll <= 100,] ## From the College data in ISLR
    stars(CollegeSmall, labels=NULL)
    mtext("Starplot of College Data",line=2)
Fig 2.13: Starplot of College Data
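
The sample code above passes every column of the subset to stars(). A minimal sketch of restricting the plot to 10 attributes is given below. Which 10 attributes were used for the figure is not stated, so taking the first ten numeric columns is purely an assumption made for illustration, as are the object names.

```r
library(ISLR)  # College data

CollegeSmall <- College[College$Enroll <= 100, ]

# Assumption: the first ten numeric columns stand in for the
# "10 attributes" mentioned in the text; the actual choice is not given
is_num <- vapply(CollegeSmall, is.numeric, logical(1))
ten_attrs <- names(CollegeSmall)[is_num][1:10]

# Each star has one ray per attribute; stars() rescales each column
# to [0, 1] by default (scale = TRUE)
stars(CollegeSmall[, ten_attrs], labels = NULL,
      main = "Starplot of College Data (10 attributes)")
```

Supplying the key.loc argument of stars() adds a reference star that labels the rays, which makes individual attributes easier to read.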

Another interesting technique for plotting multivariate data is the Chernoff face, in which the attributes of each observation are used to draw different features of a face. A set of 24 small colleges and universities from the College dataset (the code below labels 24 institutions in a 6 by 4 grid) is compared below.

Again, the R code is given first, followed by the plot:

  • Sample R Code for Comparison of Colleges and Universities
    library(ISLR)
    library(TeachingDemos)
    CollegeSmall = College[College$Enroll <= 100,] ## From the College data in ISLR
    ## Create shorter labels for display
    ShortLabels = c("Alaska Pacific", "Capitol", "Centenary", "Central Wesleyan", "Chatham", "Christendom", "Notre Dame", "St. Joseph", "Lesley", "McPherson", "Mount Saint Clare", "Saint Francis IN", "Saint Joseph", "Saint Mary-of-the-Woods", "Southwestern", "Spalding", "St. Martin's", "Tennessee Wesleyan", "Trinity DC", "Trinity VT", "Ursuline", "Webber", "Wilson",  "Wisconsin Lutheran")
    faces(CollegeSmall[,-c(1:4)], scale=TRUE, nrow=6, ncol=4, labels=ShortLabels)
    mtext("Comparison of Selected Colleges and Universities",line=2)
Fig 2.14: Comparison of Colleges and Universities

Chernoff faces are a useful technique for comparing a small number of observations on up to 15 attributes. However, judging whether two items are more or less similar depends on a subjective reading of the faces.


2.5 R Scripts

This course requires a fair amount of R coding. The textbook takes the reader through the R code relevant to each chapter in a step-by-step manner. Sample R code is also provided in the Visualization section. This section gives a brief introduction to a few important and useful features of R.

Introductions to R are available at Statistical R Tutorials and CRAN R Project. There are many other online resources available for R. R users’ groups are thriving and highly communicative. A few additional resources are mentioned in the Course Syllabus.

One of the most important features of R is its libraries (packages). They can be downloaded freely from the CRAN site. It is not possible to list all, or even most, R packages: the list is ever-changing as the R user community continuously builds and refines the available packages. The link below is a good starting point for a list of packages for data manipulation and visualization.

R Studio Useful Packages

R Library: ggplot2

R has many packages and plotting options for data visualization, but few, if any, produce statistical graphics as polished and as customizable as ggplot2 does. It differs from most other graphics packages in that it has a deep underlying grammar, based on the Grammar of Graphics (Wilkinson, 2005). This grammar consists of a set of independent components that can be combined in many different ways. This makes ggplot2 very powerful because the user is not limited to a set of pre-specified graphics. Plots can be built up iteratively and edited later. The package is designed to work in a layered fashion, starting with a layer showing the raw data and then adding layers of annotations and statistical summaries.

The grammar of graphics is an answer to a question: what is a statistical graphic?

In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that makes up a graphic.

A brief description of the main components is given below:

  • The data and a set of aesthetic mappings describe how variables in the data are mapped to various aesthetic attributes.
  • Geometric objects, geoms for short, represent what is actually on the plot: points, lines, polygons, etc.
  • Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model. Stats are optional but very useful.
  • A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
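
The components listed above can be seen one at a time by building a plot in steps. The sketch below is only illustrative: it assumes the built-in mtcars data as a stand-in for a course data set, and the object names p0, p1 and p2 are introduced here.

```r
library(ggplot2)

# Assumption: the built-in mtcars data stands in for a course data set

# 1. Data and aesthetic mappings only: no layer has been added yet
p0 <- ggplot(mtcars, aes(x = wt, y = mpg))

# 2. Add a geom: the raw data drawn as points
p1 <- p0 + geom_point()

# 3. Add a layer driven by a stat: a linear-model summary of the
#    2D relationship, drawn with its confidence band
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ x)

print(p2)
```

Because each step returns a plot object, an intermediate plot such as p1 can be stored, inspected, and extended later. A faceting specification would be added with a further + in the same way.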

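The basic command for plotting is qplot(X, Y, data = <data name>) (quick plot!). Unlike the more common plot() command, qplot() can produce many other types of graphics by varying its geom argument. A few common values of geom are given below. Note that qplot() is deprecated as of ggplot2 3.4.0; the same graphics can be produced with ggplot() and the corresponding geom_*() functions.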

  • geom = "point" draws a scatterplot and is the default when both X and Y are supplied
  • geom = "smooth" fits a smoother to the data and displays the smooth and its standard error
  • geom = "boxplot" produces a box-and-whisker plot to summarise the distribution of a set of points

For continuous variables

  • geom = "histogram" draws a histogram
  • geom = "density" draws a density plot

For discrete variables

  • geom = "bar" produces a bar chart.
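
Each geom value above has a counterpart among the geom_*() functions used with ggplot(). The sketch below builds one plot per geom. It assumes the built-in mtcars data for illustration, and the list name plots is introduced here.

```r
library(ggplot2)

# Assumption: built-in mtcars data, used only for illustration
base_xy <- ggplot(mtcars, aes(x = wt, y = mpg))

plots <- list(
  # Two continuous variables
  point     = base_xy + geom_point(),
  smooth    = base_xy + geom_point() +
                geom_smooth(method = "loess", formula = y ~ x),
  # One continuous variable split by a grouping factor
  boxplot   = ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot(),
  # A single continuous variable
  histogram = ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10),
  density   = ggplot(mtcars, aes(x = mpg)) + geom_density(),
  # A single discrete variable
  bar       = ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
)

# Draw any one of them, for example the histogram
print(plots$histogram)
```

Keeping the plots in a named list is just a convenience for this sketch; each element is an ordinary ggplot object that can be printed or extended on its own.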

Aesthetics and faceting are two important features of ggplot2. Color, shape, size and other aesthetic arguments are used if observations coming from different subgroups are plotted on the same graph. Faceting takes an alternative approach: It creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset in an arrangement that facilitates comparison.
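
The two approaches can be compared side by side on the same data. In the sketch below, the first plot distinguishes subgroups by color and shape on a single panel, while the second splits the data into one panel per subgroup. The built-in mtcars data and the object names are assumptions made for illustration.

```r
library(ggplot2)

# Assumption: built-in mtcars data; cyl is converted to a factor so
# that it defines discrete subgroups
mt <- transform(mtcars, cyl = factor(cyl))

# Aesthetics: all subgroups on one panel, told apart by colour and shape
by_colour <- ggplot(mt, aes(x = wt, y = mpg, colour = cyl, shape = cyl)) +
  geom_point(size = 2)

# Faceting: the same scatterplot repeated once per subgroup
by_facet <- ggplot(mt, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

print(by_colour)
print(by_facet)
```

Aesthetics tend to work well when the subgroups overlap little, whereas faceting keeps heavily overlapping subgroups readable at the cost of smaller panels.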

From Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis, Springer.

R Markdown

R Markdown is an extremely useful facility that lets a user incorporate R code and its output directly in a document. For a comprehensive treatment of R Markdown and how to use it, you may consult R Markdown in the course STAT 485.