Key Learning Goals for this Lesson:
Textbook reading: Consult Course Schedule
Exploratory Data Analysis (EDA) may also be described as data-driven hypothesis generation. Given a complex set of observations, often EDA provides the initial pointers towards various learning techniques. The data is examined for structures that may indicate deeper relationships among cases or variables.
In this lesson, we will focus on both aspects of EDA:
This course is based on the R software. Several attractive features of R make it a software of choice in academia as well as in industry.
Reference:
The following diagram shows that R has been gaining popularity in recent years: monthly programming discussion traffic shows explosive growth in discussions about R.
R also has a vibrant user community, and as a result it has the most websites linking to it.
R can be installed from the CRAN website https://www.r-project.org/ [1] following the instructions. Downloading R-Studio is strongly recommended. To develop familiarity with R it is suggested to follow through the material in Introduction to R [2]. For further information refer to the Course Syllabus. Other useful websites on R are https://stackoverflow.com/questions/tagged/r [3] and https://rseek.org/ [4].
One of the objectives of this course is to strengthen the basics in R. The R-Labs given in the textbook are followed closely. Along with the material in the text, two other features in R are introduced.
Anything that is observed or conceptualized falls under the purview of data. In a somewhat restricted view, data is something that can be measured. Data represent facts: things that have actually taken place and been observed and measured. Data may come out of passive observation or active collection. Each data point must be rooted in a physical, demographic or behavioral phenomenon, must be unambiguous, and must be measurable. Data is observed on each unit under study and stored in an electronic device.
Often these attributes are referred to as variables. Attributes contain information regarding each unit of observation. Depending on how many different types of information are collected from each unit, the data may be univariate, bivariate or multivariate.
Data can have varied forms and structures, but in one respect they are all the same: data contain information and characteristics that separate one unit or observation from the others.
Nominal: Qualitative variables that do not have a natural order, e.g. Hair color, Religion, Residence zipcode of a student.
Ordinal: Qualitative variables that have a natural order, e.g. Grades, Rating of a service rendered on a scale of 1-5 (1 is terrible and 5 is excellent), Street numbers in New York City.
Interval: Measurements where the difference between two values is meaningful, e.g. Calendar dates, Temperature in Celsius or Fahrenheit.
Ratio: Measurements where both difference and ratio are meaningful, e.g. Temperature in Kelvin, Length, Counts.
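In R, these four scales map naturally onto different data types. A brief sketch (the variable names and values are illustrative, not from the lesson):

```r
hair  <- factor(c("black", "brown", "black"))              # nominal: unordered categories
grade <- factor(c("B", "A", "C"),
                levels = c("C", "B", "A"), ordered = TRUE) # ordinal: ordered categories
temp_c <- c(20.5, 23.1, 18.9)    # interval: differences meaningful, zero arbitrary
count  <- c(0L, 4L, 7L)          # ratio: zero and ratios both meaningful

grade[1] < grade[2]   # TRUE: order comparisons are allowed for ordered factors
```

Ordered factors permit comparisons such as `<`, while unordered factors do not; this mirrors the nominal/ordinal distinction above.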
Discrete Attribute
A variable or attribute is discrete if it takes a finite or countably infinite set of values. A discrete variable is often represented as an integer-valued variable. A binary variable is a special case where the attribute can assume only two values, usually represented by 0 and 1. Examples of discrete variables are the number of birds in a flock and the number of heads realized when a coin is flipped 10 times.
Continuous Attribute
A variable or attribute is continuous if it can take any value in a given range, where the range may be infinite. Examples of continuous variables are the weights and heights of birds, the temperature of a day, etc.
In the hierarchy of data types, nominal data ranks lowest, as it carries the least information; ratio data ranks highest, as it carries the most. When analyzing data, note that procedures applicable to a lower data type can be applied to a higher one, but the reverse is not true. An analysis procedure for nominal data can be applied to interval data, but this is not recommended, since it ignores much of the information that interval data carries. Procedures developed for interval or ratio data, however, cannot be applied to nominal or ordinal data. A prudent analyst should recognize each data type and then decide which methods apply.
Vast amount of numbers on a large number of variables need to be properly organized to extract information from them. Broadly speaking there are two methods to summarize data: visual summarization and numerical summarization. Both have their advantages and disadvantages and applied jointly they will get the maximum information from raw data.
Summary statistics are numbers computed from the sample that present a summary of the attributes.
They are single numbers representing a set of observations. Measures of location also include measures of central tendency, which can be taken as the most representative values of the set of observations. The most common measures of location are the Mean, the Median, the Mode and the Quartiles.
Mean is the arithmetic average of all the observations: the sum of all observations divided by the sample size.
Median is the middle-most value of the ranked set of observations, so that half the observations are greater than the median and half are less. The median is a robust measure of central tendency.
Mode is the most frequently occurring value in the data set. It is most meaningful when the attribute is not continuous.
Quartiles are division points which split the rank-ordered data into four equal parts. The division points are called Q1 (the first quartile), Q2 (the second quartile, or median) and Q3 (the third quartile). They are not necessarily equidistant points on the range of the sample.
Similarly, deciles and percentiles are defined as division points that divide the rank-ordered data into 10 and 100 equal segments, respectively.
Note that the mean is very sensitive to outliers (extreme or unusual observations) whereas the median is not. The mean is affected if even a single observation is changed. The median, on the other hand, has a 50% breakdown point: unless 50% of the values in a sample change, the median will not change.
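The robustness of the median can be seen directly in R (the numbers are illustrative):

```r
x <- c(2, 4, 4, 5, 7, 9, 10)
mean(x)      # about 5.86
median(x)    # 5

# Corrupt a single observation with an extreme outlier:
y <- x
y[7] <- 1000
mean(y)      # jumps to about 147.3
median(y)    # still 5
```

A single corrupted value drags the mean far away, while the median is unchanged.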
Measures of location are not enough to capture all aspects of the attributes. Measures of dispersion are necessary to understand the variability of the data. The most common measures of dispersion are the Variance, the Standard Deviation, the Interquartile Range and the Range.
Variance measures how far data values lie from the mean. It is defined as the average of the squared differences between the mean and the individual data values.
Standard Deviation is the square root of the variance. It can be interpreted as a typical distance between the mean and the individual data values.
Interquartile range (IQR) is the difference between Q3 and Q1. IQR contains the middle 50% of data.
Range is the difference between the maximum and minimum values in the sample.
In addition to the measures of location and dispersion, the arrangement of the data, or the shape of the data distribution, is also of considerable interest. The most 'well-behaved' distribution is a symmetric distribution where the mean and the median coincide. The symmetry is lost if there is a tail in either direction. Skewness measures whether or not a distribution has a single long tail.
Skewness is measured as:
\[ \frac{\sqrt{n} \left( \Sigma \left(x_{i} - \bar{x} \right)^{3} \right)}{\left(\Sigma \left(x_{i} - \bar{x} \right)^{2}\right)^{\frac{3}{2}}}. \]
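The formula translates directly into R (a hand-rolled helper; packages such as e1071 also provide skewness functions):

```r
# Sample skewness exactly as defined above
skewness <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  sqrt(n) * sum(d^3) / sum(d^2)^(3/2)
}

skewness(c(1, 2, 3, 4, 5))   # 0: a perfectly symmetric sample
skewness(c(1, 1, 1, 10))     # positive: one long right tail
```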
The figure below gives examples of symmetric and skewed distributions. Note that these diagrams are generated from theoretical distributions and in practice one is likely to see only approximations.
[Figures: examples of symmetric and skewed distributions]
Calculate the answers to these questions then click the icon on the left to reveal the answer.
1. Suppose we have the data: 3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8. Calculate the following summary statistics:
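A sketch of how such summary statistics can be computed in R; note that quartile conventions differ slightly across software, and R's default "type 7" rule is used here:

```r
x <- c(3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8)

mean(x)      # about 5.09
median(x)    # 5
quantile(x)  # quartiles under R's default "type 7" rule
IQR(x)       # 4.5 under that rule
var(x)       # sample variance
sd(x)        # sample standard deviation
range(x)     # 0 10
```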
All the summary statistics above apply only to univariate data, where information on a single attribute is of interest. Correlation describes the degree of the linear relationship between two attributes, X and Y.
With X taking the values x(1), … , x(n) and Y taking the values y(1), … , y(n), the sample correlation coefficient is defined as:
\[\rho (X,Y)=\frac{\sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )\left ( y(i)-\bar{y} \right )}{\left( \sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )^2\sum_{i=1}^{n}\left ( y(i)-\bar{y} \right )^2\right)^\frac{1}{2}}\]
The correlation coefficient is always between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). If the correlation coefficient is 0, then there is no linear relationship between X and Y.
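The defining formula can be checked against R's built-in cor() on small illustrative vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Sample correlation coefficient from the formula above
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r_manual, cor(x, y))  # TRUE
```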
In the figure below a set of representative plots are shown for various values of the population correlation coefficient ρ ranging from - 1 to + 1. At the two extreme values the relation is a perfect straight line. As the value of ρ approaches 0, the elliptical shape becomes round and then it moves again towards an elliptical shape with the principal axis in the opposite direction.
Try the applet "CorrelationPicture" and "CorrelationPoints" from the University of Colorado at Boulder:
https://www.bolderstats.com/jmsl/doc/ [5]
Try the applet "Guess the Correlation" from the Rossman/Chance Applet Collection:
https://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html [6]
Distance or similarity measures are essential to many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature for comparing two data distributions. As the name suggests, a similarity measure quantifies how close two distributions are. For multivariate data, more complex summary methods have been developed to answer this question.
Similarity Measure
Dissimilarity Measure
Proximity refers to a similarity or dissimilarity.
Here, p and q are the attribute values for two data objects.
| Attribute Type | Similarity | Dissimilarity |
| --- | --- | --- |
| Nominal | \(s=\begin{cases} 1 & \text{ if } p=q \\ 0 & \text{ if } p\neq q \end{cases}\) | \(d=\begin{cases} 0 & \text{ if } p=q \\ 1 & \text{ if } p\neq q \end{cases}\) |
| Ordinal | \(s=1-\frac{\lvert p-q \rvert}{n-1}\) (values mapped to integers 0 to n-1, where n is the number of values) | \(d=\frac{\lvert p-q \rvert}{n-1}\) |
| Interval or Ratio | \(s=1-\lvert p-q \rvert\) or \(s=\frac{1}{1+\lvert p-q \rvert}\) | \(d=\lvert p-q \rvert\) |
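The ordinal row of the table can be sketched as a small helper (the function name and level names are illustrative, not from the lesson):

```r
# Ordinal proximity: map the n ordered values to the integers 0..(n-1),
# then scale the absolute difference by n - 1
ordinal_dissim <- function(p, q, levels) {
  n <- length(levels)
  abs(match(p, levels) - match(q, levels)) / (n - 1)
}

levs <- c("poor", "fair", "good", "excellent")   # n = 4 ordered values
d <- ordinal_dissim("poor", "good", levs)        # |0 - 2| / 3 = 2/3
s <- 1 - d                                       # similarity = 1/3
```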
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q and r (triangle inequality).
A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.
Assume that we have measurements xik, i = 1, … , N, on variables k = 1, … , p (also called attributes).
The Euclidean distance between the ith and jth objects is
\[d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]
for every pair (i, j) of observations.
The weighted Euclidean distance is
\[d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]
If scales of the attributes differ substantially, standardization is necessary.
The Minkowski distance is a generalization of the Euclidean distance.
With the measurement, xik , i = 1, … , N, k = 1, … , p, the Minkowski distance is
\[d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda}, \]
where λ ≥ 1. It is also called the Lλ metric.
\[ \lim_{\lambda \to \infty} \left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right) \]
Note that λ and p are two different parameters: p is the dimension of the data matrix, which remains finite, while λ is the order of the metric.
Let X be a N × p matrix. Then the ith row of X is
\[x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\]
The Mahalanobis distance is
\[d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\]
where Σ is the p × p sample covariance matrix.
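The distances above can be computed with base R's dist() and mahalanobis(); a small sketch on an illustrative 3 × 2 data matrix:

```r
X <- rbind(c(0, 0), c(3, 4), c(1, 1))

as.matrix(dist(X, method = "euclidean"))          # d_E(1, 2) = 5
as.matrix(dist(X, method = "minkowski", p = 1))   # lambda = 1: d(1, 2) = 7
as.matrix(dist(X, method = "maximum"))            # lambda -> infinity: d(1, 2) = 4

# Mahalanobis distance of each row from the column means;
# note that mahalanobis() returns *squared* distances
S <- cov(X)
mahalanobis(X, center = colMeans(X), cov = S)
```

Here the `p` argument of dist() is the Minkowski order λ, not the number of attributes.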
Calculate the answers to these questions by yourself and then click the icon on the left to reveal the answer.
1. We have \(X= \begin{pmatrix}
1 & 3 & 1 & 2 & 4\\
1 & 2 & 1 & 2 & 1\\
2 & 2 & 2 & 2 & 2
\end{pmatrix}\).
2. We have \(X= \begin{pmatrix}
2 & 3 \\
10 & 7 \\
3 & 2
\end{pmatrix}\).
Similarities have some well-known properties:
1. s(p, q) = 1 only if p = q (maximum similarity),
2. s(p, q) = s(q, p) for all p and q (symmetry).
The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.
Simple Matching and Jaccard Coefficients
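The two coefficients are standard: with binary vectors p and q, let f11 be the number of positions where both equal 1 and f00 where both equal 0. The simple matching coefficient counts all matches, SMC = (f11 + f00) / (number of positions), while the Jaccard coefficient ignores the 0-0 matches, J = f11 / (number of positions - f00). A sketch on illustrative vectors:

```r
p <- c(1, 0, 0, 1, 1, 0)
q <- c(1, 1, 0, 0, 1, 0)

f11 <- sum(p == 1 & q == 1)          # both 1: 2 positions
f00 <- sum(p == 0 & q == 0)          # both 0: 2 positions

smc     <- (f11 + f00) / length(p)   # (2 + 2) / 6 = 2/3
jaccard <- f11 / (length(p) - f00)   # 2 / 4 = 0.5
```

The Jaccard coefficient is preferred for sparse binary data, where shared absences would otherwise dominate the count.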
To understand thousands of rows of data in a limited time, there is no alternative to visual representation. The objective of visualization is to reveal hidden information through simple charts and diagrams. Visual representation of data is the first step toward data exploration and the formulation of analytical relationships among the variables. In a whirl of complex and voluminous data, visualization in one, two and three dimensions helps data analysts sift through data in a logical manner and understand its dynamics. It is instrumental in identifying patterns and relationships among groups of variables. Visualization techniques depend on the type of variable: techniques available for nominal variables are generally not suitable for visualizing continuous variables, and vice versa. Data often contains complex information that is easier to internalize visually. Graphs, charts and other visual representations provide quick and focused summarization.
Histograms are the most common graphical tool for representing continuous data. The range of the sample is plotted on the horizontal axis, and the frequencies or relative frequencies of each class on the vertical axis. The class width has an impact on the shape of the histogram. The histograms in the previous section were drawn from random samples generated from theoretical distributions. Here we consider a real example to construct histograms.
The data set used for this purpose is the Wage data that is included in the ISLR package in R. A full description of the data is given in the package. The following R code produces the figure below which illustrates the distribution of wage for all 3000 workers.
Sample R code for Distribution of Wage
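The original code listing is not reproduced here. With the ISLR package installed, the essence is `library(ISLR); hist(Wage$wage)`. The following self-contained sketch uses simulated stand-in data (3000 values with a small second mode in the right tail) so it runs without the package:

```r
set.seed(1)
wage <- c(rnorm(2800, mean = 105, sd = 25),   # main body of the distribution
          rnorm(200,  mean = 270, sd = 15))   # small hump in the right tail

pdf(NULL)   # null graphics device, so the sketch also runs non-interactively
hist(wage, breaks = 30, freq = FALSE,
     main = "Distribution of Wage", xlab = "Wage")
dev.off()
```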
The data is mostly symmetrically distributed but there is a small bimodality in the data which is indicated by a small hump towards the right tail of the distribution.
The data set contains a number of categorical variables, one of which is Race. A natural question is whether the wage distribution is the same across Race. Several libraries in R may be used to construct histograms across the levels of a categorical variable, along with many other sophisticated graphs and charts. One such library is ggplot2; details of this library's functionality are given in the R code below.
Sample R Code for Histogram by Race
In the following figures histograms are drawn for each Race separately.
Because of huge disparity among the counts of the different races, the above histograms may not be very informative. Code for an alternative visual display of the same information is shown below, followed by the plot.
Sample R Code for Histograms by Race
The second type of histogram also may not be the best way of presenting all the information. However, it brings further clarity to the small concentration at the right tail.
Boxplot is used to describe the shape of a data distribution and especially to identify outliers. Typically an observation is flagged as an outlier if it is less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where IQR is the interquartile range defined as Q3 - Q1. This rule often flags too many points as outliers, so sometimes only those points outside of [Q1 - 3 IQR, Q3 + 3 IQR] are identified as outliers.
Sample R Code for Boxplot of Distribution of Wage
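The original listing is elided; with ISLR installed it is essentially `boxplot(Wage$wage)`. A self-contained sketch, with the 1.5 IQR outlier rule spelled out on simulated stand-in data:

```r
set.seed(1)
wage <- c(rnorm(2800, 105, 25), rnorm(200, 270, 15))  # stand-in for Wage$wage

pdf(NULL)
boxplot(wage, horizontal = TRUE, main = "Boxplot of Wage")
dev.off()

# The 1.5 * IQR rule behind the whisker limits:
q   <- quantile(wage, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers <- wage[wage < q[1] - 1.5 * iqr | wage > q[2] + 1.5 * iqr]
length(outliers)   # the hump in the right tail shows up as outliers
```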
Here is the boxplot that results:
The boxplot of the Wage distribution clearly identifies many outliers. It is a reflection of the histogram depicting the distribution of Wage. The story is clearer from the boxplots drawn on the wage distribution for individual races. Here is the R code:
Sample R Code for Boxplot of Wage by Race
Here is the boxplot that results:
The most standard way to visualize the relation between two variables is a scatterplot. It shows the direction and strength of association between two variables but does not quantify it. Scatterplots also help to identify unusual observations. In the previous section (Section 1(b).2) a set of scatterplots was drawn for different values of the correlation coefficient; the data there were generated from a multivariate normal distribution with various values of the correlation parameter. Below is the R code used to obtain a scatterplot for these data:
Sample R Code for Relationship of Age and Wage
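The elided listing amounts to `plot(Wage$age, Wage$wage)` with ISLR loaded. A self-contained sketch with simulated stand-ins for Age and Wage (a weak positive dependence plus noise, loosely mimicking the plot described below):

```r
set.seed(1)
age  <- sample(18:80, 3000, replace = TRUE)
wage <- 80 + 0.3 * age + rnorm(3000, sd = 30)   # weak dependence of wage on age

pdf(NULL)
plot(age, wage, pch = 20, cex = 0.5,
     xlab = "Age", ylab = "Wage", main = "Wage vs Age")
dev.off()
```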
The following is the scatterplot of the variables Age and Wage for the Wage data.
It is clear from the scatterplot that Wage does not seem to depend on Age very strongly. However, a set of points towards the top is very different from the rest. A natural follow-up question is whether Race has any impact on the Age-Wage dependency, or the lack of it. Here is the R code and then the new plot:
Sample R Code for Relationship of Age and Wage by Race
We have noted before that the disproportionately high number of Whites in the data masks the effects of the other races. There does not seem to be any association between Age and Wage, controlling for Race.
Contour plots are useful when a continuous attribute is measured on a spatial grid. They partition the plane into regions of similar values; the contour lines that form the boundaries of these regions connect points with equal values. Contour plots have many applications in spatial statistics.
Applied to a bivariate density, contour lines join points of equal probability density; one may think of the contour lines as horizontal slices of the density. For a bivariate normal density the contours are concentric ellipses: the closer they are to circles (after standardizing the scales), the closer the variables are to independence, and the more elongated the ellipses, the farther the variables are from it. Note the conceptual similarity to the scatterplot series in Sec 1.(b).2. In the following plot, the two disjoint shapes in the innermost region indicate that a small part of the data is very different from the rest.
Here is the R code for the contour plot that follows:
Sample R Code for Contour Plot of Age and Wage
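A sketch of such a contour plot built with MASS::kde2d (the MASS package ships with R), again on simulated stand-ins for Age and Wage:

```r
library(MASS)   # for kde2d()

set.seed(1)
age  <- sample(18:80, 3000, replace = TRUE)
wage <- 80 + 0.3 * age + rnorm(3000, sd = 30)

dens <- kde2d(age, wage, n = 50)   # bivariate kernel density on a 50 x 50 grid

pdf(NULL)
contour(dens, xlab = "Age", ylab = "Wage")
dev.off()
```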
Displaying more than two variables on a single scatterplot is not possible. A scatterplot matrix is one way to visualize three or more continuous variables, two at a time.
The data set used to display scatterplot matrix is the College data that is included in the ISLR package. A full description of the data is given in the package. Here is the R code for the scatterplot matrix that follows:
Sample R Code for Scatterplot Matrix of College Attributes
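With ISLR installed, the elided listing is essentially a call to `pairs()` on numeric columns of the College data. As a self-contained stand-in, base R's pairs() on the built-in iris data:

```r
pdf(NULL)
pairs(iris[, 1:4],                       # all pairwise scatterplots
      pch = 20,
      col = as.integer(iris$Species))    # colour points by group
dev.off()
```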
An innovative way to present multiple dimensions in the same figure is the parallel coordinate system. Each dimension is represented by one coordinate and, instead of plotting coordinates at right angles to one another, the coordinates are placed side by side. The advantage of such an arrangement is that many different continuous and discrete variables can be handled within a parallel coordinate system; however, if the number of observations is too large, the profiles do not separate from one another and patterns may be missed.
The illustration below corresponds to the Auto data from the ISLR package. Only 35 cars are considered, but all dimensions are taken into account. The cars are different varieties of Toyota and Ford, categorized into two groups: produced before 1975, and produced in 1975 or after. The older models are represented by dotted lines and the newer cars by dashed lines; the Fords are drawn in blue and the Toyotas in pink. Here is the R code for the profile plot of this data:
Sample R Code for Profile Plot of Toyota and Ford Cars
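Parallel-coordinate (profile) plots are available as MASS::parcoord (MASS ships with R). The Auto subset used in the lesson is not reproduced; the built-in mtcars data serves as a stand-in, with line colour and type mapped to groups in the same spirit as the description above:

```r
library(MASS)   # for parcoord()

pdf(NULL)
parcoord(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")],
         col = ifelse(mtcars$cyl == 8, "blue", "pink"),   # colour by cylinder group
         lty = ifelse(mtcars$am == 1, 2, 3))              # dashed vs dotted lines
dev.off()
```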
The differences among the four groups are very clear from the figure. Early Ford models had 8 cylinders, were heavy, and had high horsepower and displacement; naturally they had low MPG and needed less time to accelerate. No Toyota belonged to this category. All the Toyota cars were built after 1975, have 4 cylinders (with one exception) and belong to the upper half of the MPG distribution. Note that only 35 cars are compared in the profile plot, so each car can be followed across all the attributes. Had the number of observations been higher, the distinction among the profiles would have been lost and the plot would not be informative.
Following are some interesting visualizations of multivariate data. In a star plot, stars are drawn according to rules defined on the attributes: each axis represents one attribute, and the solid lines show each item's value on that attribute. All attributes of the observations could be represented; however, for the sake of clarity on the graph only 10 attributes are chosen.
Again, the starplot follows the R code for generating the plot:
Sample R Code for Starplot of College Data
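Star plots are available in base R as stars(). A sketch on the built-in mtcars data as a stand-in for the College data, keeping a small number of observations and attributes for clarity:

```r
pdf(NULL)
stars(mtcars[1:10, 1:7],    # one star per car, one ray per attribute
      main = "Star plot: 10 cars on 7 attributes")
dev.off()
```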
Another interesting technique for multivariate data is the Chernoff face, where the attributes of each observation are used to draw different features of a cartoon face. Thirty colleges and universities from the College dataset are compared below.
Again, R code and then the plot follows:
Sample R Code for Comparison of Colleges and Universities
For comparing a small number of observations on up to 15 attributes, Chernoff faces are a useful technique. However, whether two items look more similar or less depends on interpretation.
This course requires a fair amount of R coding. The textbook takes the reader through R codes relevant for the chapter in a step-by-step manner. Sample R codes are also provided in Visualization section. In this section a brief introduction is given on a few of the important and useful features of R.
Introductions to R are available at https://onlinecourses.science.psu.edu/statprogram/node/50 [7] and https://cran.r-project.org/doc/manuals/R-intro.html [8] . There are many other online resources available for R. R users' groups are thriving and highly communicative. A few additional resources are mentioned in the Course Syllabus.
One of the most important features of R is its libraries, freely downloadable from the CRAN site. It is not possible to list ALL or even MOST R packages: the list is ever changing, as the R user community continuously builds and refines the available packages. The link below is a good starting point for a list of packages for data manipulation and visualization.
https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages [9]
R has many packages and plotting options for data visualization, but possibly none of them can produce as beautiful and as customizable statistical graphics as ggplot2 does. It is unlike most other graphics packages because it has a deep underlying grammar based on the Grammar of Graphics (Wilkinson, 2005). It is composed of a set of independent components that can be combined in many different ways. This makes ggplot2 very powerful: the user is not limited to a set of pre-specified graphics. Plots can be built up iteratively and edited later. The package is designed to work in a layered fashion, starting with a layer showing the raw data and then adding layers of annotations and statistical summaries.
The grammar of graphics is an answer to a question: what is a statistical graphic?
In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.
A brief description of the main components are as below:
The basic command for plotting is qplot(X, Y, data = <data name>) (quick plot!). Unlike the more common plot() command, qplot() can produce many other types of graphics by varying its geom argument. Examples of a few common geoms are given below.
For continuous variables
For discrete variables
Aesthetics and faceting are two important features of ggplot2. Color, shape, size and other aesthetic arguments are used if observations coming from different subgroups are plotted on the same graph. Faceting takes an alternative approach: It creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset in an arrangement that facilitates comparison.
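A sketch contrasting the two approaches (assumes ggplot2 is installed; note that recent ggplot2 releases deprecate qplot() in favour of the full ggplot() interface, which is used here):

```r
library(ggplot2)

# Aesthetics: subgroups distinguished by colour on a single panel
p1 <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

# Faceting: one small panel per subgroup, same axes for easy comparison
p2 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ class)
```

Printing p1 or p2 renders the plot; the objects can also be modified further by adding layers.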
From Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis, Springer.
R Markdown is an extremely useful facility which lets a user incorporate R code and output directly in a document. For a comprehensive introduction to Markdown and how to use it, you may consult R Markdown in the course STAT 485 [10].
Links:
[1] https://www.r-project.org/
[2] https://onlinecourses.science.psu.edu/statprogram/node/50
[3] https://stackoverflow.com/questions/tagged/r
[4] https://rseek.org/
[5] https://www.bolderstats.com/jmsl/doc/
[6] https://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html
[7] https://onlinecourses.science.psu.edu/statprogram/node/50
[8] https://cran.r-project.org/doc/manuals/R-intro.html
[9] https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
[10] https://onlinecourses.science.psu.edu/stat485/node/29