Key Learning Goals for this Lesson:
Textbook reading: Consult Course Schedule
Exploratory Data Analysis (EDA) may also be described as data-driven hypothesis generation. Given a complex set of observations, often EDA provides the initial pointers towards various learning techniques. The data is examined for structures that may indicate deeper relationships among cases or variables.
In this lesson, we will focus on both aspects of EDA:
This course is based on the R software. Several attractive features of R make it a software of choice in academia as well as in industry.
Reference:
The following diagram shows that R has been gaining popularity in recent years: monthly programming discussion traffic shows explosive growth in discussions about R.
R also has a vibrant user community, and as a result it has the most websites linking to it.
R can be installed from the CRAN website https://www.r-project.org/ [1] following the instructions. Downloading R-Studio is strongly recommended. To develop familiarity with R it is suggested to follow through the material in Introduction to R [2]. For further information refer to the Course Syllabus. Other useful websites on R are https://stackoverflow.com/questions/tagged/r [3] and https://rseek.org/ [4].
One of the objectives of this course is to strengthen the basics in R. The R-Labs given in the textbook are followed closely. Along with the material in the text, two other features in R are introduced.
Anything that is observed or conceptualized falls under the purview of data. In a somewhat restricted view, data is something that can be measured. Data represent facts: things that have actually taken place and been observed and measured. Data may come out of passive observation or active collection. Each data point must be rooted in a physical, demographic or behavioral phenomenon, must be unambiguous, and must be measurable. Data is observed on each unit under study and stored in an electronic device.
Often these attributes are referred to as variables. Attributes contain information regarding each unit of observation. Depending on how many different types of information are collected from each unit, the data may be univariate, bivariate or multivariate.
Data can have varied forms and structures, but in one respect they are all the same: data contain information and characteristics that separate one unit or observation from the others.
Nominal: Qualitative variables that do not have a natural order, e.g. Hair color, Religion, Residence zipcode of a student.
Ordinal: Qualitative variables that have a natural order, e.g. Grades, Rating of a service rendered on a scale of 1-5 (1 is terrible and 5 is excellent), Street numbers in New York City.
Interval: Measurements where the difference between two values is meaningful, e.g. Calendar dates, Temperature in Celsius or Fahrenheit.
Ratio: Measurements where both difference and ratio are meaningful, e.g. Temperature in Kelvin, Length, Counts.
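In R, these four scales map naturally onto different data types. A brief sketch (the variable names and values are illustrative, not from the lesson):

```r
hair  <- factor(c("black", "brown", "black"))              # nominal: unordered categories
grade <- factor(c("B", "A", "C"),
                levels = c("C", "B", "A"), ordered = TRUE) # ordinal: ordered categories
temp_c <- c(20.5, 23.1, 18.9)    # interval: differences meaningful, zero arbitrary
count  <- c(0L, 4L, 7L)          # ratio: zero and ratios both meaningful

grade[1] < grade[2]   # TRUE: order comparisons are allowed for ordered factors
```

Ordered factors permit comparisons such as `<`, while unordered factors do not; this mirrors the nominal/ordinal distinction above.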
Discrete Attribute
A variable or attribute is discrete if it takes a finite or countably infinite set of values. A discrete variable is often represented as an integer-valued variable. A binary variable is a special case where the attribute can assume only two values, usually represented by 0 and 1. Examples of discrete variables are the number of birds in a flock and the number of heads realized when a coin is flipped 10 times.
Continuous Attribute
A variable or attribute is continuous if it can take any value in a given range, where the range may be infinite. Examples of continuous variables are the weights and heights of birds, the temperature of a day, etc.
In the hierarchy of data types, nominal data ranks lowest, as it carries the least information; ratio data ranks highest, as it carries the most. When analyzing data, note that procedures applicable to a lower data type can be applied to a higher one, but the reverse is not true. An analysis procedure for nominal data can be applied to interval data, but this is not recommended, since it ignores much of the information that interval data carries. Procedures developed for interval or ratio data, however, cannot be applied to nominal or ordinal data. A prudent analyst should recognize each data type and then decide which methods apply.
Vast amount of numbers on a large number of variables need to be properly organized to extract information from them. Broadly speaking there are two methods to summarize data: visual summarization and numerical summarization. Both have their advantages and disadvantages and applied jointly they will get the maximum information from raw data.
Summary statistics are numbers computed from the sample that present a summary of the attributes.
They are single numbers representing a set of observations. Measures of location also include measures of central tendency, which can be taken as the most representative values of the set of observations. The most common measures of location are the Mean, the Median, the Mode and the Quartiles.
Mean is the arithmetic average of all the observations: the sum of all observations divided by the sample size.
Median is the middle-most value of the ranked set of observations, so that half the observations are greater than the median and half are less. The median is a robust measure of central tendency.
Mode is the most frequently occurring value in the data set. It is most meaningful when the attribute is not continuous.
Quartiles are division points which split the rank-ordered data into four equal parts. The division points are called Q1 (the first quartile), Q2 (the second quartile, or median) and Q3 (the third quartile). They are not necessarily equidistant points on the range of the sample.
Similarly, deciles and percentiles are defined as division points that divide the rank-ordered data into 10 and 100 equal segments, respectively.
Note that the mean is very sensitive to outliers (extreme or unusual observations) whereas the median is not. The mean is affected if even a single observation is changed. The median, on the other hand, has a 50% breakdown point: unless 50% of the values in a sample change, the median will not change.
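The robustness of the median can be seen directly in R (the numbers are illustrative):

```r
x <- c(2, 4, 4, 5, 7, 9, 10)
mean(x)      # about 5.86
median(x)    # 5

# Corrupt a single observation with an extreme outlier:
y <- x
y[7] <- 1000
mean(y)      # jumps to about 147.3
median(y)    # still 5
```

A single corrupted value drags the mean far away, while the median is unchanged.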
Measures of location are not enough to capture all aspects of the attributes. Measures of dispersion are necessary to understand the variability of the data. The most common measures of dispersion are the Variance, the Standard Deviation, the Interquartile Range and the Range.
Variance measures how far data values lie from the mean. It is defined as the average of the squared differences between the mean and the individual data values.
Standard Deviation is the square root of the variance. It can be interpreted as a typical distance between the mean and the individual data values.
Interquartile range (IQR) is the difference between Q3 and Q1. IQR contains the middle 50% of data.
Range is the difference between the maximum and minimum values in the sample.
In addition to the measures of location and dispersion, the arrangement of the data, or the shape of the data distribution, is also of considerable interest. The most 'well-behaved' distribution is a symmetric distribution where the mean and the median coincide. The symmetry is lost if there is a tail in either direction. Skewness measures whether or not a distribution has a single long tail.
Skewness is measured as:
\[ \frac{\sqrt{n} \left( \Sigma \left(x_{i} - \bar{x} \right)^{3} \right)}{\left(\Sigma \left(x_{i} - \bar{x} \right)^{2}\right)^{\frac{3}{2}}}. \]
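The formula translates directly into R (a hand-rolled helper; packages such as e1071 also provide skewness functions):

```r
# Sample skewness exactly as defined above
skewness <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  sqrt(n) * sum(d^3) / sum(d^2)^(3/2)
}

skewness(c(1, 2, 3, 4, 5))   # 0: a perfectly symmetric sample
skewness(c(1, 1, 1, 10))     # positive: one long right tail
```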
The figure below gives examples of symmetric and skewed distributions. Note that these diagrams are generated from theoretical distributions and in practice one is likely to see only approximations.
[Figures: examples of symmetric and skewed distributions]
Calculate the answers to these questions then click the icon on the left to reveal the answer.
1. Suppose we have the data: 3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8. Calculate the following summary statistics:
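A sketch of how such summary statistics can be computed in R; note that quartile conventions differ slightly across software, and R's default "type 7" rule is used here:

```r
x <- c(3, 5, 6, 9, 0, 10, 1, 3, 7, 4, 8)

mean(x)      # about 5.09
median(x)    # 5
quantile(x)  # quartiles under R's default "type 7" rule
IQR(x)       # 4.5 under that rule
var(x)       # sample variance
sd(x)        # sample standard deviation
range(x)     # 0 10
```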
All the summary statistics above apply only to univariate data, where information on a single attribute is of interest. Correlation describes the degree of the linear relationship between two attributes, X and Y.
With X taking the values x(1), … , x(n) and Y taking the values y(1), … , y(n), the sample correlation coefficient is defined as:
\[\rho (X,Y)=\frac{\sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )\left ( y(i)-\bar{y} \right )}{\left( \sum_{i=1}^{n}\left ( x(i)-\bar{x} \right )^2\sum_{i=1}^{n}\left ( y(i)-\bar{y} \right )^2\right)^\frac{1}{2}}\]
The correlation coefficient is always between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). If the correlation coefficient is 0, then there is no linear relationship between X and Y.
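The defining formula can be checked against R's built-in cor() on small illustrative vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Sample correlation coefficient from the formula above
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r_manual, cor(x, y))  # TRUE
```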
In the figure below a set of representative plots are shown for various values of the population correlation coefficient ρ ranging from - 1 to + 1. At the two extreme values the relation is a perfect straight line. As the value of ρ approaches 0, the elliptical shape becomes round and then it moves again towards an elliptical shape with the principal axis in the opposite direction.
Try the applet "CorrelationPicture" and "CorrelationPoints" from the University of Colorado at Boulder:
https://www.bolderstats.com/jmsl/doc/ [5]
Try the applet "Guess the Correlation" from the Rossman/Chance Applet Collection:
https://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html [6]
Distance or similarity measures are essential to many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature for comparing two data distributions. As the name suggests, a similarity measure quantifies how close two distributions are. For multivariate data, more complex summary methods have been developed to answer this question.
Similarity Measure
Dissimilarity Measure
Proximity refers to a similarity or dissimilarity.
Here, p and q are the attribute values for two data objects.
| Attribute Type | Similarity | Dissimilarity |
| --- | --- | --- |
| Nominal | \(s=\begin{cases} 1 & \text{ if } p=q \\ 0 & \text{ if } p\neq q \end{cases}\) | \(d=\begin{cases} 0 & \text{ if } p=q \\ 1 & \text{ if } p\neq q \end{cases}\) |
| Ordinal | \(s=1-\frac{\lvert p-q \rvert}{n-1}\) (values mapped to integers 0 to n-1, where n is the number of values) | \(d=\frac{\lvert p-q \rvert}{n-1}\) |
| Interval or Ratio | \(s=1-\lvert p-q \rvert\) or \(s=\frac{1}{1+\lvert p-q \rvert}\) | \(d=\lvert p-q \rvert\) |
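The ordinal row of the table can be sketched as a small helper (the function name and level names are illustrative, not from the lesson):

```r
# Ordinal proximity: map the n ordered values to the integers 0..(n-1),
# then scale the absolute difference by n - 1
ordinal_dissim <- function(p, q, levels) {
  n <- length(levels)
  abs(match(p, levels) - match(q, levels)) / (n - 1)
}

levs <- c("poor", "fair", "good", "excellent")   # n = 4 ordered values
d <- ordinal_dissim("poor", "good", levs)        # |0 - 2| / 3 = 2/3
s <- 1 - d                                       # similarity = 1/3
```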
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q and r (triangle inequality).
A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.
Assume that we have measurements xik, i = 1, … , N, on variables k = 1, … , p (also called attributes).
The Euclidean distance between the ith and jth objects is
\[d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]
for every pair (i, j) of observations.
The weighted Euclidean distance is
\[d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\]
If scales of the attributes differ substantially, standardization is necessary.
The Minkowski distance is a generalization of the Euclidean distance.
With the measurement, xik , i = 1, … , N, k = 1, … , p, the Minkowski distance is
\[d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda}, \]
where λ ≥ 1. It is also called the Lλ metric.
\[ \lim_{\lambda \to \infty} \left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right) \]
Note that λ and p are two different parameters: p is the dimension of the data matrix, which remains finite, while λ is the order of the metric.
Let X be a N × p matrix. Then the ith row of X is
\[x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\]
The Mahalanobis distance is
\[d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\]
where Σ is the p × p sample covariance matrix.
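The distances above can be computed with base R's dist() and mahalanobis(); a small sketch on an illustrative 3 × 2 data matrix:

```r
X <- rbind(c(0, 0), c(3, 4), c(1, 1))

as.matrix(dist(X, method = "euclidean"))          # d_E(1, 2) = 5
as.matrix(dist(X, method = "minkowski", p = 1))   # lambda = 1: d(1, 2) = 7
as.matrix(dist(X, method = "maximum"))            # lambda -> infinity: d(1, 2) = 4

# Mahalanobis distance of each row from the column means;
# note that mahalanobis() returns *squared* distances
S <- cov(X)
mahalanobis(X, center = colMeans(X), cov = S)
```

Here the `p` argument of dist() is the Minkowski order λ, not the number of attributes.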
Calculate the answers to these questions by yourself and then click the icon on the left to reveal the answer.
1. We have \(X= \begin{pmatrix}
1 & 3 & 1 & 2 & 4\\
1 & 2 & 1 & 2 & 1\\
2 & 2 & 2 & 2 & 2
\end{pmatrix}\).
2. We have \(X= \begin{pmatrix}
2 & 3 \\
10 & 7 \\
3 & 2
\end{pmatrix}\).
Similarities have some well-known properties:
1. s(p, q) = 1 only if p = q (maximum similarity),
2. s(p, q) = s(q, p) for all p and q (symmetry).
The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.
Simple Matching and Jaccard Coefficients
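The two coefficients are standard: with binary vectors p and q, let f11 be the number of positions where both equal 1 and f00 where both equal 0. The simple matching coefficient counts all matches, SMC = (f11 + f00) / (number of positions), while the Jaccard coefficient ignores the 0-0 matches, J = f11 / (number of positions - f00). A sketch on illustrative vectors:

```r
p <- c(1, 0, 0, 1, 1, 0)
q <- c(1, 1, 0, 0, 1, 0)

f11 <- sum(p == 1 & q == 1)          # both 1: 2 positions
f00 <- sum(p == 0 & q == 0)          # both 0: 2 positions

smc     <- (f11 + f00) / length(p)   # (2 + 2) / 6 = 2/3
jaccard <- f11 / (length(p) - f00)   # 2 / 4 = 0.5
```

The Jaccard coefficient is preferred for sparse binary data, where shared absences would otherwise dominate the count.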
To understand thousands of rows of data in a limited time, there is no alternative to visual representation. The objective of visualization is to reveal hidden information through simple charts and diagrams. Visual representation of data is the first step toward data exploration and the formulation of analytical relationships among the variables. In a whirl of complex and voluminous data, visualization in one, two and three dimensions helps data analysts sift through data in a logical manner and understand its dynamics. It is instrumental in identifying patterns and relationships among groups of variables. Visualization techniques depend on the type of variable: techniques available for nominal variables are generally not suitable for visualizing continuous variables, and vice versa. Data often contains complex information that is easier to internalize visually. Graphs, charts and other visual representations provide quick and focused summarization.
Histograms are the most common graphical tool for representing continuous data. The range of the sample is plotted on the horizontal axis, and the frequencies or relative frequencies of each class on the vertical axis. The class width has an impact on the shape of the histogram. The histograms in the previous section were drawn from random samples generated from theoretical distributions. Here we consider a real example to construct histograms.
The data set used for this purpose is the Wage data that is included in the ISLR package in R. A full description of the data is given in the package. The following R code produces the figure below which illustrates the distribution of wage for all 3000 workers.
Sample R code for Distribution of Wage
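The original code listing is not reproduced here. With the ISLR package installed, the essence is `library(ISLR); hist(Wage$wage)`. The following self-contained sketch uses simulated stand-in data (3000 values with a small second mode in the right tail) so it runs without the package:

```r
set.seed(1)
wage <- c(rnorm(2800, mean = 105, sd = 25),   # main body of the distribution
          rnorm(200,  mean = 270, sd = 15))   # small hump in the right tail

pdf(NULL)   # null graphics device, so the sketch also runs non-interactively
hist(wage, breaks = 30, freq = FALSE,
     main = "Distribution of Wage", xlab = "Wage")
dev.off()
```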
The data is mostly symmetrically distributed but there is a small bimodality in the data which is indicated by a small hump towards the right tail of the distribution.
The data set contains a number of categorical variables, one of which is Race. A natural question is whether the wage distribution is the same across Race. Several libraries in R may be used to construct histograms across the levels of a categorical variable, along with many other sophisticated graphs and charts. One such library is ggplot2; details of this library's functionality are given in the R code below.
Sample R Code for Histogram by Race
In the following figures histograms are drawn for each Race separately.
Because of huge disparity among the counts of the different races, the above histograms may not be very informative. Code for an alternative visual display of the same information is shown below, followed by the plot.
Sample R Code for Histograms by Race
The second type of histogram also may not be the best way of presenting all the information. However, it brings further clarity to the small concentration at the right tail.
Boxplot is used to describe the shape of a data distribution and especially to identify outliers. Typically an observation is flagged as an outlier if it is less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where IQR is the interquartile range defined as Q3 - Q1. This rule often flags too many points as outliers, so sometimes only those points outside of [Q1 - 3 IQR, Q3 + 3 IQR] are identified as outliers.
Sample R Code for Boxplot of Distribution of Wage
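The original listing is elided; with ISLR installed it is essentially `boxplot(Wage$wage)`. A self-contained sketch, with the 1.5 IQR outlier rule spelled out on simulated stand-in data:

```r
set.seed(1)
wage <- c(rnorm(2800, 105, 25), rnorm(200, 270, 15))  # stand-in for Wage$wage

pdf(NULL)
boxplot(wage, horizontal = TRUE, main = "Boxplot of Wage")
dev.off()

# The 1.5 * IQR rule behind the whisker limits:
q   <- quantile(wage, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers <- wage[wage < q[1] - 1.5 * iqr | wage > q[2] + 1.5 * iqr]
length(outliers)   # the hump in the right tail shows up as outliers
```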
Here is the boxplot that results:
The boxplot of the Wage distribution clearly identifies many outliers. It is a reflection of the histogram depicting the distribution of Wage. The story is clearer from the boxplots drawn on the wage distribution for individual races. Here is the R code:
Sample R Code for Boxplot of Wage by Race
Here is the boxplot that results:
The most standard way to visualize the relation between two variables is a scatterplot. It shows the direction and strength of association between two variables but does not quantify it. Scatterplots also help to identify unusual observations. In the previous section (Section 1(b).2) a set of scatterplots was drawn for different values of the correlation coefficient; the data there were generated from a multivariate normal distribution with various values of the correlation parameter. Below is the R code used to obtain a scatterplot for these data:
Sample R Code for Relationship of Age and Wage
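The elided listing amounts to `plot(Wage$age, Wage$wage)` with ISLR loaded. A self-contained sketch with simulated stand-ins for Age and Wage (a weak positive dependence plus noise, loosely mimicking the plot described below):

```r
set.seed(1)
age  <- sample(18:80, 3000, replace = TRUE)
wage <- 80 + 0.3 * age + rnorm(3000, sd = 30)   # weak dependence of wage on age

pdf(NULL)
plot(age, wage, pch = 20, cex = 0.5,
     xlab = "Age", ylab = "Wage", main = "Wage vs Age")
dev.off()
```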
The following is the scatterplot of the variables Age and Wage for the Wage data.
It is clear from the scatterplot that Wage does not seem to depend on Age very strongly. However, a set of points towards the top is very different from the rest. A natural follow-up question is whether Race has any impact on the Age-Wage dependency, or the lack of it. Here is the R code and then the new plot:
Sample R Code for Relationship of Age and Wage by Race
We have noted before that the disproportionately high number of Whites in the data masks the effects of the other races. There does not seem to be any association between Age and Wage, controlling for Race.
Contour plots are useful when a continuous attribute is measured on a spatial grid. They partition the plane into regions of similar values; the contour lines that form the boundaries of these regions connect points with equal values. Contour plots have many applications in spatial statistics.
Applied to a bivariate density, contour lines join points of equal probability density; one may think of the contour lines as horizontal slices of the density. For a bivariate normal density the contours are concentric ellipses: the closer they are to circles (after standardizing the scales), the closer the variables are to independence, and the more elongated the ellipses, the farther the variables are from it. Note the conceptual similarity to the scatterplot series in Sec 1.(b).2. In the following plot, the two disjoint shapes in the innermost region indicate that a small part of the data is very different from the rest.
Here is the R code for the contour plot that follows:
Sample R Code for Contour Plot of Age and Wage
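A sketch of such a contour plot built with MASS::kde2d (the MASS package ships with R), again on simulated stand-ins for Age and Wage:

```r
library(MASS)   # for kde2d()

set.seed(1)
age  <- sample(18:80, 3000, replace = TRUE)
wage <- 80 + 0.3 * age + rnorm(3000, sd = 30)

dens <- kde2d(age, wage, n = 50)   # bivariate kernel density on a 50 x 50 grid

pdf(NULL)
contour(dens, xlab = "Age", ylab = "Wage")
dev.off()
```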
Displaying more than two variables on a single scatterplot is not possible. A scatterplot matrix is one way to visualize three or more continuous variables, two at a time.
The data set used to display scatterplot matrix is the College data that is included in the ISLR package. A full description of the data is given in the package. Here is the R code for the scatterplot matrix that follows:
Sample R Code for Scatterplot Matrix of College Attributes
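With ISLR installed, the elided listing is essentially a call to `pairs()` on numeric columns of the College data. As a self-contained stand-in, base R's pairs() on the built-in iris data:

```r
pdf(NULL)
pairs(iris[, 1:4],                       # all pairwise scatterplots
      pch = 20,
      col = as.integer(iris$Species))    # colour points by group
dev.off()
```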
An innovative way to present multiple dimensions in the same figure is the parallel coordinate system. Each dimension is represented by one coordinate and, instead of plotting coordinates at right angles to one another, the coordinates are placed side by side. The advantage of such an arrangement is that many different continuous and discrete variables can be handled within a parallel coordinate system; however, if the number of observations is too large, the profiles do not separate from one another and patterns may be missed.
The illustration below corresponds to the Auto data from the ISLR package. Only 35 cars are considered, but all dimensions are taken into account. The cars are different varieties of Toyota and Ford, categorized into two groups: produced before 1975, and produced in 1975 or after. The older models are represented by dotted lines and the newer cars by dashed lines; the Fords are drawn in blue and the Toyotas in pink. Here is the R code for the profile plot of this data:
Sample R Code for Profile Plot of Toyota and Ford Cars
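Parallel-coordinate (profile) plots are available as MASS::parcoord (MASS ships with R). The Auto subset used in the lesson is not reproduced; the built-in mtcars data serves as a stand-in, with line colour and type mapped to groups in the same spirit as the description above:

```r
library(MASS)   # for parcoord()

pdf(NULL)
parcoord(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")],
         col = ifelse(mtcars$cyl == 8, "blue", "pink"),   # colour by cylinder group
         lty = ifelse(mtcars$am == 1, 2, 3))              # dashed vs dotted lines
dev.off()
```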
The differences among the four groups are very clear from the figure. Early Ford models had 8 cylinders, were heavy, and had high horsepower and displacement; naturally they had low MPG and needed less time to accelerate. No Toyota belonged to this category. All the Toyota cars were built after 1975, have 4 cylinders (with one exception) and belong to the upper half of the MPG distribution. Note that only 35 cars are compared in the profile plot, so each car can be followed across all the attributes. Had the number of observations been higher, the distinction among the profiles would have been lost and the plot would not be informative.
Following are some interesting visualizations of multivariate data. In a star plot, stars are drawn according to rules defined on the attributes: each axis represents one attribute, and the solid lines show each item's value on that attribute. All attributes of the observations could be represented; however, for the sake of clarity on the graph only 10 attributes are chosen.
Again, the starplot follows the R code for generating the plot:
Sample R Code for Starplot of College Data
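Star plots are available in base R as stars(). A sketch on the built-in mtcars data as a stand-in for the College data, keeping a small number of observations and attributes for clarity:

```r
pdf(NULL)
stars(mtcars[1:10, 1:7],    # one star per car, one ray per attribute
      main = "Star plot: 10 cars on 7 attributes")
dev.off()
```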
Another interesting technique for multivariate data is the Chernoff face, where the attributes of each observation are used to draw different features of a cartoon face. Thirty colleges and universities from the College dataset are compared below.
Again, R code and then the plot follows:
Sample R Code for Comparison of Colleges and Universities
For comparing a small number of observations on up to 15 attributes, Chernoff faces are a useful technique. However, whether two items look more similar or less depends on interpretation.
This course requires a fair amount of R coding. The textbook takes the reader through R codes relevant for the chapter in a step-by-step manner. Sample R codes are also provided in Visualization section. In this section a brief introduction is given on a few of the important and useful features of R.
Introductions to R are available at https://onlinecourses.science.psu.edu/statprogram/node/50 [7] and https://cran.r-project.org/doc/manuals/R-intro.html [8] . There are many other online resources available for R. R users' groups are thriving and highly communicative. A few additional resources are mentioned in the Course Syllabus.
One of the most important features of R is its libraries, freely downloadable from the CRAN site. It is not possible to list ALL or even MOST R packages: the list is ever changing, as the R user community continuously builds and refines the available packages. The link below is a good starting point for a list of packages for data manipulation and visualization.
https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages [9]
R has many packages and plotting options for data visualization, but possibly none of them can produce as beautiful and as customizable statistical graphics as ggplot2 does. It is unlike most other graphics packages because it has a deep underlying grammar based on the Grammar of Graphics (Wilkinson, 2005). It is composed of a set of independent components that can be combined in many different ways. This makes ggplot2 very powerful: the user is not limited to a set of pre-specified graphics. Plots can be built up iteratively and edited later. The package is designed to work in a layered fashion, starting with a layer showing the raw data and then adding layers of annotations and statistical summaries.
The grammar of graphics is an answer to a question: what is a statistical graphic?
In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic.
A brief description of the main components are as below:
The basic command for plotting is qplot(X, Y, data = <data name>) (quick plot!). Unlike the more common plot() command, qplot() can produce many other types of graphics by varying its geom argument. Examples of a few common geoms are given below.
For continuous variables
For discrete variables
Aesthetics and faceting are two important features of ggplot2. Color, shape, size and other aesthetic arguments are used if observations coming from different subgroups are plotted on the same graph. Faceting takes an alternative approach: It creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset in an arrangement that facilitates comparison.
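A sketch contrasting the two approaches (assumes ggplot2 is installed; note that recent ggplot2 releases deprecate qplot() in favour of the full ggplot() interface, which is used here):

```r
library(ggplot2)

# Aesthetics: subgroups distinguished by colour on a single panel
p1 <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

# Faceting: one small panel per subgroup, same axes for easy comparison
p2 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ class)
```

Printing p1 or p2 renders the plot; the objects can also be modified further by adding layers.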
From Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis, Springer.
R Markdown is an extremely useful facility which lets a user incorporate R code and output directly in a document. For a comprehensive introduction to Markdown and how to use it, you may consult R Markdown in the course STAT 485 [10].
Links:
[1] https://www.r-project.org/
[2] https://onlinecourses.science.psu.edu/statprogram/node/50
[3] https://stackoverflow.com/questions/tagged/r
[4] https://rseek.org/
[5] https://www.bolderstats.com/jmsl/doc/
[6] https://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html
[7] https://onlinecourses.science.psu.edu/statprogram/node/50
[8] https://cran.r-project.org/doc/manuals/R-intro.html
[9] https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
[10] https://onlinecourses.science.psu.edu/stat485/node/29