Lesson 1(b): Exploratory Data Analysis (EDA)

Overview

Exploratory Data Analysis (EDA) may also be described as data-driven hypothesis generation. Given a complex set of observations, EDA often provides the initial pointers toward suitable learning techniques. The data are examined for structures that may indicate deeper relationships among cases or variables.

In this lesson, we will focus on both aspects of EDA:

  • Numerical summarization
  • Data Visualization
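As a minimal sketch of these two aspects, the example below summarizes and plots one variable from mtcars, a data set that ships with R. The choice of data set and of the variables mpg, wt, and cyl is an assumption made only for illustration; it is not taken from the lesson material.

```r
# Built-in example data (an illustrative choice, not from the lesson).
data(mtcars)

# --- Numerical summarization ---
num_summary <- summary(mtcars$mpg)                    # min, quartiles, mean, max
mpg_sd      <- sd(mtcars$mpg)                         # spread of mpg
mpg_by_cyl  <- tapply(mtcars$mpg, mtcars$cyl, mean)   # mean mpg per cylinder count
cor_mpg_wt  <- cor(mtcars$mpg, mtcars$wt)             # linear association of mpg and weight

print(num_summary)
print(mpg_by_cyl)

# --- Data visualization (base graphics) ---
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

The numerical summaries and the plots answer the same questions from different angles: the group means and the boxplot both compare mpg across cylinder counts, while the correlation and the scatterplot both describe how mpg varies with weight.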

This course uses R. Several attractive features make R a software of choice in both academia and industry.

  • R is open-source software and is free to download.
  • R is supported by more than 3,000 packages for handling large volumes of data in a wide variety of applications, in addition to a rich set of built-in functions. For instance, the svd() function performs a singular value decomposition in a single line of code, which takes considerably more effort in a language such as C or Java without a numerical library.
  • R is quite versatile. After an algorithm is developed in R, the program may be sped up by translating the R code into another language.
  • R is a mainstream analytical tool.
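The single-call singular value decomposition mentioned in the list above can be sketched as follows. The matrix A is a hypothetical example invented for this illustration.

```r
# A small hypothetical 4 x 3 matrix, chosen only for illustration.
A <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4,
              1, 1, 1), nrow = 4, byrow = TRUE)

# svd() returns a list with the singular values d and the matrices u and v.
s <- svd(A)

# Check the factorization by rebuilding A as U %*% diag(d) %*% t(V).
A_hat <- s$u %*% diag(s$d) %*% t(s$v)

print(s$d)                    # singular values, largest first
print(all.equal(A, A_hat))    # compares A and A_hat up to numerical tolerance
```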


Related reading:

  • The Popularity of Data Analysis Software by R.A. Muenchen
  • R You Ready for R? by Ashlee Vance
  • R Programming for Data Science by Roger Peng

The following diagram shows that R has been gaining popularity in recent times: monthly programming discussion traffic shows explosive growth in discussions about R.

R has a vibrant user community. As a result, R has the most website links pointing to it.

R can be installed from the CRAN website (R-Project) by following the instructions there. Downloading RStudio is strongly recommended. To become familiar with R, work through the material in Introduction to R. For further information, refer to the Course Syllabus. Other useful websites on R are Stack Overflow R Questions and R Seek.

One of the objectives of this course is to strengthen your basic skills in R. The course closely follows the R labs given in the textbook. Along with the material in the text, two other R features are introduced:

  • R Markdown: This allows users to knit R code and its output directly into a document.
  • R library ggplot2: A very useful and sophisticated set of plotting functions for producing high-quality graphs.
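As a rough illustration of the ggplot2 style of plotting, the sketch below draws a scatterplot from the built-in mtcars data. It assumes ggplot2 has already been installed (for example with install.packages("ggplot2")); the data set and the variables plotted are illustrative choices, not taken from the textbook labs.

```r
# Assumes the ggplot2 package is installed; the guard below skips the plot otherwise.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Map weight to x, mpg to y, and cylinder count to colour, then add points.
  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

  print(p)   # render the plot
} else {
  p <- NULL
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first.")
}
```

Placed inside an R code chunk of an R Markdown (.Rmd) file, the same code is run when the document is knitted and the resulting plot is embedded in the output.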


Upon successful completion of this lesson, you should be able to:

  • Demonstrate familiarity with R software.
  • Apply numerical and visual summarization to data.
  • Explain the importance of EDA before embarking on sophisticated model building.