1(b).4 - R Scripts

This course requires a fair amount of R coding. The textbook walks the reader through the R code relevant to each chapter step by step, and sample R code is also provided in the Visualization section. This section gives a brief introduction to a few important and useful features of R.

Introductions to R are available at Statistical R Tutorials and CRAN R Project. Many other online resources for R are available, and R users' groups are thriving and highly communicative. A few additional resources are mentioned in the Course Syllabus.

One of the most important features of R is its libraries (packages). They can be downloaded freely from the CRAN site. It is not possible to list all, or even most, R packages: the list is ever changing, as the R user community continuously builds and refines the available packages. The link below is a good starting point for a list of packages for data manipulation and visualization.
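As a minimal sketch of how a package is obtained from CRAN and loaded, the snippet below installs ggplot2 only if it is missing. The mirror URL is an assumption chosen for illustration; any CRAN mirror should work.

```r
# Install ggplot2 from CRAN only if it is not already installed.
# The repos value is an assumed mirror; substitute your preferred one.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Attach the package so its functions can be called directly.
library(ggplot2)
```

Installation is needed once per R installation, whereas library() must be called in every new R session.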

R Studio Useful Packages

R Library: ggplot2

R has many packages and plotting options for data visualization, but possibly none of them produces statistical graphics as beautiful and as customizable as ggplot2 does. It is unlike most other graphics packages because it has a deep underlying grammar, based on the Grammar of Graphics (Wilkinson, 2005), made up of a set of independent components that can be combined in many different ways. This makes ggplot2 very powerful, because the user is not limited to a set of pre-specified graphics. Plots can be built up iteratively and edited later. The package is designed to work in a layered fashion, starting with a layer showing the raw data and then adding layers of annotations and statistical summaries.

The grammar of graphics is an answer to a question: what is a statistical graphic?

In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that makes up a graphic.

A brief description of the main components is given below:

  • The data and a set of aesthetic mappings describe how variables in the data are mapped to various aesthetic attributes.
  • Geometric objects, geoms for short, represent what is actually on the plot: points, lines, polygons, etc.
  • Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model. Stats are optional but very useful.
  • A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
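The components listed above can be sketched in a single layered plot. This is only an illustration: it assumes the mpg dataset that ships with ggplot2, and the choice of variables (displ, hwy, class, year) is arbitrary.

```r
library(ggplot2)

# Data and aesthetic mappings: engine displacement on x, highway mileage on y.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Geom: one point per car, with colour mapped to the vehicle class.
  geom_point(aes(colour = class)) +
  # Stat: summarise the 2d relationship with a linear model and its standard error band.
  geom_smooth(method = "lm", formula = y ~ x) +
  # Faceting: draw the same plot separately for each model year.
  facet_wrap(~ year)

print(p)
```

Because each component is added as a separate layer, any one of them can be removed or swapped without touching the others.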

The basic command for plotting is qplot(X, Y, data = <data name>) (quick plot!). Unlike the commonly used base plot() command, qplot() can produce many other types of graphics by varying the geom argument. Examples of a few common values of geom are given below.

  • geom = "point" draws a scatterplot; it is the default when both X and Y are supplied
  • geom = "smooth" fits a smoother to the data and displays the smooth and its standard error
  • geom = "boxplot" produces a box-and-whisker plot to summarise the distribution of a set of points

For continuous variables

  • geom = "histogram" draws a histogram
  • geom = "density" draws a density plot

For discrete variables

  • geom = "bar" produces a bar chart.
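The geoms listed above can be tried out as follows. This is a sketch that assumes the mpg dataset bundled with ggplot2. Since qplot() is deprecated as of ggplot2 3.4.0, the executable lines use the roughly equivalent layered ggplot() form, with the corresponding qplot() call shown in a comment.

```r
library(ggplot2)

# Two variables
# qplot(displ, hwy, data = mpg)                      (geom = "point" by default)
p_point   <- ggplot(mpg, aes(displ, hwy)) + geom_point()
# qplot(displ, hwy, data = mpg, geom = "smooth")
p_smooth  <- ggplot(mpg, aes(displ, hwy)) + geom_smooth()
# qplot(class, hwy, data = mpg, geom = "boxplot")
p_box     <- ggplot(mpg, aes(class, hwy)) + geom_boxplot()

# One continuous variable
# qplot(hwy, data = mpg, geom = "histogram", binwidth = 2)
p_hist    <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2)
# qplot(hwy, data = mpg, geom = "density")
p_density <- ggplot(mpg, aes(hwy)) + geom_density()

# One discrete variable
# qplot(class, data = mpg, geom = "bar")
p_bar     <- ggplot(mpg, aes(class)) + geom_bar()

# Plots are drawn only when printed.
print(p_hist)
```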

Aesthetics and faceting are two important features of ggplot2. Color, shape, size and other aesthetic arguments are used when observations from different subgroups are plotted on the same graph. Faceting takes an alternative approach: it creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset, in an arrangement that facilitates comparison.
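The two approaches can be contrasted on the same data. The sketch below assumes the mpg dataset bundled with ggplot2 and uses its drv variable (drive type) as the subgroup; both choices are illustrative.

```r
library(ggplot2)

# Aesthetic approach: all subgroups share one panel and are told apart by colour.
p_colour <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point()

# Faceting approach: one panel per subgroup, laid out side by side in a single row.
p_facet <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(. ~ drv)

print(p_colour)
print(p_facet)
```

A shared panel makes overlap between subgroups easy to see, while separate panels keep each subgroup's pattern uncluttered.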

From Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis, Springer.

R Markdown

R Markdown is an extremely useful facility in R that lets a user incorporate R code and its output directly in a document. For a comprehensive treatment of R Markdown and how to use it, you may consult R Markdown in the course STAT 485.
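To give a rough idea of what such a document looks like, the sketch below writes a minimal R Markdown file from R and renders it when the required tools are present. The file name, title and chunk contents are hypothetical and chosen only for illustration. Rendering assumes that the rmarkdown package and pandoc are installed.

```r
# Contents of a minimal R Markdown document: a YAML header, some Markdown
# text, and one R code chunk whose output is placed in the rendered document.
rmd_lines <- c(
  "---",
  "title: \"A minimal R Markdown sketch\"",
  "output: html_document",
  "---",
  "",
  "Text written in Markdown goes here.",
  "",
  "```{r}",
  "summary(cars)",
  "```"
)

# Write the document to a temporary location (hypothetical file name).
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Render to HTML only if the rmarkdown package and pandoc are available.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In practice the .Rmd file is usually written directly in an editor such as RStudio rather than generated from a script.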