1.2 - Graphical Displays for Discrete Data
In the examples below, political party, sex, and general happiness are selected variables from the 2018 General Social Survey. Some of the original response categories were omitted or combined to simplify the interpretations; details are in the R code below.
Bar Plots
Not to be confused with a histogram, which displays continuous data, a bar plot is used for discrete or categorical data. For this reason, bar plots are typically displayed with gaps between the bars, unless certain groupings are to be emphasized. The height of each bar can represent either a frequency count or a proportion for the corresponding category.
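As a minimal sketch of the two height choices, the code below draws the same variable once as counts and once as proportions. The `party` vector is made-up toy data for illustration only, not the GSS responses.

```r
# Toy responses (made up for illustration; not the GSS data)
party <- c("Democrat", "Republican", "Independent", "Democrat",
           "Independent", "Democrat", "Republican", "Independent")

counts <- table(party)        # frequency count for each category
props  <- prop.table(counts)  # same categories, rescaled to sum to 1

# Draw the two versions side by side; only the vertical scale differs
op <- par(mfrow = c(1, 2))
barplot(counts, main = "Counts", ylab = "Frequency")
barplot(props, main = "Proportions", ylab = "Proportion")
par(op)
```

The shapes of the two plots are identical; the choice between them depends on whether the sample size or the relative share is of more interest.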
For nominal variables, such as Party ID and Sex, a simple bar plot is an effective way to illustrate the relative sizes of categories.
When plotting two variables together, one can be displayed in more of an explanatory role. Notice the difference in the way the following two plots are presenting the same data. The first is illustrating the distribution of Sex for each Party ID category, which puts Party ID in more of the explanatory role; the second is reversing these roles.
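The role swap can be sketched with a small two-way table. The counts in `tab` are made up for illustration (not the GSS data). For a matrix, `barplot` draws one bar per column and stacks the rows within it, so transposing the table with `t()` exchanges the roles of the two variables.

```r
# Toy two-way counts (made up for illustration; not the GSS data)
tab <- matrix(c(30, 20, 25, 25), nrow = 2,
              dimnames = list(sex = c("Female", "Male"),
                              party = c("Democrat", "Republican")))

# One bar per party, split by sex: party is in the explanatory role
barplot(tab, legend.text = TRUE, main = "Sex within Party ID")

# One bar per sex, split by party: sex is in the explanatory role
barplot(t(tab), legend.text = TRUE, main = "Party ID within Sex")

# The matching conditional proportions
sex_within_party <- prop.table(tab, margin = 2)  # each column sums to 1
party_within_sex <- prop.table(tab, margin = 1)  # each row sums to 1
```

Which margin to condition on is a modeling choice; the underlying counts are the same in both plots.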
In most software packages, the default ordering for bar plot categories is alphabetical, which is usually fine for nominal data, but we can (and should) change the order to better represent ordinal data. In the plot below, categories for Happy are sorted from least happiness to greatest happiness.
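A minimal sketch of the reordering, using a hypothetical `rating` variable with made-up values (chosen because its alphabetical order differs from its natural order): setting the `levels` argument of `factor` controls the order in which `table` and `barplot` display the categories.

```r
# Hypothetical ordinal responses (made up for illustration)
rating <- c("Medium", "Low", "High", "Medium", "Low", "Medium")

# Default: categories are sorted alphabetically (High, Low, Medium)
default_order <- names(table(rating))

# Specify the levels explicitly, from lowest to highest
rating_ord <- factor(rating, levels = c("Low", "Medium", "High"))
ordered_tab <- table(rating_ord)

barplot(ordered_tab, main = "Rating (ordered)")
```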
Mosaic Plots
A visual display particularly well suited to illustrating joint distributions of two (or more) discrete variables is the mosaic plot. Compared with the bar plot, category sizes in the mosaic plot more directly represent proportions of a whole; compare the figure below to the bar plot for Happy above. This part-of-a-whole impression can be misleading, however, if some categories are omitted. For this particular example, it should be understood that the additional responses of "No answer" and "Don't know" were possible but omitted for convenience.
In the case of two variables, the mosaic plot can illustrate their association. As with the bar plot above, one variable can play more of an explanatory role, depending on how the details are arranged. In the figure below, notice that the vertical division by sex is slightly off-center. This gives the marginal information for Sex (the proportion of females was greater in this sample). Sex also plays the role of the explanatory variable in this plot in that the distribution of Party ID is viewed within each sex category. Thus, we see that the proportion of Democrats is slightly higher among females than among males.
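To make the geometry concrete, here is a small sketch with made-up counts (not the GSS data). For a two-way table, `mosaicplot` splits the horizontal axis by the first variable, so the column widths correspond to its marginal proportions, and the tile heights within each column correspond to the conditional proportions of the second variable.

```r
# Toy two-way counts (made up for illustration; not the GSS data)
tab <- as.table(matrix(c(30, 20, 25, 25), nrow = 2,
                       dimnames = list(sex = c("Female", "Male"),
                                       party = c("Democrat", "Republican"))))

mosaicplot(tab, main = "Party ID within Sex", color = TRUE)

# Column widths: marginal distribution of the first variable (sex)
widths <- prop.table(margin.table(tab, 1))

# Tile heights within each column: distribution of party given sex
heights <- prop.table(tab, margin = 1)
```

In this toy table the Female column is wider (55 of 100 respondents), which is the kind of off-center division described above.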
R
The R code to recreate the plots above:
library(dplyr)
gss = read.csv(file.choose(), header=TRUE) # select "GSS.csv" when prompted
str(gss) # structure
# omit non-substantive responses ("No answer", "Don't know")
gss = gss[gss$partyid!="No answer",]
gss = gss[(gss$happy!="Don't know") & (gss$happy!="No answer"),]
# combine categories of partyid
gss$partyid = recode(gss$partyid,
  "Ind,near dem" = "Independent",
  "Ind,near rep" = "Independent",
  "Not str democrat" = "Democrat",
  "Strong democrat" = "Democrat",
  "Not str republican" = "Republican",
  "Strong republican" = "Republican")
# bar charts
party.tab = table(gss$partyid)
party.tab
prop.table(party.tab)
barplot(party.tab, main="Party ID")
two.tab = table(gss$sex, gss$partyid)
two.tab
prop.table(two.tab, margin=1) # row proportions
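# sex within each party (stacked, then side by side)
barplot(two.tab, legend.text=TRUE, main="Party ID vs Sex")
barplot(two.tab, legend.text=TRUE, main="Party ID vs Sex", beside=TRUE)
# roles reversed: party within each sex
barplot(table(gss$partyid, gss$sex), legend.text=TRUE, main="Sex vs Party ID")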
# ordered: list the levels of happy from least to greatest
gss$happy = factor(gss$happy,
  levels = c("Not too happy", "Pretty happy", "Very happy"))
happy.tab = table(gss$happy)
happy.tab
prop.table(happy.tab)
barplot(happy.tab, main="General happiness")
# mosaic plots
mosaicplot(happy.tab, main="General happiness")
dimnames(two.tab)
# shorten the Party ID labels (must match the order printed above)
dimnames(two.tab)[[2]] = c("Dem","Ind","Other","Rep")
mosaicplot(two.tab, main="Party ID vs Sex", color=TRUE)