Objective 1.2
Discrete data is often referred to as categorical data because of the way observations can be collected into categories. Variables producing such data can be of any of the following types:
- Nominal (e.g., gender, ethnic background, religious or political affiliation)
- Ordinal (e.g., extent of agreement, school letter grades)
- Quantitative variables with relatively few values (e.g., number of times married)
Technically, a quantitative variable may take on any number of values and still be considered discrete, but it needs to be "countable". So, for example, the number of traffic accidents in a given time period may be considered discrete, but the amount of time between two consecutive accidents would be considered continuous. However, even a continuous variable may be used to produce discrete data if its range is divided or "coarsened" into intervals.
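As a minimal sketch of coarsening, the snippet below divides a continuous variable (time between consecutive accidents, in days) into three intervals and counts the observations in each. The waiting times, the cut points, and the interval labels are all made up for illustration; none of them come from the lesson.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical waiting times between consecutive accidents, in days.
gaps_days = [0.4, 2.5, 7.1, 1.2, 13.8, 3.3, 0.9, 5.6]

# Hypothetical cut points: each interval includes its lower endpoint.
cut_points = [1, 7]
labels = ["under 1 day", "1 to under 7 days", "7 days or more"]

# bisect_right returns how many cut points are <= the value,
# which is exactly the index of the interval containing it.
coarsened = [labels[bisect_right(cut_points, g)] for g in gaps_days]

# The coarsened variable is discrete, so it can be summarized by counts.
interval_counts = Counter(coarsened)
for label in labels:
    print(f"{label:<18} {interval_counts[label]}")
```

The continuous measurements are replaced by one of three ordered labels, which is the sense in which a continuous variable can produce discrete data.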
Note that many variables can be considered as either nominal or ordinal, depending on the purpose of the analysis. Consider majors in English, psychology, and computer science. This classification may be considered nominal or ordinal, depending on whether there is an intrinsic belief that it is "better" to have a major in computer science than in psychology or in English. Generally speaking, for a binary variable such as pass/fail, it does not matter whether the variable is treated as ordinal or nominal.
It should also be noted that numerically meaningful variables can be associated with any of the data types above, even the nominal type. For example, the gender categories of "man" and "woman" would themselves not be numerically meaningful, but if we let \(X\) be the number of men in a random sample, that would be considered a quantitative (random) variable.
Context is important! The context of the study and the relevant questions of interest are important in specifying what kind of variable we will analyze.
Examples
- Did you get the flu? (Yes or No): a binary nominal categorical variable
- What was the severity of your flu? (Low, Medium, or High): an ordinal categorical variable
Measurement Hierarchy
The main distinction between nominal and ordinal data is that the latter has a natural ordering (least to greatest, best to worst, etc.), whereas the former does not. If the ordered characteristic is ignored, however, ordinal data could be considered a special case of nominal data. Similarly, discrete quantitative data could be considered a special case of ordinal data, with the additional characteristic that values have numerical meaning. So, computations like differences and averages make sense. Thus, the hierarchy is
nominal < ordinal < quantitative
In terms of analyses, methods applicable for one type of variable can be used for the variables at higher levels too (but not at lower levels). For example, methods designed for nominal data can be used for ordinal data but not vice versa. However, keep in mind that an analysis method may not be optimal if it ignores information available in the data.
One final note on the organization of these types is that quantitative variables may be further divided into "interval" and "ratio" types, depending on whether operations of subtraction and division make sense, but we will rarely need to make such a distinction in this course.
Frequency Counts
While often not numerically meaningful originally, discrete data can be summarized with the frequency counts of individuals falling in the categories. If more than one variable is involved, counts can be tabulated either jointly, for each combination of categories, or marginally, for one variable by summing over the categories of the other variable. Here are some examples.
Example: Eye Color
This is a typical frequency table for a single categorical variable. A sample of n = 96 persons is obtained, and the eye color of each person is recorded. The table then summarizes the responses by their frequencies.
Eye color | Count |
---|---|
Brown | 46 |
Blue | 22 |
Green | 26 |
Other | 2 |
Total | 96 |
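A frequency table like this one is often reported alongside relative frequencies (sample proportions). The following is a small sketch of that calculation using the counts from the table; the variable names are illustrative choices, not part of the lesson.

```python
# Counts copied from the eye color table above.
eye_counts = {"Brown": 46, "Blue": 22, "Green": 26, "Other": 2}

# Sample size is the sum of the category counts.
n = sum(eye_counts.values())

# Relative frequency (sample proportion) of each category.
eye_props = {color: count / n for color, count in eye_counts.items()}

for color, p in eye_props.items():
    print(f"{color:<6} {eye_counts[color]:>3} {p:.3f}")
```

The proportions sum to 1, so they carry the same information as the counts once n is known.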
Example: Admissions Data
A university offers only two degree programs: English and computer science. Admission is competitive, and there is suspicion of discrimination against women in the admission process. Here is a two-way table of counts of all applicants by sex and admission status. These data can be used to measure the association between the sex of the applicants and their success in obtaining admission.
Sex | Admit | Deny | Total |
---|---|---|---|
Male | 35 | 45 | 80 |
Female | 20 | 40 | 60 |
Total | 55 | 85 | 140 |
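To make the distinction between joint and marginal counts concrete, here is a brief sketch that stores the four joint counts from the table and recovers the margins by summing. It also computes the sample admission rate within each sex as one simple descriptive comparison; the nested-dictionary layout and the names are assumptions made for this illustration.

```python
# Joint counts from the two-way table above: sex by admission status.
table = {
    "Male": {"Admit": 35, "Deny": 45},
    "Female": {"Admit": 20, "Deny": 40},
}

# Marginal counts for sex: sum each row over admission status.
row_totals = {sex: sum(cells.values()) for sex, cells in table.items()}

# Marginal counts for admission status: sum each column over sex.
col_totals = {}
for cells in table.values():
    for status, count in cells.items():
        col_totals[status] = col_totals.get(status, 0) + count

grand_total = sum(row_totals.values())

# Sample proportion admitted within each sex (a conditional proportion).
admit_rate = {sex: table[sex]["Admit"] / row_totals[sex] for sex in table}

print(row_totals, col_totals, grand_total)
for sex, rate in admit_rate.items():
    print(f"{sex}: {rate:.3f}")
```

The two sample rates differ, but a descriptive comparison like this does not by itself establish whether the difference reflects an association in the population.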
Example: Attitudes Towards War
This table shows hypothetical attitudes of n = 116 people towards war. They were asked to state their opinion on a four-point scale regarding the statement: "This is a necessary war".
Attitude | Count |
---|---|
Strongly disagree | 35 |
Disagree | 27 |
Agree | 23 |
Strongly agree | 31 |
Total | 116 |
Example: Attitudes Towards War (cont.)
Working from the example above, suppose now that in addition to the four ordered categories, outcomes where the person wasn't sure or refused to answer were also recorded, giving n = 130 total counts divided up as follows.
Attitude | Count |
---|---|
Strongly disagree | 35 |
Disagree | 27 |
Agree | 23 |
Strongly agree | 31 |
Not sure | 6 |
Refusal | 8 |
Total | 130 |
Example: Dice Rolls
Suppose a six-sided die is rolled 30 times, and the die face that comes up is recorded. One possible set of outcomes is tabulated below.
Face | Count |
---|---|
1 | 3 |
2 | 7 |
3 | 5 |
4 | 10 |
5 | 2 |
6 | 3 |
Total | 30 |
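A table like this is built by tallying raw outcomes. The sketch below simulates 30 rolls and tallies them; because the rolls are simulated, the counts it produces will generally differ from those tabulated above. The seed value is an arbitrary choice made only so that repeated runs give the same rolls.

```python
import random
from collections import Counter

# Simulate 30 rolls of a fair six-sided die.
# The seed is arbitrary; it only makes the run repeatable.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(30)]

# Tally the raw outcomes into frequency counts.
tally = Counter(rolls)

# List every face from 1 to 6, including any face that never came up.
face_counts = {face: tally[face] for face in range(1, 7)}

for face, count in face_counts.items():
    print(f"{face} | {count}")
print(f"Total | {sum(face_counts.values())}")
```

Listing all six faces explicitly matters because a face with zero occurrences is still a category of the variable.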
Example: Number of Children in Families
Here's an example where the response categories are numerically meaningful: the number of children in n = 100 randomly selected families.
Number of children | Count |
---|---|
0 | 19 |
1 | 26 |
2 | 29 |
3 | 13 |
4-5 | 11 |
6+ | 2 |
Total | 100 |
Example: Household Incomes
The variable in this example is total gross income, recorded for a sample of n = 100 households.
Income | Count |
---|---|
below \$10,000 | 11 |
\$10,000–\$24,999 | 23 |
\$25,000–\$39,999 | 30 |
\$40,000–\$59,999 | 24 |
\$60,000 and above | 12 |
Total | 100 |
The original data (raw incomes) were essentially continuous, but any type of data, continuous or discrete, can be grouped or coarsened into categories.
Grouping data will typically result in some loss of information. How much information is lost depends on
- the number of categories and
- the question being addressed.
In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below \$10,000 has not been affected, but estimating the proportion of households with incomes above \$75,000 is now virtually impossible.
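The points about information loss can be checked directly from the grouped counts. This is a rough sketch, with names chosen for illustration: it computes the proportion below \$10,000 exactly, locates the income category containing the median (without being able to pin down the median itself), and shows that the proportion above \$75,000 can only be bounded.

```python
# Grouped counts from the household income table above, in order.
bins = [
    ("below $10,000", 11),
    ("$10,000-$24,999", 23),
    ("$25,000-$39,999", 30),
    ("$40,000-$59,999", 24),
    ("$60,000 and above", 12),
]

n = sum(count for _, count in bins)

# A category boundary sits at $10,000, so this proportion is exact.
prop_below_10k = bins[0][1] / n

# Only the category containing the median can be identified:
# walk the cumulative counts until half the sample is covered.
# (Here the 50th and 51st ordered incomes fall in the same category.)
cumulative = 0
median_bin = None
for label, count in bins:
    cumulative += count
    if cumulative >= n / 2:
        median_bin = label
        break

# No boundary sits at $75,000, so the proportion above it is only
# known to lie between 0 and the share of the top category.
prop_above_75k_bounds = (0.0, bins[-1][1] / n)

print(prop_below_10k, median_bin, prop_above_75k_bounds)
```

Whether a quantity survives grouping depends on whether it lines up with a category boundary, which is why the question being addressed matters as much as the number of categories.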