2.1.1 - One Categorical Variable

Data concerning one categorical variable can be summarized using a proportion.

Proportion
\(Proportion=\dfrac{Number\;in\;the\;category}{Total\;number}\)

The symbol for a sample proportion is \(\widehat{p}\) and is read as "p-hat." The symbol for a population proportion is \(p\). 

The formula for a sample proportion may also be written as \(\widehat p = \frac{x}{n}\) where \(x\) is the number in the sample with the trait of interest and \(n\) is the sample size.

A proportion must be between 0 and 1, inclusive. It equals 0 when no cases fall in the category and 1 when every case does.

Example: Black Cards

A standard 52-card deck contains \(26\) red cards and \(26\) black cards. What proportion of cards are black?

\(p=\dfrac{26}{52}=0.50\)

The symbol \(p\) was used because this is the proportion of all cards (i.e., the population) that are black.

Example: World Campus Undergraduate Students

In the Fall 2014 semester, there were \(82{,}382\) undergraduate students enrolled at Penn State. Of those, \(6{,}245\) were World Campus students. What proportion of all Penn State undergraduate students were World Campus students?

\(p=\dfrac{6245}{82382}=0.076\)

The symbol \(p\) was used because this is the proportion of all Penn State undergraduate students (i.e., the population) that are World Campus students.

Example: Broken Cookies

In a sample of \(30\) randomly selected packages of chocolate chip cookies, \(18\) contained broken cookies. What proportion of these selected packages had broken cookies?

\(\widehat{p}=\dfrac{18}{30}=0.60\)

These data were collected from a sample so the symbol \(\widehat{p}\) was used to denote a sample proportion. 
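The three examples above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `proportion` and the variable names are illustrative and are not part of the course materials.

```python
def proportion(count_in_category, total):
    """Return count_in_category / total as a decimal between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count_in_category <= total:
        raise ValueError("count must be between 0 and total")
    return count_in_category / total


# Population proportions (p): every case in the population is counted.
p_black_cards = proportion(26, 52)
p_world_campus = proportion(6245, 82382)

# Sample proportion (p-hat): only the 30 sampled packages were inspected.
p_hat_broken = proportion(18, 30)

print(round(p_black_cards, 2))   # 0.5
print(round(p_world_campus, 3))  # 0.076
print(round(p_hat_broken, 2))    # 0.6
```

The calculation is identical for \(p\) and \(\widehat{p}\); only the source of the counts (a whole population or a sample) determines which symbol applies.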


2.1.1.1 - Risk and Odds

You may have heard the terms risk and odds before. They are both ways to communicate the likelihood of an event.

Risk and odds are often confused with one another. The formulas for computing risk and odds are different and their interpretations are different.

Risk

In statistics, the word risk communicates the likelihood of an event occurring. This is synonymous with probability or proportion (i.e., the formulas are the same).

Risk
The probability that an event will occur. It may be written as a decimal, a fraction, or a percent.

\(Risk= \dfrac{number \;with \;the\; outcome}{total\;number\;of\;outcomes}\)

Example: Asthma Risk

\(60\) out of \(1000\) teens have asthma.

\(risk=\dfrac{60}{1000}=0.06\)

This means that \(6\%\) of teens experience asthma.

Example: Flu Risk

\(45\) out of \(100\) children get the flu each year.

\(risk=\dfrac{45}{100}=0.45\) or \(45\%\)
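Because risk may be written as a decimal, a fraction, or a percent, it can help to see all three forms side by side. The sketch below uses Python's standard `fractions` module; the function name `risk_forms` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction


def risk_forms(number_with_outcome, total_number):
    """Return the same risk as a decimal, a reduced fraction, and a percent string."""
    decimal = number_with_outcome / total_number
    fraction = Fraction(number_with_outcome, total_number)  # reduced automatically
    percent = f"{decimal:.0%}"
    return decimal, fraction, percent


# Asthma example: 60 out of 1000 teens.
print(risk_forms(60, 1000))

# Flu example: 45 out of 100 children.
print(risk_forms(45, 100))
```

All three values in each result describe the same likelihood; which form to report is a matter of audience and convention.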

Odds

Odds
A way of expressing risk that compares the likelihood that an event happens to the likelihood that it does not happen.

\(odds = \dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}\)

OR

\(odds=\dfrac{risk}{1-risk}\)
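The two formulas give the same value. As a short sketch, borrow the notation of the sample proportion formula and let \(x\) be the number with the outcome and \(n\) the total number, so that \(n-x\) is the number without the outcome and \(risk=\frac{x}{n}\). Dividing the numerator and denominator of the first formula by \(n\) gives the second:

\(odds=\dfrac{x}{n-x}=\dfrac{x/n}{(n-x)/n}=\dfrac{x/n}{1-x/n}=\dfrac{risk}{1-risk}\)

For instance, with \(x=45\) and \(n=100\), both \(\dfrac{45}{55}\) and \(\dfrac{0.45}{0.55}\) equal approximately \(0.818\).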

We often interpret odds in relation to the value of 1. For example, if the odds of a game are in favor of the house 2 to 1, that means for every 2 games the house wins it will lose 1. 

Example: Passing Odds

In one large class, 850 students passed an exam while 150 students failed. Because we have the raw counts, we can use the first odds formula.

\(odds=\dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}=\dfrac{850}{150}=5.667\)

The odds of passing were 5.667 to 1. In other words, for every 5.667 students who passed the exam there was 1 who failed.

Example: Flu Odds

The risk of a child getting the flu is \(45\%\) which can also be written as \(0.45\). Because we have the risk, we can use the second odds formula.

\(odds=\dfrac{risk}{1-risk}=\dfrac{0.45}{1-0.45}=\dfrac{0.45}{0.55}=0.818\)

The odds of a child getting the flu are \(0.818\) to \(1\). In other words, for every \(0.818\) children who get the flu there is \(1\) who does not.
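Both odds formulas can be written as small Python functions. This is a minimal sketch; the names `odds_from_counts` and `odds_from_risk` are illustrative, not taken from the course materials.

```python
def odds_from_counts(number_with, number_without):
    """First formula: number with the outcome / number without the outcome."""
    if number_without <= 0:
        raise ValueError("number_without must be positive")
    return number_with / number_without


def odds_from_risk(risk):
    """Second formula: risk / (1 - risk). Undefined when risk equals 1."""
    if not 0 <= risk < 1:
        raise ValueError("risk must be at least 0 and less than 1")
    return risk / (1 - risk)


passing_odds = odds_from_counts(850, 150)  # raw counts are available
flu_odds = odds_from_risk(0.45)            # only the risk is available

print(round(passing_odds, 3))  # 5.667
print(round(flu_odds, 3))      # 0.818

# The flu risk of 0.45 corresponds to 45 with and 55 without per 100 children,
# so the first formula gives the same odds.
print(round(odds_from_counts(45, 55), 3))  # 0.818
```

Note that a risk of \(0.45\) and odds of \(0.818\) describe the same situation; the two numbers differ because odds compare the event to its complement rather than to the total.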


2.1.1.2 - Visual Representations

Frequency tables, pie charts, and bar charts are the most appropriate displays for a categorical variable. Below are a frequency table, a pie chart, and a bar chart for data concerning Penn State’s undergraduate enrollments by campus in Fall 2017.

Note that in the bar chart, the bars are separated by a space. The spaces between the bars signify that this is a categorical variable. On the following pages you will learn how to make these graphs using Minitab Express.

Frequency Table

A table containing the counts of how often each category occurs.

Tally
Campus                      Count    Percent
University Park             40835      48.5%
Commonwealth Campuses       29388      34.9%
PA College of Technology     5465       6.5%
World Campus                 8513      10.1%
Total                       84201     100.0%

Penn State Fall 2017 Undergraduate Enrollments
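The Percent column is each count divided by the total, multiplied by 100. The sketch below recomputes it in Python from the counts in the table; the dictionary name `enrollment` and the helper `percent_table` are illustrative names, not part of Minitab Express.

```python
# Counts copied from the frequency table above.
enrollment = {
    "University Park": 40835,
    "Commonwealth Campuses": 29388,
    "PA College of Technology": 5465,
    "World Campus": 8513,
}


def percent_table(counts):
    """Map each category to its percent of the total count."""
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


total = sum(enrollment.values())
percents = percent_table(enrollment)

for campus, count in enrollment.items():
    print(f"{campus:<26}{count:>7}{percents[campus]:>8.1f}%")
print(f"{'Total':<26}{total:>7}{100.0:>8.1f}%")
```

The rounded percents agree with the table, and the counts sum to the reported total of 84201.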

Pie chart

Graphical representation for categorical data in which a circle is partitioned into “slices” on the basis of the proportions of each category.

[Figure: Pie Chart of Campus. Slices: University Park (48.5%), Commonwealth Campuses (34.9%), PA College of Technology (6.5%), World Campus (10.1%). Caption: Penn State Fall 2017 Undergraduate Enrollments]

Bar chart

Graphical representation for categorical data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each category; bars are separated by space.

[Figure: Minitab Express bar chart. Caption: Penn State Fall 2017 Undergraduate Enrollments]

Tips

Pie charts tend to work best when there are only a few categories. If a variable has many categories, a pie chart may be more difficult to read. In those cases, a frequency table or bar chart may be more appropriate.

When selecting a visual display for your data you should first determine how many variables you are going to display and whether they are categorical or quantitative. Then, you should think about what you are trying to communicate. Each visual display has its own strengths and weaknesses. When first starting out, you may need to make a few different types of displays to determine which best communicates your data.


2.1.1.2.1 - Minitab Express: Frequency Tables

The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state. To get an idea of the pattern of variation of a categorical variable such as region, we can display the information with a frequency table, pie chart, or bar chart.

Minitab Express – Frequency Table

To create a frequency table in Minitab Express:

  1. Open the data set:
  2. On a PC: In the menu bar select STATISTICS > Describe > Tally
    On a Mac: In the menu bar select Statistics > Summary Statistics > Tally
  3. Double click the variable Region in the box on the left to insert the variable into the Variable box
  4. Under Statistics, check Counts and Percents
  5. Click OK

This should result in the following frequency table:

Tally
Region   Count    Percent
ENC          5    9.8039%
ESC          4    7.8431%
MA           3    5.8824%
MTN          8   15.6863%
NE           6   11.7647%
PAC          5    9.8039%
SA           9   17.6471%
WNC          7   13.7255%
WSC          4    7.8431%
N= 51
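A tally of raw data simply counts how many rows carry each category label. The sketch below does the same with Python's `collections.Counter`. The SAT data file is not reproduced here, so `region_column` is a stand-in rebuilt from the counts in the table above (one label per case); both `region_counts` and `region_column` are hypothetical names.

```python
from collections import Counter

# Stand-in for the Region column: one label per state (plus Washington, DC).
region_counts = {"ENC": 5, "ESC": 4, "MA": 3, "MTN": 8, "NE": 6,
                 "PAC": 5, "SA": 9, "WNC": 7, "WSC": 4}
region_column = [region for region, n in region_counts.items() for _ in range(n)]

# Tally the raw column, as a frequency table does.
tally = Counter(region_column)
n_cases = len(region_column)

print("Region Count   Percent")
for region in sorted(tally):
    percent = 100 * tally[region] / n_cases
    print(f"{region:<7}{tally[region]:>5}{percent:>9.4f}%")
print(f"N= {n_cases}")
```

Each percent is the category count divided by the number of cases, which is why the nine percents sum to 100.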


2.1.1.2.2 - Minitab Express: Pie Charts

The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state.

Minitab Express – Pie Chart (Raw Data)

To create a pie chart in Minitab Express:

  1. Open the data set:
  2. On a PC or Mac: Select Graphs > Pie Chart
  3. Select Counts of Unique Values
  4. Double click the variable Region in the box on the left to insert the variable into the Categorical variable box
  5. Click OK

This should result in the pie chart below:

[Figure: pie chart of Region created using Minitab Express]

Summarized Data

In the example above, raw data were used. In other words, the data set contained one row for each case. It is also possible to use Minitab Express to construct a pie chart from summarized data, for example, if you had your counts in a frequency table. In that case, in step 3 you would select Summarized Data and enter the names of the categories in the Category names box and the frequency counts in the Summary values box.
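With summarized data, the size of each slice follows directly from the counts: a category's share of the circle is its proportion of the total, so its central angle is that proportion times 360 degrees. The Python sketch below illustrates this arithmetic using the Region counts from the frequency table in 2.1.1.2.1. It is not a description of how Minitab Express draws the chart, and the names `region_counts` and `slice_angles` are illustrative.

```python
# Summarized data: category names with their frequency counts.
region_counts = {"ENC": 5, "ESC": 4, "MA": 3, "MTN": 8, "NE": 6,
                 "PAC": 5, "SA": 9, "WNC": 7, "WSC": 4}


def slice_angles(counts):
    """Return the central angle, in degrees, of each category's pie slice."""
    total = sum(counts.values())
    return {category: 360 * count / total for category, count in counts.items()}


angles = slice_angles(region_counts)
for region, angle in angles.items():
    print(f"{region:<4}{angle:6.1f} degrees")
```

The angles sum to 360 degrees, and the largest category (SA, with 9 of the 51 cases) receives the largest slice.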


2.1.1.2.3 - Minitab Express: Bar Charts

The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state.

Minitab Express – Bar Chart (Raw Data)

To create a bar graph in Minitab Express:

  1. Open the data set 
  2. On a PC or Mac: Select Graphs > Bar Chart
  3. In the Bars represent drop-down, keep the default option, Counts of unique values in a categorical variable
  4. Select Simple
  5. Double click the variable Region in the box on the left to insert the variable into the Categorical variable box
  6. Click OK

This should result in the bar graph below:

[Figure: bar chart titled "Chart of Region"]


Summarized Data

In the example above, raw data were used. In other words, the Minitab Express file consisted of one row for each case. We can also use Minitab Express to construct a bar chart from summarized data, for example, if you had data in a frequency table. To do this, in the third step shown above, change the Bars represent drop-down to Summarized values for each category in a table. You will still select Simple. The Summary variable will contain the numerical values (the counts) and the Categorical variable will contain the names of the categories.
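To make the idea of summarized data concrete, the sketch below draws a rough text-only bar chart in Python from a table of category names and counts, which play the roles of the Categorical variable and the Summary variable. It uses the Region counts from the frequency table in 2.1.1.2.1; the function name `text_bar_chart` is a hypothetical helper, and the output is only an approximation of what a graphing tool would produce.

```python
# Summarized data: one entry per category rather than one row per case.
region_counts = {"ENC": 5, "ESC": 4, "MA": 3, "MTN": 8, "NE": 6,
                 "PAC": 5, "SA": 9, "WNC": 7, "WSC": 4}


def text_bar_chart(counts, symbol="#"):
    """Return one horizontal bar per category; bar length equals the count."""
    width = max(len(category) for category in counts)
    lines = [f"{category:<{width}} | {symbol * n} {n}"
             for category, n in counts.items()]
    return "\n".join(lines)


print(text_bar_chart(region_counts))
```

Each bar sits on its own line, separate from the others, which mirrors the gaps between bars that signal a categorical variable.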

