1: Describing Data and Ethics


Overview

When starting on the journey of learning statistics, one of the first things that comes to mind is the controversial ways in which statistics are used. This prompts a conversation about the ethics of statistics. Consider the following scenario:

 Case-Study: Funding for Social Services

Maria works in social services for a program that advocates for abused, abandoned, and neglected children. The program keeps detailed records of the children and their moves and services throughout the dependency care system. Much of the funding for the program is determined by the numbers associated with the record keeping. In the U.S., sources of funding vary from state to state, but funding is generally a mix of local, state, and federal dollars. The allocation of grants and other funding is typically tied to the result of some metric that can be measured over time. Maria is struggling to understand how best to describe her metrics and, perhaps more importantly, how to ethically balance the need to comply with reporting and the need to justify additional funding. How can Maria work with her data and balance the competing forces within an ethical decision-making process?

Since you are reading these notes you are probably thinking about statistics. Just like Maria in the example above, we need to think about how we use basic descriptive statistics in our work and daily lives. Descriptive statistics are basic operations performed on data. They do not convey any significance, prediction, or certainty; they simply describe the data in front of you. Before Maria can have any conversations with funding agencies or regulatory bodies she must be familiar with her clients, staff, hours of service, and outcomes, which she can learn from descriptive statistics. Maria would not (or should not) attend a meeting without knowing this information. As statistical concepts become more complex, the importance of understanding the fundamental building blocks of descriptive statistics becomes more apparent, allowing us to correctly select and apply more advanced inferential statistics. These advanced techniques allow us to model the real world, based on our own local or sample data.

The first step in understanding descriptive statistics is to start thinking about some ways in which statistics and numbers are used. Select some media content on a study, poll, or trend, and you will see statistics. Whether it is political polling, the effectiveness of a product, or an analysis of your favorite sports team, statistics are all around us. Often, we are not even aware of some of the potential misuses of data in our world. For example, how was the sample drawn? What exact question was asked? What data were included and what data were not included? The information in the accompanying materials on ethics (links: ethics in statistics, use and misuse of numbers, and statistics ethics advice) is presented to get you thinking about the issues around ethics.

So, as we begin our journey of learning statistics, let’s start with some of the ethics involved in statistics along with some basic concepts of descriptive statistics.

Objectives

Upon completion of this lesson, you should be able to:

  • Identify ethical dilemmas
  • Choose among alternative actions using the ASA ethical guidelines as a framework for values
  • Correctly identify measures of central tendency (mean, median, and mode)
  • Match appropriate descriptive statistics with the type of data
  • Match the appropriate graph with type of data

1.1 - Classifying Statistics


What's a variable?

Let’s get to know some of the descriptive statistics. The first challenge is determining what kind of data you are dealing with. There are generally two main types of data, qualitative and quantitative.

Qualitative data is typically words, but could also be images or other media. In this course we will refer to this data as categorical. Qualitative data may be labeled with numbers, allowing this type of data to be analyzed using some of the techniques in the course. Maria might encounter some qualitative data in her work by labeling some of the mental health diagnoses (depression might be a “1”; anxiety a “2”). Note how these numerical labels are arbitrary. On the other hand, quantitative data is the focus of this course and is numerical. If Maria counts the number of patients seen each day, this data is quantitative.

Quantitative variables may be discrete or continuous. Discrete variables can only take on separate, countable values (e.g., only whole numbers), while continuous variables can take on any value within a range, including any value between two other values (e.g., out to an infinite number of decimal places).

Before we get too far along, let’s take a moment to think about what the word “variable” means. A variable (notice this is a noun, not a verb) is an element or a feature. In statistics, this is typically something that is measured or recorded. In Maria’s case, the “number of patients” is a variable, and the mental health diagnosis is also a variable.

Summarizing Types of Variables

To summarize:

Categorical variable

Names or labels (i.e., categories) with no logical order or with a logical order but inconsistent differences between groups, also known as qualitative.

Example: Eye Color

Quantitative variable

Numerical values with magnitudes that can be placed in a meaningful order with consistent intervals, also known as numerical.

Continuous variable

Characteristic that varies and can take on any value and any value between values

Example: Gas Prices

Discrete variable

Characteristic that varies and can only take on a set number of values

Example: Number of Customers

If a child admitted to Maria’s program is weighed upon admission, this weight is a quantitative variable because it takes on numerical values with meaningful magnitudes. It is a continuous variable because, theoretically, weight could take on any value. Any value between any two values is a possibility.

Example 1.1: Favorite Ice Cream Flavor

If each child at Maria’s organization is offered an ice cream cone, there may be three choices of flavor: chocolate, vanilla, or strawberry. The ice cream flavor is a categorical variable because the different flavors are categories with no meaningful order.


1.2 - Summarizing Data Visually


Summarizing Categorical Variables

Once the type of data (categorical or quantitative) is identified, we can consider graphical representations of the data, which would help Maria understand it.

Frequency tables, pie charts, and bar charts are the most appropriate graphical displays for categorical variables. Below are a frequency table, a pie chart, and a bar graph for data concerning Mental Health Admission numbers.

Frequency Table
A table containing the counts of how often each category occurs.

Diagnosis     Count    Percent
Depression    40835      48.5%
Anxiety       29388      34.9%
OCD            5465       6.5%
Abuse          8513      10.1%
Total         84201     100.0%
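
As a minimal sketch of how the Percent column in a frequency table follows from the counts, the Python snippet below recomputes the percentages from the admission counts in the table above. The variable names are illustrative and are not part of the course materials.

```python
# Counts taken from the frequency table of mental health admissions.
counts = {"Depression": 40835, "Anxiety": 29388, "OCD": 5465, "Abuse": 8513}

total = sum(counts.values())  # total number of admissions

# Relative frequency (percent) of each category, rounded to one decimal place.
percents = {diagnosis: round(100 * count / total, 1)
            for diagnosis, count in counts.items()}

# Print one row per category, followed by the total row.
for diagnosis, count in counts.items():
    print(f"{diagnosis:<12}{count:>7}{percents[diagnosis]:>8.1f}%")
print(f"{'Total':<12}{total:>7}{100.0:>8.1f}%")
```

Reporting the counts next to the percentages lets a reader see both the share of each category and how many people that share represents.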

Pie chart

Graphical representation for categorical data in which a circle is partitioned into “slices” on the basis of the proportions of each category.

[Figure: Pie Chart of Diagnosis. Categories: Depression (48.5%), Anxiety (34.9%), OCD (6.5%), Abuse (10.1%)]

 Pitfalls

One of the pitfalls of a pie chart is that if the “slices” only represent percentages the reader does not know how many actual people fall in each category.

Bar Chart
Graphical representation for categorical data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each category; bars are separated by space.

Note that in the bar chart, the categories of mental health diagnoses (bars) have white spaces in between them. The spaces between the bars signify that this is a categorical variable.

Pie charts tend to work best when there are only a few categories. If a variable has many categories, a pie chart may be more difficult to read. In those cases, a frequency table or bar chart may be more appropriate.

 Pitfalls

While bar charts can be presented as either percentages (in which case they are referred to as relative frequency charts) or counts, the differences among the heights of the bars are often assumed to be meaningful, even when they are not.

Summarizing Quantitative Variables

But what of variables that are quantitative such as math SAT or percentage taking the SAT? For these variables we should use histograms or boxplots. Histograms differ from bar graphs in that they represent frequencies by area and not height. A good display will help to summarize a distribution by reporting the center, spread, and shape for that variable.

For now, the goal is to summarize the distribution or pattern of variation of a single quantitative variable.

Histogram
Histograms are graphical displays that can be used with one quantitative variable. In these plots the horizontal axis represents the values of the variable, divided into classes of equal width, and the height of each bar represents how many observations fall in that class.

From the histogram of children’s heights below, Maria can see that about 10 children have a height in the class at “60” inches.

Histogram of Height (inches)

 Pitfalls

People frequently confuse bar charts and histograms. The first test should be to identify what kind of data you are charting (or what kind of data was charted): quantitative or categorical. Another hint is that the x-axis of a histogram will contain labels that reflect a quantitative variable, whereas a bar chart will have an x-axis that contains category labels, generally not numbers.

To draw a histogram by hand we would:

  1. Divide the range of data (range is from the smallest to largest value within the data for the variable of interest) into classes of equal width.
  2. Count the number of observations in each class.
  3. Draw the histogram using the horizontal axis as the range of the data values and the vertical axis for the counts within the class.
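
The three steps above can be sketched in a few lines of Python. The heights and the number of classes below are made-up values chosen only for illustration; they are not the data behind the histogram in these notes.

```python
# Hypothetical heights in inches (made-up values for illustration only).
heights = [52, 55, 57, 58, 59, 60, 60, 61, 61, 62, 63, 64, 66, 68, 71]

num_classes = 4  # assumed number of classes; the choice is up to the analyst

# Step 1: divide the range of the data into classes of equal width.
low, high = min(heights), max(heights)
width = (high - low) / num_classes

# Step 2: count the number of observations in each class.
counts = [0] * num_classes
for h in heights:
    # The largest value is placed in the last class rather than a new one.
    index = min(int((h - low) / width), num_classes - 1)
    counts[index] += 1

# Step 3: draw the histogram (here as text, one '*' per observation).
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:5.2f} to {start + width:5.2f} | {'*' * count}")
```

Changing `num_classes` changes the look of the histogram, which is one reason it can be worth trying a few different class widths before settling on a display.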

Choosing the appropriate display

When selecting a visual display for your data you should first determine how many variables you are going to display and whether they are categorical or quantitative. Then, you should think about what you are trying to communicate. Each visual display has its own strengths and weaknesses. When first starting out, you may need to make a few different types of displays to determine which best communicates your data.


1.3 - Minitab: Summarizing Data Visually


Minitab®  – Bar, Pie Charts, & Histograms

Steps for Creating a Pie Chart

  1. In Minitab choose Graph > Pie Chart.
  2. Choose one of the following, depending on the format of your data:
  3. In Category names, enter the column of categorical data that defines the groups.
  4. In Summary values, enter the column of summary data that you want to graph.
  5. Choose OK.

Steps for Creating a Bar Chart

  1. In Minitab choose Graph > Bar Chart.
  2. Choose one of the following, depending on the format of your data:
    1. Counts of unique values (This is the best option). Choose Simple for the graph type.
    2. A function of a variable
    3. Values from a table
  3. Click OK

Steps for Creating Histograms

How to create a histogram in Minitab:

  1. Click Graph > Histogram.
  2. Choose Simple.
  3. Enter the column with your variable.
  4. Click OK.

1.4 - Measures of Central Tendency


  The ability to visually summarize data is effective, but someone like Maria will probably need to present some numerical summaries of her data to use in her reporting. The most common measures to describe data are measures of central tendency.

Mean, Median, Mode

A measure of central tendency is an important aspect of quantitative data. It is an estimate of a “typical” value. Maria may be asked for the typical number of children seen per month.

Three of the many ways to measure central tendency are the mean, median and mode.

There are other measures, such as a trimmed mean, that we do not discuss here.

Mean
The mean is the average of data.

NOTE: At this point, we are going to start to use some basic notation to represent numbers as we present formulas and ways of calculating. When you read "Let (some confusing symbols) represent", we are trying to convey the formula in a "generic" way. If this gets confusing, skim over the formulas and pay more attention to the detailed example below!

Let \(x_1, x_2, \ldots, x_n\) be our sample. (As per the previous note, all we are doing is having \(x_1, x_2, \ldots, x_n\) represent numbers. We could just as easily have illustrated this with real values such as 1, 2, 3, 4, and 5.)

The sample mean is usually denoted by \(\bar{x}\). (If you are following this correctly, for the values 1, 2, 3, 4, and 5, \(\bar{x}\) would be 3!)

\(\bar{x}=\sum_{i=1}^n \dfrac{x_i}{n}=\dfrac{1}{n}\sum_{i=1}^n x_i\)

where n is the sample size and \(x_i\) are the measurements. One may need to use the sample mean to estimate the population mean since usually only a random sample is drawn and we don't know the population mean.

Is this notation confusing you? Don't let it get to you. If this is not intuitive, focus on the concepts of what the formulas are doing. In this example, we are adding all of the numbers (the addition is represented by the big squiggly E, the Greek capital letter sigma, \(\Sigma\)) and dividing by the total number of observations!
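
For readers who find code easier to follow than notation, here is a small Python sketch of the same calculation, using the values 1, 2, 3, 4, and 5 mentioned above. The variable names are chosen only to mirror the symbols in the formula.

```python
# The example values used in the text.
x = [1, 2, 3, 4, 5]

n = len(x)         # sample size
total = sum(x)     # the summation: add up every observation
x_bar = total / n  # divide by the number of observations

print(x_bar)  # 3.0
```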

Quite simply, Maria would calculate the average number of children seen per month.

The sample mean (\(\bar{x}\)) is a statistic and a population mean (\(\mu\)) is a parameter.
Note on Notation

What if we say we used \(y_i\) for our measurements instead of \(x_i\)? Is this a problem? No. The formula would simply look like this: \(\bar{y}=\sum_{i=1}^n \dfrac{y_i}{n}=\dfrac{1}{n}\sum_{i=1}^n y_i\)

The formulas are exactly the same. The letters that you select to denote the measurements are up to you. For instance, many textbooks use \(y\) instead of \(x\) to denote the measurements. The point is to understand how the calculation that is expressed in the formula works. In this case, the formula is calculating the mean by summing all of the observations and dividing by the number of observations. There is some notation that you will come to see as standard, e.g., n will always equal the sample size. We will make a point of letting you know what these are. However, when it comes to the variables, these labels can (and do) vary.

Median

The median is the middle value of the ordered data. Maria might be asked to report the median if she had one or two months with extremely large or small numbers of children seen at the agency.

The most important step in finding the median is to first order the data from smallest to largest.

Steps to finding the median for a set of data:

  1. Arrange the data in increasing order, i.e. smallest to largest.
  2. Find the location of the median in the ordered data by \(\frac{n+1}{2}\), where n is the sample size.
  3. The value that represents the location found in Step 2 is the median.
Note on Odd or Even Sample Sizes
If the sample size is an odd number then the location point will produce a median that is an observed value. If the sample size is an even number, then the location will require one to take the mean of two numbers to calculate the median. The result may or may not be an observed value as the example below illustrates.
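
The steps and the note above can be sketched as a short Python function. This is an illustrative implementation of the \(\frac{n+1}{2}\) location rule, assuming a non-empty list of numbers; the monthly counts are made-up values.

```python
def median(data):
    """Median via the (n + 1) / 2 location rule described above."""
    ordered = sorted(data)   # Step 1: order smallest to largest
    n = len(ordered)
    location = (n + 1) / 2   # Step 2: 1-based location of the median
    if location.is_integer():
        # Odd n: the location points at an observed value.
        return ordered[int(location) - 1]
    # Even n: average the two values on either side of the location.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2

# Hypothetical monthly counts of children seen (made-up values).
print(median([12, 15, 9, 20, 11]))      # 12
print(median([12, 15, 9, 20, 11, 14]))  # 13.0
```

With six values the location is 3.5, so the function averages the third and fourth ordered values (12 and 14). The result, 13, is not one of the observed values.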
Mode
The mode is the value that occurs most often in the data. It is important to note that there may be more than one mode in the dataset. For Maria, the mode would be the monthly number of children seen that occurs most often across her records.
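
Because a data set can have more than one mode, a small Python sketch may help. The helper name `modes` and the monthly counts are hypothetical, introduced here only for illustration.

```python
from collections import Counter

def modes(data):
    """Return every value that occurs most often (there may be several)."""
    counts = Counter(data)            # how many times each value occurs
    highest = max(counts.values())    # the largest frequency
    return sorted(value for value, count in counts.items() if count == highest)

# Hypothetical monthly counts of children seen (made-up values).
monthly_counts = [12, 15, 12, 20, 15, 9]
print(modes(monthly_counts))  # [12, 15]
```

Here both 12 and 15 occur twice, so the data set has two modes.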

Example 1-2: SAT Data

From an SAT data set, we get the following participation rates for the nine South Atlantic states (Region is SA): 74, 79, 65, 75, 71, 74, 64, 73, and 20. In order to find the median we must first rank the data from smallest to largest:

20, 64, 65, 71, 73, 74, 74, 75, 79

To find the middle point we take the number of observations plus one and divide by two. Mathematically this looks like this where n is the number of total observations:

\(\dfrac{n+1}{2}=\dfrac{9+1}{2}=5\)

Returning to the ordered string of data, the fifth observation is 73. Thus the median of this distribution is 73. The interpretation of the median is that 50% of the observations fall at or below this value and 50% fall at or above this value. In this example, this would mean that 50% of the observations are at or below 73 and 50% are at or above 73. If another value were observed, say 88, this would bring the number of observations to ten. Using the formula above, the middle point would be at 5.5 (10 plus 1, divided by 2). Here we would find the median by taking the average of the fifth and sixth observations, which would be the average of 73 and 74. The new median for these ten observations would be 73.5. As you can see, the median value is not always an observed value of the data set.

To find the mean, we simply add all of the numbers and then divide this total by the number of values summed. Mathematically this looks like this, where again n is the number of observations:

\(\bar{x}=\dfrac{\sum^n_{i=1}x_i}{n}=\dfrac{74+79+65+75+71+74+64+73+20}{9}=66.11\)

Effects of Outliers

One shortcoming of the mean is that means are easily affected by extreme values. Measures that are not much affected by extreme values are called resistant. Measures that are affected by extreme values are called sensitive. As stated, Maria would use the median if she felt her numbers could be impacted by outliers, because the median is resistant to outliers.
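
To see this sensitivity in numbers, the sketch below uses Python's standard `statistics` module on the participation rates from Example 1-2, once with and once without the extreme value of 20. Treating 20 as the outlier to drop is an assumption made only for this comparison.

```python
from statistics import mean, median

# Participation rates from Example 1-2 (20 is the extreme value).
rates = [74, 79, 65, 75, 71, 74, 64, 73, 20]
without_extreme = [r for r in rates if r != 20]

# With the extreme value included.
print(round(mean(rates), 2), median(rates))           # 66.11 73

# With the extreme value removed.
print(mean(without_extreme), median(without_extreme)) # 71.875 73.5
```

Removing one value moves the mean by almost six points, while the median moves by only half a point.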

Adding and Multiplying Constants

What happens to the mean and median if we add or multiply each observation in a data set by a constant?

Consider for example if an instructor curves an exam by adding five points to each student’s score. What effect does this have on the mean and the median? The result of adding a constant to each value has the intended effect of altering the mean and median by the constant.

For example, suppose a set of 10 aptitude scores has a mean of 82.1 and a median of 81. If 5 were added to each score, the mean of this new data set would be 87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).

Similarly, if each observed data value were multiplied by a constant, the new mean and median would change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were doubled, then the new mean and new median would be double the original mean and median. As we will learn shortly, the effect is not the same on the variance!
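
The individual aptitude scores are not listed in these notes, so the sketch below checks the same two rules on the participation rates from Example 1-2 instead. It relies on Python's standard `statistics` module.

```python
from statistics import mean, median

rates = [74, 79, 65, 75, 71, 74, 64, 73, 20]  # data from Example 1-2

shifted = [r + 5 for r in rates]  # add a constant to every observation
doubled = [r * 2 for r in rates]  # multiply every observation by a constant

# Medians: original, shifted by 5, doubled.
print(median(rates), median(shifted), median(doubled))  # 73 78 146

# Means: original, shifted by 5, doubled.
print(round(mean(rates), 2), round(mean(shifted), 2), round(mean(doubled), 2))  # 66.11 71.11 132.22
```

Adding 5 raises both the mean and the median by exactly 5, and doubling every value doubles both.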

Shape and Central Tendency

The shape of the data helps us to determine the most appropriate measure of central tendency. The three most important descriptions of shape are Symmetric, Left-skewed, and Right-skewed. Skewness is a measure of the degree of asymmetry of the distribution. Maria might want to examine the shape of the distribution of the number of children seen.

 

Symmetric

  • mean, median, and mode are all the same here
  • no skewness is apparent
  • the distribution is described as symmetric
[Figure: a symmetrical distribution, with mean = median = mode]

Left-Skewed or Skewed Left

  • mean < median
  • long tail on the left
[Figure: a distribution skewed to the left, with the mean, median, and mode marked]

Right-skewed or Skewed Right

  • mean > median
  • long tail on the right
[Figure: a distribution skewed to the right, with the mean, median, and mode marked]
Note! When one has very skewed data, it is better to use the median as a measure of central tendency since the median is not much affected by extreme values.
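
As a rough sketch of how the comparisons above might be applied, the hypothetical helper below compares the mean with the median and reports which shape that comparison suggests. This is only a rule of thumb based on the bullet points in this section; a histogram should still be examined before drawing conclusions about shape.

```python
from statistics import mean, median

def describe_shape(data):
    """Rough rule of thumb: compare the mean with the median."""
    m, med = mean(data), median(data)
    if m < med:
        return "mean < median: suggests left skew; consider the median"
    if m > med:
        return "mean > median: suggests right skew; consider the median"
    return "mean = median: consistent with symmetry"

rates = [74, 79, 65, 75, 71, 74, 64, 73, 20]  # data from Example 1-2
print(describe_shape(rates))
```

For the participation rates, the mean (about 66.11) is below the median (73), which is consistent with the long left tail created by the value 20.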

Uses and Abuses of Summaries

Descriptive statistics allow Maria to show her data using pictures. However, as pointed out with the pie chart, not all presentations accurately portray the data. Since Maria is also balancing her reporting obligations against her funding needs, she might be tempted to present her data to convey very high usage rates or successes for her services. To avoid the temptation to misuse or misrepresent data, Maria needs to consider some of the ethics in statistics.


1.5 - Ethical Considerations


As in Maria’s case, you may now be thinking: have I ever misused data, or read a report in which data may have been misused? This thinking is part of the process of “ethics spotting”, the first step to realizing data may have some ethical issues. Fortunately, the American Statistical Association (ASA) publishes guidelines on ethics to guide us to appropriate courses of action when working with data. While there are not necessarily right or wrong answers to these questions, the fact that you are now thinking about ethics in data is the important part.

The American Statistical Association's Committee on Professional Ethics published the following: Ethical Guidelines for Statistical Practice

From their website:

The Ethical Guidelines address eight general topic areas and specify important ethical considerations under each topic.

  1. Professionalism points out the need for competence, judgment, diligence, self-respect, and worthiness of the respect of other people.
  2. Responsibilities to Funders, Clients, and Employers discusses the practitioner's responsibility for assuring that statistical work is suitable to the needs and resources of those who are paying for it, that funders understand the capabilities and limitations of statistics in addressing their problem, and that the funder's confidential information is protected.
  3. Responsibilities in Publications and Testimony addresses the need to report sufficient information to give readers, including other practitioners, a clear understanding of the intent of the work, how and by whom it was performed, and any limitations on its validity.
  4. Responsibilities to Research Subjects describes requirements for protecting the interests of human and animal subjects of research-not only during data collection but also in the analysis, interpretation, and publication of the resulting findings.
  5. Responsibilities to Research Team Colleagues addresses the mutual responsibilities of professionals participating in multidisciplinary research teams.
  6. Responsibilities to Other Statisticians or Statistical Practitioners notes the interdependence of professionals doing similar work, whether in the same or different organizations. Basically, they must contribute to the strength of their professions overall by sharing nonproprietary data and methods, participating in peer review, and respecting differing professional opinions.
  7. Responsibilities Regarding Allegations of Misconduct addresses the sometimes painful process of investigating potential ethical violations and treating those involved with both justice and respect.
  8. Responsibilities of Employers, Including Organizations, Individuals, Attorneys, or Other Clients Employing Statistical Practitioners encourages employers and clients to recognize the highly interdependent nature of statistical ethics and statistical validity. Employers and clients must not pressure practitioners to produce a particular "result," regardless of its statistical validity. They must avoid the potential social harm that can result from the dissemination of false or misleading statistical work.

Given these guidelines, we might consider some issues Maria might face. If she only has 2 children at her organization being treated for an eating disorder, presenting this information along with the neighborhood the children live in might violate her responsibility to protect the identities of her clients. While Maria did not intend to disclose anyone’s identity, the way in which she presented the basic descriptive statistics could lead to an unethical outcome.


1.6 - Summary


It is critical to know what kind of data you have, as just about every subsequent decision you make around statistics is based on this initial identification. Maria needs to understand the basic descriptive information about her clients in order to avoid exposing anyone's identity. She also needs to apply the appropriate graphs for her categorical and quantitative data.

So what does all of this mean? We might tell Maria to make sure she really understands the type of data she has and how she is planning on presenting it in light of any possible ethical consideration. The way Maria chooses to describe her data can have an impact on the kinds of conclusions drawn.

So in dealing with data, not only must we be technically correct in determining the type of data we have and matching the appropriate descriptive statistics and graphical representations, we must also do so in a manner that accurately represents our phenomena and does not allow our own biases and perspectives to bend the data. Finally, as a data consumer, you should become more aware of the possibilities of misrepresentation of data. The material in this course will help you learn to ask critical questions as you harness the incredible power and influence of statistics.

