Lesson 5: Relationships Between Measurement Variables

Lesson Overview

In this lesson, we will examine the relationship between measurement variables: how to picture them in scatterplots and how to understand what those pictures are telling us. The overall goal is to examine whether or not there is a relationship (association) between the variables plotted. In Lesson 6, we will discuss the relationship between different categorical variables.

This chart shows what kinds of graphs we can use for two different variables of the same type. If both variables are categorical, a clustered bar graph is appropriate. For two measurement variables, we can use a scatterplot.

Figure 5.1 Variable Types and Related Graphs

Objectives

After successfully completing this lesson you should be able to:

  • Explain the major features of correlation.
  • Identify the key features of a regression line.
  • Apply what it means to be statistically significant.
  • Find the predicted value of y for a given choice of x on a regression equation plot.
  • Critique evidence for the strength of an association in observational studies.

5.1 - Graphs for Two Different Measurement Variables

In a previous lesson, we learned about possible graphs to display measurement data. These graphs included dotplots, stemplots, histograms, and boxplots to view the distribution of one or more samples of a single measurement variable, and scatterplots to study two measurement variables at a time (see section 4.3).

Example 5.1 Graph of Two Measurement Variables

The following two questions were asked on a survey of 220 STAT 100 students:

  1. What is your height (inches)?
  2. What is your weight (lbs)?

Notice we have two different measurement variables. It would be inappropriate to put these two variables on side-by-side boxplots because they do not have the same units of measurement. Comparing height to weight is like comparing apples to oranges. However, we do want to put both of these variables on one graph so that we can determine if there is an association (relationship) between them. The scatterplot of this data is found in Figure 5.2.

The scatterplot with weight as y axis and height as x axis shows positive association between these two variables since weight increases as height increases.

Figure 5.2. Scatterplot of Weight versus Height

In Figure 5.2, we notice that as height increases, weight also tends to increase. These two variables have a positive association because as the values of one measurement variable tend to increase, the values of the other variable also increase. You should note that this holds true regardless of which variable is placed on the horizontal axis and which variable is placed on the vertical axis.

Example 5.2 Graph of Two Measurement Variables

The following two questions were asked on a survey of ten PSU students who live off-campus in unfurnished one-bedroom apartments.

  1. How far do you live from campus (miles)?
  2. How much is your monthly rent (\$)?

The scatterplot of this data is found in Figure 5.3.

The scatterplot with rent as y axis and distance from campus as x axis shows negative association between these two variables since rent decreases as distance increases.

Figure 5.3. Scatterplot of Monthly Rent versus Distance from campus

In Figure 5.3, we notice that the further an unfurnished one-bedroom apartment is away from campus, the less it costs to rent. We say that two variables have a negative association when the values of one measurement variable tend to decrease as the values of the other variable increase.

Example 5.3 Graph of Two Measurement Variables

The following two questions were asked on a survey of 220 Stat 100 students:

  1. About how many hours do you typically study each week?
  2. About how many hours do you typically exercise each week?

The scatterplot of this data is found in Figure 5.4.

The scatterplot with study hours as y axis and exercise hours as x axis shows no association between these two variables since the number of study hours does not increase or decrease as the number of exercise hours increases.

Figure 5.4. Scatterplot of Study Hours versus Exercise Hours

In Figure 5.4, we notice that as the number of hours spent exercising each week increases, the hours spent studying show no real pattern: there is no visible increase or decrease in their values. Consequently, we say that there is essentially no association between the two variables.


5.2 - Correlation & Significance

This lesson expands on the statistical methods for examining the relationship between two different measurement variables.  Remember that overall statistical methods are one of two types: descriptive methods (that describe attributes of a data set) and inferential methods (that try to draw conclusions about a population based on sample data).

Correlation

Many relationships between two measurement variables tend to fall close to a straight line. In other words, the two variables exhibit a linear relationship. The graphs in Figure 5.2 and Figure 5.3 show approximately linear relationships between the two variables.

It is also helpful to have a single number that will measure the strength of the linear relationship between the two variables. This number is the correlation. The correlation is a single number that indicates how close the values fall to a straight line.  In other words, the correlation quantifies both the strength and direction of the linear relationship between the two measurement variables. Table 5.1 shows the correlations for data used in Example 5.1 to Example 5.3. (Note: you would use software to calculate a correlation.)

Table 5.1. Correlations for Examples 5.1-5.3

Example | Variables | Correlation (r)
Example 5.1 | Height and Weight | \(r = .541\)
Example 5.2 | Distance and Monthly Rent | \(r = -.903\)
Example 5.3 | Study Hours and Exercise Hours | \(r = .109\)

 

Watch the movie below to get a feel for how the correlation relates to the strength of the linear association in a scatterplot.

 

Features of correlation

Below are some features about the correlation.

  • The correlation of a sample is represented by the letter r.
  • The possible values for a correlation range from -1 to +1.
  • A positive correlation indicates a positive linear association like the one in Example 5.1. The strength of the positive linear association increases as the correlation becomes closer to +1.
  • A negative correlation indicates a negative linear association. The strength of the negative linear association increases as the correlation becomes closer to -1.
  • A correlation of either +1 or -1 indicates a perfect linear relationship. This is hard to find with real data.
  • A correlation of 0 indicates either that:
    • there is no linear relationship between the two variables, and/or
    • the best straight line through the data is horizontal.
  • The correlation is independent of the original units of the two variables. This is because the correlation depends only on the relationship between the standard scores of each variable.
  • The correlation is calculated using every observation in the data set.
  • The correlation is a descriptive result.
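Two of the features above, that the correlation is built from standard scores and that it does not depend on the units of measurement, can be checked with a short sketch. The data values and the helper name `correlation` are invented for illustration and are not the survey data from Example 5.1; the formula used is the usual sample correlation (sum of products of standard scores divided by n - 1).

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r: sum of products of standard scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    z_products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Hypothetical heights (inches) and weights (lbs), made up for this sketch
heights_in = [62, 64, 66, 68, 70, 72, 74]
weights_lb = [120, 136, 130, 155, 150, 172, 180]
r_original = correlation(heights_in, weights_lb)

# The same measurements expressed in centimeters and kilograms
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

# Swapping which variable is x and which is y
r_swapped = correlation(weights_lb, heights_in)

print(round(r_original, 3), round(r_metric, 3), round(r_swapped, 3))
```

All three printed values should agree, because changing units or swapping the axes leaves every standard score unchanged.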

 

As you compare the scatterplots of the data from the three examples with their actual correlations, you should notice that findings are consistent for each example.

  • In Example 5.1, the scatterplot shows a positive association between weight and height.  However, there is still quite a bit of scatter around the pattern. Consequently, a correlation of .541 is reasonable. Extreme correlations are also less common in large samples such as this one, because the sample correlation varies less from sample to sample as the sample size increases.
  • In Example 5.2, the scatterplot shows a negative association between monthly rent and distance from campus. Since the data points are very close to a straight line it is not surprising the correlation is -.903.
  • In Example 5.3, the scatterplot does not show any strong association between exercise hours/week and study hours/week. This lack of association is supported by a correlation of .109.

Statistical Significance

A statistically significant relationship is one that is large enough to be unlikely to have occurred in the sample if there's no relationship in the population. The issue of whether a result is unlikely to happen by chance is an important one in establishing cause-and-effect relationships from experimental data.  If an experiment is well planned, randomization makes the various treatment groups similar to each other at the beginning of the experiment except for the luck of the draw that determines who gets into which group.  Then, if subjects are treated the same during the experiment (e.g. via double blinding), there can be two possible explanations for differences seen: 1) the treatment(s) had an effect or 2) differences are due to the luck of the draw.  Thus, showing that random chance is a poor explanation for a relationship seen in the sample provides important evidence that the treatment had an effect.
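One way to make "unlikely to have occurred if there's no relationship" concrete is a shuffle (permutation) check: re-pair the y values with the x values at random many times and see how often chance alone produces a correlation as strong as the one observed. This is only a sketch of one common approach, not necessarily the method behind the software output used in this course. The data, the function names, the number of shuffles, and the random seed are all assumptions made for illustration.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def shuffle_p_value(xs, ys, n_shuffles=2000, seed=1):
    """Share of random re-pairings whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(correlation(xs, ys))
    shuffled = list(ys)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(correlation(xs, shuffled)) >= observed:
            count += 1
    return count / n_shuffles

# Hypothetical distances (miles) and rents ($), not the data from Example 5.2
distance_miles = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
rent_dollars = [980, 940, 950, 900, 870, 880, 820, 800, 790, 740]

p = shuffle_p_value(distance_miles, rent_dollars)
print(p)
```

A small share means random pairing rarely matches the observed correlation, so luck of the draw is a poor explanation for it. It says nothing about which of the other explanations is the right one.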

The issue of statistical significance is also applied to observational studies - but in that case, there are many possible explanations for seeing an observed relationship, so a finding of significance cannot help in establishing a cause-and-effect relationship.  For example, an explanatory variable may be associated with the response because:

  • Changes in the explanatory variable cause changes in the response;
  • Changes in the response variable cause changes in the explanatory variable;
  • Changes in the explanatory variable contribute, along with other variables, to changes in the response;
  • A confounding variable or a common cause affects both the explanatory and response variables;
  • Both variables have changed together over time or space; or
  • The association may be the result of coincidence (the only issue on this list that is addressed by statistical significance).

Remember the key lesson: correlation demonstrates association, but association is not the same as causation, even with a finding of significance.


5.3 - Key Caveats with Correlations

There are three key caveats that must be recognized with regard to correlation.

  1. It is impossible to prove causal relationships with correlation. However, the strength of the evidence for such a relationship can be evaluated by examining and eliminating important alternate explanations for the correlation seen.
  2. Outliers can substantially inflate or deflate the correlation.
  3. Correlation describes the strength and direction of the linear association between variables. It does not describe non-linear relationships.
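The third caveat can be illustrated with a small made-up data set in which y is completely determined by x, yet the correlation is zero because the relationship is curved rather than linear. The values and the helper name `correlation` are assumptions for this sketch only.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y = x squared, a perfect but non-linear relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(r)
```

The correlation comes out as zero even though knowing x tells you y exactly, which is why a scatterplot should always be examined alongside the correlation.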

Correlation and Causation

It is often tempting to suggest that, when the correlation is statistically significant, the change in one variable causes the change in the other variable. However, outside of randomized experiments, there are numerous other possible reasons that might underlie the correlation. Thus, it is crucial to evaluate and eliminate the key alternative (non-causal) relationships outlined in section 5.2 to build evidence toward causation.

  1. Check for the possibility that the response might be directly affecting the explanatory variable (rather than the other way around). For example, you might suspect that the number of times children wash their hands might be causally related to the number of cases of the common cold amongst the children at a pre-school. However, it is also possible that children who have colds are made to wash their hands more often. In this example, it would also be important to evaluate the timing of the measured variables - does an increase in the amount of hand washing precede a decrease in colds or did it happen at the same time?
  2. Check whether changes in the explanatory variable contribute, along with other variables, to changes in the response. For example, the amount of dry brush in a forest does not cause a forest fire; but it will contribute to it if a fire is ignited.
  3. Check for confounders or common causes that may affect both the explanatory and response variables. For example, there is a moderate association between whether a baby is breastfed or bottle-fed and the number of incidences of gastroenteritis recorded on medical charts (with the breastfed babies showing more cases). But it turns out that breastfed babies also have, on average, more routine medical visits to pediatricians. Thus, the number of opportunities for mild cases of gastroenteritis to be recorded on medical charts is greater for the breastfed babies providing a clear confounder.
  4. Check whether both variables may have changed together over time or space. For example, data on the number of cases of internet fraud and on the amount spent on election campaigns in the United States taken over the last 30 years would have a strong association merely because they have both increased over time. As another example, if you examine the percent of the population with home computers and the life expectancy for every country in the world, there will be a positive association merely because richer countries have both higher life expectancy and greater computer use. Tyler Vigen's website lists thousands of spurious correlations that result from variables that coincidentally change the same way over time.
  5. Check whether the association between the variables might be just a matter of coincidence. This is where a check for the degree of statistical significance would be important. However, it is also important to consider whether the search for significance was a priori or a posteriori. For example, a story in the national news one year reported that at a hospital in Potsdam, New York, 15 babies in a row were all boys. Does that indicate that something at that hospital was causing more male than female births? Clearly, the answer is no, even if the chance of having 15 boys in a row is quite low (about 1 chance in 33,000). But there are over 5000 hospitals in the United States, and the story would be just as newsworthy if it happened at any one of them at any time of the year and for either 15 boys in a row or 15 girls in a row. Thus, it turns out that we actually expect a story like this to happen once or twice a year somewhere in the United States.
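The "about 1 chance in 33,000" figure in the last item can be reproduced with a quick calculation. The sketch below assumes, purely for illustration, that each birth is independently a boy or a girl with probability 1/2; the variable names are hypothetical.

```python
from fractions import Fraction

# Assumption for illustration: boys and girls are equally likely and births are independent
p_boy = Fraction(1, 2)
run_length = 15

p_all_boys = p_boy ** run_length       # 15 particular births in a row are all boys
p_all_same_sex = 2 * p_all_boys        # all boys OR all girls is equally newsworthy

print(p_all_boys)       # 1/32768
print(p_all_same_sex)   # 1/16384
```

Counting girls as well as boys already doubles the chance, and every hospital produces many different runs of 15 consecutive births each year, which is why such a story somewhere in the country is not surprising.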

Example 5.4: Effect of Outliers on Correlation

Below is a scatterplot of the relationship between the Infant Mortality Rate and the Percent of Juveniles Not Enrolled in School for each of the 50 states plus the District of Columbia. The correlation is 0.73, but looking at the plot one can see that for the 50 states alone the relationship is not nearly as strong as a 0.73 correlation would suggest. Here, the District of Columbia (identified by the X) is a clear outlier in the scatter plot being several standard deviations higher than the other values for both the explanatory (x) variable and the response (y) variable. Without Washington D.C. in the data, the correlation drops to about 0.5.

infant mortality scatterplot
Figure 5.5. Scatterplot with outlier

Correlation and Outliers

Correlations measure linear association: the degree to which relative standing on the x list of numbers (as measured by standard scores) is associated with relative standing on the y list. Since means and standard deviations, and hence standard scores, are very sensitive to outliers, the correlation will be as well.

In general, the correlation will either increase or decrease, based on where the outlier is relative to the other points remaining in the data set. An outlier in the upper right or lower left of a scatterplot will tend to increase the correlation while outliers in the upper left or lower right will tend to decrease a correlation.
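The effect of an outlier's position can be tried out on a small made-up data set. The eight base points and the two outlier locations below are invented for this sketch (they are not the state data from Example 5.4), and `correlation` is a hypothetical helper.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical base data with a moderate positive association
base_x = [1, 2, 3, 4, 5, 6, 7, 8]
base_y = [3, 1, 4, 2, 6, 4, 7, 5]
r_base = correlation(base_x, base_y)

# Add one outlier in the upper right corner of the plot
r_upper_right = correlation(base_x + [20], base_y + [20])

# Add one outlier in the lower right corner of the plot instead
r_lower_right = correlation(base_x + [20], base_y + [0])

print(round(r_base, 2), round(r_upper_right, 2), round(r_lower_right, 2))
```

With these particular numbers, the single upper-right point pushes the correlation well above its base value, while the single lower-right point pulls it down far enough to change its sign.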

Watch the two videos below. They are similar to the video in section 5.2 except that a single point (shown in red) in one corner of the plot is staying fixed while the relationship amongst the other points is changing. Compare each with the movie in section 5.2 and see how much that single point changes the overall correlation as the remaining points have different linear relationships.

 

Even though outliers may exist, you should not just quickly remove these observations from the data set in order to change the value of the correlation. As with outliers in a histogram, these data points may be telling you something very valuable about the relationship between the two variables. For example, in a scatterplot of in-town gas mileage versus highway gas mileage for all 2015 model year cars, you will find that hybrid cars are all outliers in the plot (unlike gas-only cars, a hybrid will generally get better mileage in-town than on the highway).


5.4 - Regression

Regression is a descriptive method used with two different measurement variables to find the best straight line (equation) to fit the data points on the scatterplot. A key feature of the regression equation is that it can be used to make predictions. In order to carry out a regression analysis, the variables need to be designated as either the:

Explanatory or Predictor Variable = x (on horizontal axis)

Response or Outcome Variable = y (vertical axis)

The explanatory variable can be used to predict (estimate) a typical value for the response variable. (Note: It is not necessary to indicate which variable is the explanatory variable and which variable is the response with correlation.)

Review: Equation of a Line

Let's review the basics of the equation of a line:

\(y = a + bx\) where:

a = y-intercept (the value of y when x = 0)

b = slope of the line. The slope is the change in one variable (y) as the other variable (x) increases by one unit. When b is positive, there is a positive association; when b is negative, there is a negative association.

In this graph, the line is drawn by the equation y equals a plus b times x. The intercept on y axis is a. And y will change by b units as x increases one unit.

Example 5.5: Example of Regression Equation

Consider the following two variables for a sample of ten Stat 100 students.

x = quiz score
y = exam score

Figure 5.6 displays the scatterplot of this data whose correlation is 0.883.

The scatterplot shows the positive relationship between quiz and exam since the exam score increases as the quiz score increases.

Figure 5.6. Scatterplot of Quiz versus exam scores

We would like to be able to predict the exam score based on the quiz score for students who come from this same population. To make that prediction, we notice that the points generally fall in a linear pattern, so we can use the equation of a line that will allow us to put in a specific value for x (quiz) and determine the best estimate of the corresponding y (exam). The line represents our best guess at the average value of y for a given x value, and the best line would be one that has the least variability of the points around it (i.e. we want the points to come as close to the line as possible). Remembering that the standard deviation measures the deviations of the numbers on a list about their average, we find the line that has the smallest standard deviation for the vertical distances from the points to the line. That line is called the regression line or the least squares line. In essence, least squares finds the line that is closer to the data points, taken together, than any other possible line. Figure 5.7 displays the least squares regression line for the data in Example 5.5.

In this scatterplot, a regression line has been added with the least squares method. The line slopes upward and passes through some data points.

Figure 5.7. Least Squares Regression Equation

As you look at the plot of the regression line in Figure 5.7, you find that some of the points lie above the line while other points lie below the line. In fact, the total vertical distance from the line to the points above it is exactly equal to the total vertical distance from the line to the points that fall below it.
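The balance between points above and below the line can be checked numerically. The sketch below fits a least squares line using the standard textbook formulas and then totals the vertical distances on each side. The quiz and exam scores are invented for illustration (they are not the data behind Figure 5.7), and the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(a, b, xs, ys):
    """Total squared vertical distance from the points to the line y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical quiz and exam scores for ten students
quiz = [56, 60, 65, 70, 74, 78, 82, 86, 90, 94]
exam = [62, 61, 72, 70, 81, 80, 90, 88, 98, 97]

a, b = least_squares_line(quiz, exam)

# Vertical distances (residuals): positive above the line, negative below it
residuals = [y - (a + b * x) for x, y in zip(quiz, exam)]
total_above = sum(r for r in residuals if r > 0)
total_below = -sum(r for r in residuals if r < 0)

print(round(a, 2), round(b, 2))
print(round(total_above, 4), round(total_below, 4))
```

The two totals match (up to rounding), and nudging either the intercept or the slope away from the fitted values makes `sum_squared_errors` larger, which is what "least squares" refers to.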

The least squares regression equation used to plot the equation in Figure 5.7 is:

\begin{align} &y = 1.15 + 1.05 x \text{ or}   \\ &\text{predicted exam score = 1.15 + 1.05 Quiz}\end{align}


Interpretation of Y-Intercept

Y-Intercept = 1.15 points

Y-Intercept Interpretation: If a student has a quiz score of 0 points, one would expect that he or she would score 1.15 points on the exam.

However, this y-intercept does not offer any logical interpretation in the context of this problem, because x = 0 is not in the sample. If you look at the graph, you will find the lowest quiz score is 56 points. So, while the y-intercept is a necessary part of the regression equation, by itself it provides no meaningful information about student performance on an exam when the quiz score is 0.


Interpretation of Slope

Slope = 1.05 = 1.05/1 = (change in exam score)/(1 unit change in quiz score)

Slope Interpretation: For every increase in quiz score by 1 point, you can expect that a student will score 1.05 additional points on the exam.

In this example, the slope is a positive number, which is not surprising because the correlation is also positive. A positive correlation always leads to a positive slope and a negative correlation always leads to a negative slope.
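The reason the signs must agree can be seen from the standard formulas for the least squares line. The symbols \(s_x\), \(s_y\), \(\bar{x}\), and \(\bar{y}\) (the standard deviations and averages of x and y) are not used elsewhere in this lesson and are introduced here only for illustration:

\begin{align} b &= r \cdot \frac{s_y}{s_x} \\ a &= \bar{y} - b\,\bar{x} \end{align}

Because the standard deviations \(s_x\) and \(s_y\) are always positive, the slope b has the same sign as the correlation r.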


Remember that we can also use this equation for prediction. So consider the following question:

 If a student has a quiz score of 85 points, what score would we expect the student to make on the exam? We can use the regression equation to predict the exam score for the student.

Exam = 1.15 + 1.05 Quiz
Exam = 1.15 + 1.05 (85) = 1.15 + 89.25 = 90.4 points

Figure 5.8 verifies that when a quiz score is 85 points, the predicted exam score is about 90 points.

In this plot, the predicted data point (85, 90.4) lies on the regression line.

Figure 5.8. Prediction of Exam Score at a Quiz Score of 85 Points
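The prediction above is simple enough to wrap in a small function. This is a minimal sketch using the intercept and slope from the regression equation in this example; the function name `predict_exam` is made up for illustration.

```python
def predict_exam(quiz_score, intercept=1.15, slope=1.05):
    """Predicted exam score from the fitted line: exam = 1.15 + 1.05 * quiz."""
    return intercept + slope * quiz_score

# Prediction for a quiz score of 85 points
print(round(predict_exam(85), 2))                      # 90.4

# Raising the quiz score by 1 point changes the prediction by the slope
print(round(predict_exam(86) - predict_exam(85), 2))   # 1.05
```

The second line of output restates the slope interpretation: each additional quiz point adds 1.05 points to the predicted exam score.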

Example 5.6

Let's return now to Example 4.8, the experiment examining the relationship between the number of beers you drink and your blood alcohol content (BAC) a half-hour later (scatterplot shown in Figure 4.8). Figure 5.9 below shows the scatterplot with the regression line included. The line is given by

predicted Blood Alcohol Content = -0.0127 + 0.0180(# of beers)

BAC Regression

Figure 5.9. Regression line relating # of beers consumed and blood alcohol content

Notice that four different students taking part in this experiment drank exactly 5 beers. For that group we would expect their average blood alcohol content to come out around -0.0127 + 0.0180(5) = 0.077. The line works really well for this group as 0.077 falls extremely close to the average for those four participants.


5.5 - Two Warnings about Regression

  1.  First Warning: Avoid Extrapolation

    Do not use the regression equation to predict values of the response variable (y) for explanatory variable (x) values that are outside the range found in the original data. Remember that not all relationships are linear (most are not), so when we look at a scatterplot we can only confirm that there is a linear pattern within the range of the data at hand. The pattern may very well change shape outside that range, so using a line for extrapolation is inappropriate. With Example 5.5, prediction is restricted to quiz scores that lie between 56 points and 94 points, as shown in Figure 5.8. With Example 5.6, the blood alcohol content is linear in the range of the data. But clearly, the linear pattern cannot hold for, say, 60 beers (the line would predict that your blood is more than 100% alcohol at that point!).

  2.  Second Warning: Logical Interpretation of the y-intercept in the context of a problem

    The y-intercept has a logical interpretation only when x = 0 is in the sample. For example, the y-intercept for the regression equation in Example 5.6 is -0.0127, but clearly, it is impossible for BAC to be negative.  In fact, in the actual experiment, the police officer taking the BAC measurements using the breathalyzer machine tested all participants before the experiment started to be sure they registered with a BAC = 0. As another example, suppose that you have data from a particular school district that was used to determine a regression equation relating salary (in \$) to years of service (ranging from 0 years to 25 years).   The resulting regression equation is: 

    \(Salary=\$ 29,000+\frac{\$ 1,500}{year} \times (Years\ of\ Service) \)

    Even if you had not been told that "years of service (the x variable)" = 0 was in the sample, you would expect that there would be values with "years of service" = 0 since starting salaries would be in the data set. Therefore, the y-intercept has a logical interpretation in the context of this problem.  However, many samples do not contain x = 0 in the data set, and we cannot logically interpret those y-intercepts.
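Both warnings can be built into a prediction helper that refuses to extrapolate. The sketch below uses the BAC line from Example 5.6. The observed range of 1 to 9 beers is a placeholder assumed for illustration (this lesson does not state the range of the data), and the names `line_value` and `predict_bac` are hypothetical.

```python
INTERCEPT = -0.0127
SLOPE = 0.0180
OBSERVED_BEERS = (1, 9)  # hypothetical range; replace with the range in your own data

def line_value(beers):
    """Value of the regression line, with no check on whether the input is sensible."""
    return INTERCEPT + SLOPE * beers

def predict_bac(beers, observed_range=OBSERVED_BEERS):
    """Predicted BAC, allowed only for inputs inside the observed range of the data."""
    low, high = observed_range
    if not low <= beers <= high:
        raise ValueError(
            f"{beers} beers is outside the observed range {low} to {high}; "
            "refusing to extrapolate"
        )
    return line_value(beers)

print(round(predict_bac(5), 4))    # 0.0773, inside the range of the data
print(round(line_value(60), 4))    # 1.0673, the unguarded line far outside the data
print(round(line_value(0), 4))     # -0.0127, an impossible negative BAC at x = 0
```

Calling `predict_bac(60)` or `predict_bac(0)` raises an error instead of returning one of the meaningless values that the unguarded line produces.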


5.7 - Have Fun With It!

cartoon about outliers, "It's a non-linear pattern with outliers...but for some reason I'm very happy with the data!"

J.B. Landers ©

Correlation Song

lyrics copyright ©2013 by Lawrence Mark Lesser
sing to the tune of the English lullaby "Twinkle Twinkle Little Star" (Jane Taylor)

Are points near a line, or far?
What's the correlation, r?
If the fit supports a line,
Its slope and r would share the sign.
Twinkle, twinkle, you're a star:
Knowing stats will take you far!

