5.2 - Correlation & Significance
This lesson expands on the statistical methods for examining the relationship between two different measurement variables. Recall that statistical methods fall into two broad types: descriptive methods (that describe attributes of a data set) and inferential methods (that try to draw conclusions about a population based on sample data).
Many relationships between two measurement variables tend to fall close to a straight line. In other words, the two variables exhibit a linear relationship. The graphs in Figure 5.2 and Figure 5.3 show approximately linear relationships between the two variables.
It is also helpful to have a single number that measures the strength of the linear relationship between the two variables. This number is the correlation: it indicates how closely the values fall to a straight line, quantifying both the strength and the direction of the linear relationship between the two measurement variables. Table 5.1 shows the correlations for the data used in Example 5.1 to Example 5.3. (Note: you would use software to calculate a correlation.)
| Example | Variables | Correlation (r) |
|---|---|---|
| Example 5.1 | Height and Weight | \(r = .541\) |
| Example 5.2 | Distance and Monthly Rent | \(r = -.903\) |
| Example 5.3 | Study Hours and Exercise Hours | \(r = .109\) |
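As a rough illustration of what software does when it calculates a correlation, here is a minimal Python sketch. The function name `pearson_r` and the height and weight values are made up for this illustration; they are not the data behind Table 5.1, so the result will not match the values in the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative values (NOT the data from Example 5.1).
heights_in = [62, 64, 66, 67, 69, 71, 73]
weights_lb = [120, 140, 135, 160, 155, 180, 170]

r = pearson_r(heights_in, weights_lb)
print(f"r = {r:.3f}")  # positive, because taller people here tend to weigh more
```

In practice you would likely rely on a statistics package rather than writing the formula yourself; the sketch only shows that the calculation uses every pair of observations.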
Features of correlation
Below are some features of the correlation.
- The correlation of a sample is represented by the letter r.
- The possible values for a correlation range from -1 to +1.
- A positive correlation indicates a positive linear association, like the one between height and weight in Example 5.1. The strength of the positive linear association increases as the correlation becomes closer to +1.
- A negative correlation indicates a negative linear association. The strength of the negative linear association increases as the correlation becomes closer to -1.
- A correlation of either +1 or -1 indicates a perfect linear relationship. This is hard to find with real data.
- A correlation of 0 indicates either that:
  - there is no linear relationship between the two variables, and/or
  - the best straight line through the data is horizontal.
- The correlation is independent of the original units of the two variables. This is because the correlation depends only on the relationship between the standard scores of each variable.
- The correlation is calculated using every observation in the data set.
- The correlation is a descriptive result.
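Two of the features above can be checked with a short Python sketch: the correlation depends only on the standard scores, so it does not change when the units change, and perfectly linear data give a correlation of exactly +1 or -1. The helper names and the data values are hypothetical, chosen only for illustration. The sketch assumes the usual definition of r as the sum of the products of standard scores divided by n - 1.

```python
from statistics import mean, stdev

def standard_scores(values):
    """Convert values to standard scores: (value - mean) / sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation_from_z(xs, ys):
    """r = sum of products of paired standard scores, divided by n - 1."""
    zx, zy = standard_scores(xs), standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Made-up illustrative values (not taken from the lesson's examples).
heights_in = [62, 64, 66, 67, 69, 71, 73]
weights_lb = [120, 140, 135, 160, 155, 180, 170]

# The same measurements expressed in different units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = correlation_from_z(heights_in, weights_lb)
r_metric = correlation_from_z(heights_cm, weights_kg)
print(f"inches/pounds: r = {r_imperial:.4f}")
print(f"cm/kg:         r = {r_metric:.4f}")  # same value up to rounding error

# Points that fall exactly on a downward-sloping line.
r_perfect_neg = correlation_from_z([1, 2, 3, 4], [10, 8, 6, 4])
print(f"perfect negative line: r = {r_perfect_neg:.4f}")
```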
As you compare the scatterplots of the data from the three examples with their actual correlations, you should notice that findings are consistent for each example.
- In Example 5.1, the scatterplot shows a positive association between weight and height. However, there is still quite a bit of scatter around the pattern. Consequently, a correlation of .541 is reasonable. Keep in mind that correlations computed from small samples vary more from sample to sample, so large values are more likely to arise by chance when the sample is small.
- In Example 5.2, the scatterplot shows a negative association between monthly rent and distance from campus. Since the data points are very close to a straight line, it is not surprising that the correlation is -.903.
- In Example 5.3, the scatterplot does not show any strong association between exercise hours/week and study hours/week. This lack of association is supported by a correlation of .109.
A statistically significant relationship is one that is large enough to be unlikely to have occurred in the sample if there's no relationship in the population. The issue of whether a result is unlikely to happen by chance is an important one in establishing cause-and-effect relationships from experimental data. If an experiment is well planned, randomization makes the various treatment groups similar to each other at the beginning of the experiment except for the luck of the draw that determines who gets into which group. Then, if subjects are treated the same during the experiment (e.g. via double blinding), there can be two possible explanations for differences seen: 1) the treatment(s) had an effect or 2) differences are due to the luck of the draw. Thus, showing that random chance is a poor explanation for a relationship seen in the sample provides important evidence that the treatment had an effect.
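One way to make the "luck of the draw" idea concrete is a permutation test, sketched below in Python. This is not necessarily the method used in this course; it is one simple approach, shown under stated assumptions. Shuffling one variable breaks any real link between the two, so the shuffled correlations show what chance alone tends to produce. The function names, the number of shuffles, and the distance and rent values are all hypothetical.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Sample correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Estimate how often chance alone gives a correlation at least as
    strong (in either direction) as the one observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # random re-pairing: no real relationship remains
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)

# Made-up illustrative values (not the data from Example 5.2).
distance_miles = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
monthly_rent = [950, 900, 910, 820, 800, 760, 700, 720, 650, 600]

p_value = permutation_p_value(distance_miles, monthly_rent)
print(f"observed r = {pearson_r(distance_miles, monthly_rent):.3f}")
print(f"estimated p-value = {p_value:.4f}")
```

A small p-value says that random chance is a poor explanation for the relationship seen in the sample. It does not, on its own, say what the explanation is.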
Statistical significance is also assessed in observational studies. There, however, many explanations are possible for an observed relationship, so a finding of significance cannot by itself establish a cause-and-effect relationship. For example, an explanatory variable may be associated with the response because:
- Changes in the explanatory variable cause changes in the response;
- Changes in the response variable cause changes in the explanatory variable;
- Changes in the explanatory variable contribute, along with other variables, to changes in the response;
- A confounding variable or a common cause affects both the explanatory and response variables;
- Both variables have changed together over time or space; or
- The association may be the result of coincidence (the only issue on this list that is addressed by statistical significance).
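The common-cause explanation in the list above can be illustrated with a small simulation. In this hypothetical Python sketch, neither variable affects the other; both are built from the same underlying variable plus independent noise. The variable names, the sample size, and the noise levels are assumptions made only for the illustration.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Sample correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the simulation is repeatable
n = 2000

# A hidden common cause that influences both variables.
common_cause = [rng.gauss(0, 1) for _ in range(n)]

# Each variable = common cause + its own independent noise.
# Note that explanatory never appears in the formula for response.
explanatory = [c + rng.gauss(0, 1) for c in common_cause]
response = [c + rng.gauss(0, 1) for c in common_cause]

r_confounded = pearson_r(explanatory, response)
print(f"r between explanatory and response = {r_confounded:.3f}")
```

With a sample this large, the correlation is clearly positive and would easily be found statistically significant, even though changing the explanatory variable would do nothing to the response.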
Remember the key lesson: correlation demonstrates association, but association is not the same as causation, even with a finding of significance.