So far, we've referred to both gamma and tau-b as "correlation-like" in that they measure both the strength and the direction of an association. The correlation being referred to is **Pearson's correlation**, or **Pearson's correlation coefficient**. As a parameter, Pearson's correlation is defined for two (quantitative) random variables \(Y\) and \(Z\) as

\( \rho=\dfrac{Cov(Y,Z)}{\sigma_Y\sigma_Z} \)

where \(Cov(Y,Z)=E[(Y-\mu_Y)(Z-\mu_Z)]\) is the covariance of \(Y\) and \(Z\). To see intuitively how this measures linear association, notice that \((Y-\mu_Y)(Z-\mu_Z)\) will be positive if \(Y\) and \(Z\) tend to be large (greater than their means) or tend to be small (less than their means) together, and negative if one tends to be large when the other is small. Dividing by their standard deviations removes the units involved with the variables and results in a quantity that always falls within \([-1,1]\). The sample estimate of \(\rho\) is

\(r = \dfrac{(n-1)^{-1}\sum_{i=1}^n(Y_i-\overline{Y})(Z_i-\overline{Z})}{s_Ys_Z} \)

where \(s_Y\) and \(s_Z\) are the sample standard deviations of \(Y\) and \(Z\), respectively. Both the population \(\rho\) and the sample estimate \(r\) satisfy the following properties:

- \(-1 \le r \le 1\)
- \(r = 0\) corresponds to no (linear) relationship
- \(r = \pm1\) corresponds to perfect association, which for a two-way (square) table means that all observations fall into the diagonal cells.

When classification is ordinal, we can sometimes assign quantitative or numerically meaningful scores to the categories, which allows Pearson's correlation to be defined and estimated. The null hypothesis of interest in such a situation is that the correlation is 0, meaning specifically that there is no linear trend among the ordinal levels. If the null hypothesis of (linear) independence is rejected, it is natural and meaningful to then measure the linear trend. It should be noted that the correlation considered is defined from the scores of the categories, so different choices of scores will generally result in different correlation measures.

A common approach to assigning scores to \(J\) ordered categories is to use their ranks: \(1,2,\ldots,J\) from least to greatest. Pearson's correlation formula applied to this choice of scores is known as **Spearman's correlation (Spearman's rho)**. Even when the variables are originally quantitative, Spearman's rho is a popular non-parametric choice because it depends only on the ordering of the values and so is not influenced by how extreme an outlier is. For example, the values 3.2, 1.8, and 4.9 have the same ranks (2, 1, 3) as the values 3.2, 1.8, and 49.0. We've encountered this idea of ranks already in our discussion of cumulative odds, but the difference here is that the scores are treated as numerically meaningful and, if both variables are scored this way, they define a correlation that measures linear association.
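The effect of ranking on an outlier can be checked directly. The following is a minimal sketch in plain Python; the helper name `ranks` is introduced here for illustration only, and it assumes the values contain no ties.

```python
def ranks(values):
    """Return the rank (1 = smallest) of each value; assumes no ties."""
    # Indices of the values, ordered from smallest value to largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

original = [3.2, 1.8, 4.9]
with_outlier = [3.2, 1.8, 49.0]

print(ranks(original))      # [2, 1, 3]
print(ranks(with_outlier))  # [2, 1, 3]
```

Because both lists map to the same ranks, any statistic computed from the ranks alone is unchanged by replacing 4.9 with 49.0.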

For the 2018 GSS survey data relating *jobsecok* to *happy*, we can assign the scores (1,2,3,4) going down the rows to indicate increasing agreement (to the statement on job security) and scores (1,2,3) going across the columns from left to right to indicate increasing happiness.

*jobsecok* score (rows) / *happy* score (columns) | 1 | 2 | 3 |
---|---|---|---|
1 | 15 | 25 | 5 |
2 | 21 | 47 | 21 |
3 | 64 | 248 | 100 |
4 | 73 | 474 | 311 |

The correlation calculated from these scores is then 0.1948, which represents a relatively weak, positive linear association.
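One way to see where this number comes from is to treat each of the respondents as a pair of scores and apply the sample formula for \(r\) directly. The sketch below does this in plain Python; the function name `pearson_r` and the variable names are illustrative choices, not part of the original analysis.

```python
from math import sqrt

# Cell counts from the table (rows: jobsecok, columns: happy).
counts = [
    [15, 25, 5],
    [21, 47, 21],
    [64, 248, 100],
    [73, 474, 311],
]
row_scores = [1, 2, 3, 4]
col_scores = [1, 2, 3]

# Expand the table into one (row score, column score) pair per respondent.
y, z = [], []
for i, row in enumerate(counts):
    for j, n_ij in enumerate(row):
        y.extend([row_scores[i]] * n_ij)
        z.extend([col_scores[j]] * n_ij)

def pearson_r(y, z):
    """Sample Pearson correlation; the (n - 1) factors cancel out."""
    n = len(y)
    y_bar = sum(y) / n
    z_bar = sum(z) / n
    s_yz = sum((a - y_bar) * (b - z_bar) for a, b in zip(y, z))
    s_yy = sum((a - y_bar) ** 2 for a in y)
    s_zz = sum((b - z_bar) ** 2 for b in z)
    return s_yz / sqrt(s_yy * s_zz)

r = pearson_r(y, z)
print(len(y), round(r, 4))  # 1404 0.1948
```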

## Assigning Scores

As an alternative to scores based on ranks, we may have another choice that appropriately reflects the nature of the ordered categories. In general, we denote scores for the categories of the row variable by \(u_1 \le u_2 \le\cdots\le u_I\) and of the column variable by \(v_1\le v_2 \le\cdots\le v_J\). This defines a correlation as a measure of linear association between these two scored variables and can be estimated with the same formula used for Pearson's correlation. For an \(I\times J\) table with cell counts \(n_{ij}\), this estimate is

\(r=\dfrac{\sum_i\sum_j(u_i-\bar{u})(v_j-\bar{v})n_{ij}}{\sqrt{[\sum_i\sum_j(u_i-\bar{u})^2 n_{ij}][\sum_i\sum_j(v_j-\bar{v})^2n_{ij}]}}\)

where \(\bar{u}=\sum_i\sum_j u_i n_{ij}/n\) is the row mean, and \(\bar{v}=\sum_i\sum_j v_j n_{ij}/n\) is the column mean.

Note that the assignment of scores is arbitrary and will in general influence the resulting correlation. As a demonstration of this in our current example, suppose we assign the scores (1,2,9,10) to *jobsecok* and the scores (1,9,10) to *happy*. Such a choice would make sense if the response categories were spaced more erratically. The resulting correlation would be \(r=0.1698\).
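The count-weighted formula for \(r\) can be evaluated for both sets of scores without expanding the table into individual observations. The following is a minimal sketch in plain Python; `scored_correlation` is a hypothetical helper name introduced here.

```python
from math import sqrt

def scored_correlation(counts, u, v):
    """Correlation for an I x J table of counts with row scores u and column scores v."""
    n = sum(sum(row) for row in counts)
    # One (row score, column score, count) triple per cell.
    cells = [(u[i], v[j], n_ij)
             for i, row in enumerate(counts)
             for j, n_ij in enumerate(row)]
    u_bar = sum(ui * n_ij for ui, _, n_ij in cells) / n
    v_bar = sum(vj * n_ij for _, vj, n_ij in cells) / n
    s_uv = sum((ui - u_bar) * (vj - v_bar) * n_ij for ui, vj, n_ij in cells)
    s_uu = sum((ui - u_bar) ** 2 * n_ij for ui, _, n_ij in cells)
    s_vv = sum((vj - v_bar) ** 2 * n_ij for _, vj, n_ij in cells)
    return s_uv / sqrt(s_uu * s_vv)

# 2018 GSS counts (rows: jobsecok, columns: happy).
counts = [
    [15, 25, 5],
    [21, 47, 21],
    [64, 248, 100],
    [73, 474, 311],
]

r_equal = scored_correlation(counts, [1, 2, 3, 4], [1, 2, 3])
r_uneven = scored_correlation(counts, [1, 2, 9, 10], [1, 9, 10])
print(round(r_equal, 4), round(r_uneven, 4))  # 0.1948 0.1698
```

The same counts give two different correlations, and the only thing that changed is the spacing of the scores.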

Thus, to avoid "data-snooping" or artificially significant results, scores should be assigned first, chosen thoughtfully to reflect the nature of the data, and results should be interpreted precisely in terms of the scores that were chosen.

## Midrank Scores

If the row and column totals of the table are highly unbalanced, different scores can produce large differences in the estimated correlation \(r\). One option for assigning scores that accounts for this is to use **midranks**. The idea is based on ranks, but instead of assigning 1, 2, etc. to the categories directly, the ranks are first assigned to the individuals \(1,2,\ldots,n\) for each variable, and then category scores are calculated based on the average rank for the individuals in that category.

For an \(I\times J\) table with \(n_{i+}\) and \(n_{+j}\) representing the row and column totals, respectively, the midrank score for the 1st category of the row variable is

\(u_1=\dfrac{1+2+\cdots+n_{1+}}{n_{1+}} \)

For the second category, we also assign the average rank of the individuals, but note that the ranks begin where they ended in the first category:

\(u_2=\dfrac{(n_{1+}+1)+\cdots+(n_{1+}+n_{2+})}{n_{2+}} \)

And in general, this formula is expressed as

\(u_i=\dfrac{(\sum_{k=1}^{i-1}n_{k+}+1)+\cdots+(\sum_{k=1}^{i}n_{k+})}{n_{i+}} \)
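Because the ranks within a category are consecutive integers, their average is the midpoint of the first and last rank. As a short derivation, writing \(m_i=\sum_{k=1}^{i-1}n_{k+}\) (a shorthand introduced here, not part of the notation above) for the number of individuals ranked before category \(i\), the general formula simplifies to

\(u_i=\dfrac{(m_i+1)+(m_i+n_{i+})}{2}=\sum_{k=1}^{i-1}n_{k+}+\dfrac{n_{i+}+1}{2} \)

so that, for instance, \(u_1=(n_{1+}+1)/2\).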

Similarly, midrank scores for the column variable are based on column totals. Ultimately, we end up with \((u_1,\ldots,u_I)\) for the row variable and \((v_1,\ldots,v_J)\) for the column variable, but the values and the spacing of these scores depend on the number of individuals in each category. Let's see how this works for the 2018 GSS example.

*jobsecok* (rows) / *happy* (columns) | Not too happy | Pretty happy | Very happy |
---|---|---|---|
Not at all true | 15 | 25 | 5 |
Not too true | 21 | 47 | 21 |
Somewhat true | 64 | 248 | 100 |
Very true | 73 | 474 | 311 |

The first row corresponds to the \(n_{1+}=45\) individuals replying "Not at all true". If they were individually ranked from 1 to 45, the average of these ranks would be \(u_1=(1+\cdots+45)/45=23\). Likewise, the \(n_{2+}=89\) individuals in the second row would be ranked from 46 to 134 with average rank \(u_2=(46+\cdots+134)/89=90\). Continuing in this way, we find the midrank scores for *jobsecok* to be \((23,90,340.5,975.5)\) and those for *happy* to be \((87,570.5,1186)\), from which we can calculate the correlation \(0.1849\).
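The midrank calculation can be reproduced from the marginal totals alone. The sketch below is a plain-Python illustration; the helper name `midranks` is hypothetical. It uses two facts: the average of consecutive ranks is the midpoint of the first and last rank, and the count-weighted mean of the midranks is \((n+1)/2\).

```python
from math import sqrt

# 2018 GSS counts (rows: jobsecok, columns: happy).
counts = [
    [15, 25, 5],
    [21, 47, 21],
    [64, 248, 100],
    [73, 474, 311],
]

def midranks(totals):
    """Average rank of the individuals in each category, given category totals."""
    scores, ranked_so_far = [], 0
    for total in totals:
        # Ranks ranked_so_far + 1, ..., ranked_so_far + total average to this value.
        scores.append(ranked_so_far + (total + 1) / 2)
        ranked_so_far += total
    return scores

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
u = midranks(row_totals)
v = midranks(col_totals)
print(u)  # [23.0, 90.0, 340.5, 975.5]
print(v)  # [87.0, 570.5, 1186.0]

# Correlation of the midrank scores, weighted by the cell counts.
n = sum(row_totals)
mean_rank = (n + 1) / 2  # count-weighted mean of the midranks for both variables
s_uv = sum((u[i] - mean_rank) * (v[j] - mean_rank) * counts[i][j]
           for i in range(len(u)) for j in range(len(v)))
s_uu = sum((ui - mean_rank) ** 2 * t for ui, t in zip(u, row_totals))
s_vv = sum((vj - mean_rank) ** 2 * t for vj, t in zip(v, col_totals))
r = s_uv / sqrt(s_uu * s_vv)
print(round(r, 4))  # 0.1849
```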