8.1 - Statistical Analysis


Count Data as Proportions

Tables are often reduced to proportions. This is helpful in interpreting the tables and for visualization, but for statistical analysis we need the actual counts.

Below we show the row proportions (in parentheses beside each count) - for example, 0.42 = 20/47.

        F           M           NA          Row Sum
ALL     20 (0.42)   21 (0.45)   6 (0.13)    47
AML     3 (0.12)    5 (0.20)    17 (0.68)   25
Total   23          26          23          72

You can see immediately that the NAs are a much bigger proportion of the AML samples than of the ALL samples. This is obvious from the numbers; however, it is often more difficult to compare raw counts when they are large, but quite simple to compare the proportions.

We also look at the column proportions - e.g. 0.87 = 20/23. Here we can see that they are not so different for males and females, but they are very different for the NA group.

        F           M           NA          Row Sum
ALL     20 (0.87)   21 (0.81)   6 (0.26)    47
AML     3 (0.13)    5 (0.19)    17 (0.74)   25
Total   23          26          23          72

Finally, we have the cell entries as a proportion of the total number of samples - e.g. 0.28 = 20/72.

        F           M           NA          Row Sum
ALL     20 (0.28)   21 (0.29)   6 (0.08)    47
AML     3 (0.04)    5 (0.07)    17 (0.24)   25
Total   23          26          23          72
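These three sets of proportions are easy to reproduce in R using prop.table(); here is a minimal sketch, assuming the counts above (the object name tab is ours):

# Build the 2 x 3 table of counts (rows: cancer type; columns: gender)
tab <- matrix(c(20, 3, 21, 5, 6, 17), nrow = 2,
              dimnames = list(c("ALL", "AML"), c("F", "M", "NA")))

prop.table(tab, margin = 1)   # row proportions,     e.g. 20/47 = 0.42
prop.table(tab, margin = 2)   # column proportions,  e.g. 20/23 = 0.87
prop.table(tab)               # overall proportions, e.g. 20/72 = 0.28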

One of the questions that we often ask in various ways is: are the rows independent of the columns? For example, is a certain cancer type more prevalent in males or females? If the proportions are about the same in males and females, then cancer type is independent of gender; if not, then cancer type is dependent on gender.

        F    M    Row Sum
ALL     20   21   41
AML     3    5    8
Total   23   26   49

One measure of dependence in tables is called the odds ratio. It can be computed for any two rows and two columns of a table. The odds ratio is a measure of dependence between the rows and the columns in a 2 × 2 table; in a bigger table, there are multiple choices of the two rows and columns, giving a more complex picture of dependence. In the 4 selected cells, let \(p_1\) be the row proportion of the top left cell, and let \(p_2\) be the row proportion of the bottom left cell. Then the odds ratio O is:

\[O=\frac{p_1}{1-p_1}/ \frac{p_2}{1-p_2}.\]

Alternatively, \(p_1\) can be the column proportion of the top left cell, in which case \(p_2\) should be the column proportion of the top right cell - the odds ratio will be the same. If the rows and columns are exactly independent, the odds ratio will equal 1. However, due to variability in the data, we usually need to assess whether the odds ratio is significantly different from 1.

Here is the odds ratio for this table.

\[\begin{align}
O&=\frac{p_1}{1-p_1}/ \frac{p_2}{1-p_2}\\
&= \frac{20/41}{21/41}/\frac{3/8}{5/8}\\
&= \frac{20/23}{3/23}/\frac{21/26}{5/26}\\
&= \frac{20 \times 5}{3 \times 21}\\
&= \frac{100}{63} \approx 1.59\\
\end{align}\]
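The same arithmetic in R, reusing the tab object from the sketch above:

# Sample odds ratio from the F and M columns of the table
(tab["ALL", "F"] * tab["AML", "M"]) / (tab["AML", "F"] * tab["ALL", "M"])
# (20 * 5) / (3 * 21) = 100/63, about 1.59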

The odds ratio is about 1.59. What does this mean? Does it mean that gender and cancer type are dependent, since the ratio is not 1.0? Or is 1.59 in the range of typical values that you would expect from natural variability, even when gender and cancer type are independent? The chi-squared test (when the cell counts are large) and Fisher's exact test are both tests of the null hypothesis of independence of the rows and columns, or equivalently that the population odds ratio is 1.0.

Tests of Independence in Tabular Data - Chi-squared test

If you took a course in basic statistics, you probably learned the chi-squared test. The chi-squared (\(\chi^2\)) test of independence tests whether the row proportions depend on the columns, or the columns on the rows. If there is no dependence, the proportions in each column should be about the same. The test is based on the expected proportion in each cell of the table when the rows and columns are independent. These proportions are converted to expected counts in each cell based on the total number of items in the table. Then the deviation between the expected and observed counts is computed. It is important to use the deviation of counts, rather than the deviation of proportions, because the larger the number of items in the cell, the larger the SD of the counts. The chi-squared test is based on whether the observed deviations are bigger than expected by chance.

Observed counts, with expected counts in parentheses; the Row Sum column shows the row proportions in parentheses.

        F            M            NA           Row Sum
ALL     20 (15.01)   21 (16.97)   6 (15.01)    47 (0.65)
AML     3 (7.99)     5 (9.03)     17 (7.99)    25 (0.35)
Total   23           26           23           72

The best guess would be that 65% of all possible leukemia cases in these hospitals are ALL and 35% are AML. Therefore, if cancer type does not depend on gender, we would expect 65% of the 23 women to have ALL and the remainder to have AML, i.e. \(E(ALL, F)=23 \times 47/72=15.01\). Similarly, 65% of the men are expected to have ALL and 35% to have AML, and the same proportions should hold in the NA column. These proportions are used to compute the expected values in each cell.

The \(\chi^2\) test statistic formula is then:

\[C^2=\sum_i\sum_j\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

where \(O_{ij}\) is the observed count and \(E_{ij}\) is the expected count in row i and column j. When all of the expected values are larger than five and the column proportions are independent of the rows (or the row proportions independent of the columns), the distribution of \(C^2\) is approximately \(\chi^2\) with (r - 1)(c - 1) d.f., where r is the number of rows and c is the number of columns.

In this case the chi-squared statistic is \(C^2\) = 23.1074 with 2 d.f. and a p-value of 9.6e-06. This is highly significant. But we can also see where the deviations are, and the largest deviations are in the NA column. In this column we do not know the gender, so this is not the interesting column!
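In R the whole computation is one call to chisq.test(), which also returns the expected counts; a sketch, again reusing our tab object:

res <- chisq.test(tab)   # no continuity correction for tables larger than 2 x 2
res$expected             # the expected counts shown in the table above
res$statistic            # C^2 = 23.1074
res$p.value              # 9.6e-06, with 2 d.f.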

One of the interesting things about a chi-squared statistic is that you can take out a whole row or a whole column and redo the test. When we do this and remove the NA column, we encounter two problems. Firstly, because of the small expected counts for AML, the distribution of the test statistic under the null hypothesis is not chi-squared, so our computed p-values are inaccurate. Secondly, because we want to make inferences about the association between disease type and gender, and because we know that about 1/3 of the data did not include gender, we have to acknowledge that there might be a problem with our statistical conclusions.

To deal with the problem of small expected counts, a correction to the test called the Yates continuity correction provides more accurate p-values. However, when there are cells with small expected values, we usually use another test, Fisher's exact test, also called the hypergeometric test. It is explained in the next subsection.

Observed counts, with expected counts in parentheses; the Row Sum column shows the row proportions in parentheses.

        F            M            Row Sum
ALL     20 (19.24)   21 (21.76)   41 (0.84)
AML     3 (3.76)     5 (4.24)     8 (0.16)
Total   23           26           49

The Pearson's chi-squared test with Yates continuity correction gives:

\(C^2\)= 0.039 with 1 d.f. and  p-value = 0.8434.

This is a very small chi-squared statistic with a very large p-value, so we conclude there is no evidence of a gender effect.
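In R, chisq.test() applies the Yates correction to 2 × 2 tables by default, so dropping the NA column reproduces this result; a sketch (tab2 is our name for the reduced table):

tab2 <- tab[, c("F", "M")]   # drop the NA column
chisq.test(tab2)             # Yates' continuity correction applied by default for 2 x 2
# X-squared = 0.039, df = 1, p-value = 0.8434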

The \(\chi^2\) distribution is a continuous distribution, while tabular data are discrete. For example, in our case there are two rows, two columns and 49 patients, so there is only a finite number of possible values of \(C^2\). The number of possible values is quite small because the test is actually conditional on the row and column sums. (That means that the only values of the test statistic that are considered come from tables with the same "Total" row and the same "Row Sum" column as the table above.) Because the margins are fixed, once we put a number in any one cell of the table, we know all of the other entries by subtraction. That is why we have 1 d.f.

The \(\chi^2\) test is what is called an asymptotic test: the distribution of the test statistic under the null hypothesis is an approximation which is accurate for very large samples. The Yates continuity correction improves the accuracy for somewhat smaller samples. When the d.f. are larger, the rule of thumb that the expected values have to be greater than 5 can be relaxed, as long as only a few cells violate the rule.

 

Tests of Independence in Tabular Data - Fisher's Exact Test

If the expected counts are small, an exact p-value can be computed using the hypergeometric distribution. This is called Fisher's exact test. It is computationally intensive; since we have better computing power than Fisher did, we often use Fisher's exact test even for large counts, especially in 2 × 2 tables. However, the p-values from the chi-squared test are a good approximation to the p-values from Fisher's exact test when the counts are large.

Like the chi-squared test, Fisher's exact test assumes that the row and column totals are fixed - in our example, the total number of ALL and AML cases and the numbers of females and males in the study are fixed. The test considers all possible tables with these margins. For example, suppose we let the number of female AML patients equal r.

        F       M           Row Sum
ALL     23-r    26-(8-r)    41
AML     r       8-r         8
Total   23      26          49

Because the row and column sums are fixed, all the other numbers are known. The probability of this table occurring by chance can then be computed using the hypergeometric probability:

\[\binom{8}{r}\binom{41}{23-r}/\binom{49}{23}\]

The rationale behind this probability computation is that when we have 49 patients, the number of ways of dividing the patients into a group of 23 and a group of 26 is \(\binom{49}{23}\). Once we choose our two groups, we can choose r of the 8 AML patients to be women in \(\binom{8}{r}\) different ways and 23-r of the 41 ALL patients to be women in \(\binom{41}{23-r}\) different ways. So the numerator is the number of different tables with the same margins that have r female AML patients, and the denominator is the total number of tables with 23 females among the 49 patients. The remarkable thing is that \(\binom{8}{r}\binom{41}{23-r}/\binom{49}{23}=\binom{23}{r}\binom{26}{8-r}/\binom{49}{8}\), so the same probability is reached counting males and females or AML and ALL cases.

To compute the p-value, we compute the probabilities of each of the possible tables. A table is "at least as extreme" as the observed table if its probability is as small as or smaller than that of the observed table. In this case, the tables that are at least as extreme as the observed table (which has r = 3) are the tables with r ≠ 4. The p-value is the probability of observing a table at least as extreme as the observed table, which is the sum of the probabilities of all these extreme tables.
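This enumeration can be written out directly with R's hypergeometric density dhyper() and checked against fisher.test(); a sketch, reusing our tab2 object:

probs <- dhyper(0:8, m = 23, n = 26, k = 8)   # P(r female AML patients), r = 0, ..., 8
obs   <- probs[4]                             # probability of the observed table (r = 3)
sum(probs[probs <= obs])                      # p-value = 0.7065: every table except r = 4

# the equivalent form from the text gives the same probabilities
all.equal(choose(8, 0:8) * choose(41, 23 - (0:8)) / choose(49, 23), probs)

fisher.test(tab2)                             # the same test, computed for us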

The probability distribution used is called the hypergeometric distribution, so the test is sometimes called the HyperG test. The results of the HyperG test in R are shown below, along with the results from the chi-squared test.

Fisher's Exact Test

p-value = 0.7065
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval: 0.2647779, 11.4700534
sample estimates: odds ratio = 1.572593

(Note that fisher.test reports the conditional maximum likelihood estimate of the odds ratio, which differs slightly from the sample odds ratio 100/63 ≈ 1.59 computed earlier.)

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 0.039, df = 1, p-value = 0.8434

Warning: Chi-squared approximation may be incorrect (because of the small expected values for AML).

You can see that the p-values of these two different methods are not that close, although both lead to the conclusion that gender and leukemia type are independent. When all of the expected values in the table are over five, the p-values obtained from the two tests are very close.

There are only nine tables in this case, so it is easy to compute them all. We can see this because the smallest fixed margin is 8: the cells in this row must add up to 8, so the only possible values of r are 0 through 8. If the smallest margin were instead 20 million, as it might be in an RNA-Seq data set, you would have an awful lot of different tables to add up! That is why we use the chi-squared test. The software that we are going to use transitions smoothly between the two tests depending on the size of the smallest margin.