STAT 555
Published on STAT 555 (https://onlinecourses.science.psu.edu/stat555)


Lesson 8: Tables and Count Data

 

Key Learning Goals for this Lesson:

  • Learn to set up count data in tabular form
  • Understand Simpson's Paradox and the effects of "lurking variables"
  • Understand the concept of confounding
  • Understand the impact of leaving missing data out of tables
  • Understand the idea of dependence in 2-way tables
  • Become familiar with tests of independence of rows and columns (Fisher's Exact Test (Hypergeometric Test) and the Chi-squared test)

Introduction

Count data are common in many contexts such as

  • read counts in sequencing studies
  • the number of genes in selected GO categories or pathways,
  • the number of genes expressed in certain conditions,
  • the number of bound sites,
  • the number of individuals with a genetic variant,
  • and others

In this lesson we will discuss using tables as a way of showing associations between count data and certain conditions, 'treatments', or other variables. We will look at some of the common effects found in tables that we need to take into account when doing statistical analysis. Most importantly, we will examine tests of whether row or column counts are proportional, specifically the chi-squared test and Fisher's exact test, also known as the Hypergeometric test. These tests are used for a variety of purposes in bioinformatics, including sequencing studies and gene set enrichment studies.

Here's a typical two-way table of counts from a leukemia study. The table below shows the number of ALL and AML (two types of leukemia) samples from each hospital in the study.

        CALG  CCG  DFCI  St-Jude
ALL        0    0    44        3
AML       15    5     0        5

Tabulated data are a way of showing relationships among counts and one or more variables. In this case, there are two factors, type of cancer and hospital in which the sample was collected.  We can immediately see that there is confounding - most of the hospitals submitted only one cancer type. If we are interested in determining differences between the cancer types, no matter whether we are looking at genomics or phenotype, we cannot be sure that the differences are due to the cancer biology.  The observed differences could be due to the sample handling by the different hospitals or different patient populations rather than because of biology. The table makes this immediately clear.
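Two-way tables like this are easy to build from per-sample records. Here is a minimal Python sketch (the course's own examples use R; the per-sample records below are reconstructed from the counts in the table above):

```python
from collections import Counter

# Hypothetical per-sample records (cancer type, hospital), reconstructed to
# match the table above: 44 ALL from DFCI, 15 AML from CALG, and so on.
samples = (
    [("ALL", "DFCI")] * 44 + [("ALL", "St-Jude")] * 3 +
    [("AML", "CALG")] * 15 + [("AML", "CCG")] * 5 + [("AML", "St-Jude")] * 5
)

counts = Counter(samples)                       # tallies each (type, hospital) pair
rows = ["ALL", "AML"]
cols = ["CALG", "CCG", "DFCI", "St-Jude"]

# Tabulate into a 2-way table: rows = cancer type, columns = hospital.
table = [[counts[(r, c)] for c in cols] for r in rows]
for r, row in zip(rows, table):
    print(r, row)
```

The zeros in the ALL row for CALG and CCG make the confounding with hospital visible at a glance.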

Simpson's Paradox

Confounding in two-way tables can lead to an interesting paradox called Simpson's paradox.  Simpson's paradox occurs when there is a confounding variable that has not been included in the analysis. A similar paradox can also happen with intensity data, but the paradoxical nature of the outcome is less readily understood in that case.

Let's take a look at an example of Simpson's paradox taken from Wikipedia. The table below, from a medical study comparing the success rates of two treatments for kidney stones, shows the success rates and the numbers of kidney stones on which they are based.

              Treatment A    Treatment B
Small Stones  93% (81/87)    87% (234/270)
Large Stones  73% (192/263)  69% (55/80)
Both          78% (273/350)  83% (289/350)

In the table above you can see that Treatment A is better both for small and for large kidney stones. However, if stone size is not taken into account (row labelled "both"), then Treatment B is better overall.

Why does this happen? It has nothing to do with the number of stones in the sample; it is due to the confounding between stone size and treatment. Notice that the overall success rate for small stones is 88% ([81+234]/[87+270]) while for large stones it is only 72% ([192+55]/[263+80]). It turns out that the small stones are much easier to treat, and most of the patients who got Treatment B had small stones, while most of the patients who got Treatment A had large stones. Because Treatment B was applied to a larger set of more readily cured patients, it comes out ahead overall.

In this example, size is a "lurking variable".  If it had not been measured, we would have assumed that treatment B was the better treatment.  
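The reversal is easy to reproduce numerically. This short Python sketch simply re-derives the percentages from the counts in the kidney-stone table above:

```python
# Kidney-stone data from the table above: (successes, total) per group.
data = {
    ("A", "small"): (81, 87),   ("B", "small"): (234, 270),
    ("A", "large"): (192, 263), ("B", "large"): (55, 80),
}

def rate(successes, total):
    return successes / total

# Within each stratum, Treatment A has the higher success rate...
for size in ["small", "large"]:
    a = rate(*data[("A", size)])
    b = rate(*data[("B", size)])
    print(size, round(a, 2), round(b, 2))

# ...but pooling over stone size reverses the comparison, because
# Treatment A was mostly given to the harder-to-treat large stones.
pooled = {}
for t in ["A", "B"]:
    s = sum(data[(t, size)][0] for size in ["small", "large"])
    n = sum(data[(t, size)][1] for size in ["small", "large"])
    pooled[t] = s / n
print(round(pooled["A"], 2), round(pooled["B"], 2))  # 0.78 vs 0.83
```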

Of course, when you see an unexpected outcome, it is often tempting to assume that there is a lurking variable that provides an explanation. In a court case alleging gender bias in pay scales, the defendant successfully claimed that education level was a lurking variable - women were paid less by the defendant because they had less education and therefore were in lower ranked job categories. After the trial, it was shown that in order for the defendant's claim to be true, the male employees would need to have, on average, 12 years of additional education compared to the women, which is highly unlikely. Unfortunately for the women involved (or fortunately for the defendant), the decision was not overturned.

 

Some Other Problems with Tabulated Data

There are other kinds of problems that occur with count data in tables. The table below, also from the leukemia study, has the problem that led to the Challenger disaster!

If you recall, the Challenger space shuttle was due to be launched on a day with temperatures quite a bit lower than any previous launch.  Prior to the launch, analysis was done to see whether or not this was a safe temperature for the launch. The problem that had been identified was that the O-rings sealing certain critical valves were thought to fail at low temperatures. However, the data analysts (none of them statisticians, by the way) did not find any relationship between air temperature and O-ring failure.  Unfortunately, they were wrong.  The O-rings on the Challenger failed, and all aboard the craft perished.  Even more tragically, the trend was apparent in the data that was available before the flight.  Had the data been handled properly, the launch would have been delayed.

The problem with the Challenger data is also evident in the table below.  Do you see it?

        F   M  Row Sum
ALL    20  21       41
AML     3   5        8
Total  23  26       49

We see ALL is more prevalent than AML and that there are about equal numbers of males and females for each cancer type. However, you probably weren't paying close attention to the numbers of samples in the hospital breakdown of this data. There are 23 samples not listed in this table because gender was not recorded! This is what happened in the Challenger disaster. The data analysts used only the data from the flights in which there was at least one O-ring failure. They did not include in the data table the number of launches in which there were no O-ring failures, all of which were high temperature launches. Had they included all of the launches, they would have noticed that O-ring failure increases with lower temperature.

From the table above, we cannot determine if ALL is the more prevalent cancer.  It might just be that ALL was more prevalent in the samples for which gender was recorded.  How could this happen?  We have already seen that the cancer type was confounded with hospital.  Perhaps some hospitals neglected to record gender when they submitted the samples.

Are these problems with statistics? Or with common sense?  When people see numbers common sense often goes out the window!  In any case, as we saw with the Challenger example, we cannot just ignore missing information - we need to think about what is missing and why.

The table below shows all of the data.  ALL is still more common than AML.  However, it is notable that more of the AML data were submitted without gender. This could readily be explained if we had the 3-way table that includes gender and hospital, but even from the 2 tables given, it is apparent that hospital CALG did not record gender.

        F   M  NA  Row Sum
ALL    20  21   6       47
AML     3   5  17       25
Total  23  26  23       72

Often there are numerous covariates recorded along with the data and we may not want or need to use them all in the analysis. Nevertheless, it is always important to see all of the data before you start throwing things out, so that you have a good sense of what you have. For example, with sequencing data an important variable that is often ignored is the number of reads that could not be mapped to a unique feature. In many samples this number is small (although this may not be true for data collected when the technology was new and less accurate). However, in some samples, such as cancer samples, we might expect a fairly large percentage of unusual fragments that may not map to a reference. If we use only the mapped reads, we might not notice the differences among tumors or between tumor and normal cells.

Statisticians differentiate between data that are missing completely at random, data that are missing at random, and informative missingness. If data are missing completely at random, the only effect of the missing data on the statistical analysis is a reduced sample size. If the missing data are informative, then there is a lurking variable and they may have a huge effect on the analysis. Missing at random (as opposed to completely at random) means that the pattern of missing data can be attributed to variables that are observed (such as the hospital which sent the sample), which means that the effects on the analysis can be mitigated. The missing Challenger O-ring data was clearly informative - the data was missing when there were no failures, and this happened for high temperature launches.

Even if we cannot deal statistically with missing data, at least we should note that it is missing so that when we interpret the results we can take this into account. There is a difference between coming up with a p-value and then coming up with an explanation for the results. To fully understand our data, we need to understand both what has been recorded, and what would have been recorded under ideal conditions.

8.1 - Statistical Analysis

Count Data as Proportions

Tables are often reduced to proportions. This is helpful in interpreting the tables and for visualization, but for statistical analysis we need the actual counts.

Below we're showing the row proportions - for example, 0.42=20/47.

        F          M          NA         Row Sum
ALL    20 (0.42)  21 (0.45)   6 (0.13)        47
AML     3 (0.12)   5 (0.20)  17 (0.68)        25
Total  23         26         23               72
You can see immediately that the NA are a much bigger proportion of the AML samples compared to the ALL samples. This is obvious from the numbers; however, it is often difficult to compare the raw counts when they are large, but quite simple to compare the proportions.

We also look at the column proportions - e.g. 0.87=20/23.  Here we can see that they are not so different for males and females but they are very different for the NA group.

        F          M          NA         Row Sum
ALL    20 (0.87)  21 (0.81)   6 (0.26)        47
AML     3 (0.13)   5 (0.19)  17 (0.74)        25
Total  23         26         23               72

Finally, we have the cell entries as a proportion of the entire number of samples - e.g. 0.28=20/72.

        F          M          NA         Row Sum
ALL    20 (0.28)  21 (0.29)   6 (0.08)        47
AML     3 (0.04)   5 (0.07)  17 (0.24)        25
Total  23         26         23               72
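All three proportion tables can be computed directly from the counts. A Python sketch (the course's own analyses use R; this simply re-derives the proportions shown above):

```python
counts = [[20, 21, 6],   # ALL: F, M, NA
          [3, 5, 17]]    # AML: F, M, NA

row_sums = [sum(row) for row in counts]          # 47, 25
col_sums = [sum(col) for col in zip(*counts)]    # 23, 26, 23
total = sum(row_sums)                            # 72

# Row proportions: each cell divided by its row sum.
row_prop = [[x / rs for x in row] for row, rs in zip(counts, row_sums)]
# Column proportions: each cell divided by its column sum.
col_prop = [[x / cs for x, cs in zip(row, col_sums)] for row in counts]
# Overall proportions: each cell divided by the grand total.
cell_prop = [[x / total for x in row] for row in counts]

print([[round(p, 2) for p in row] for row in col_prop])
# [[0.87, 0.81, 0.26], [0.13, 0.19, 0.74]]
```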

One of the questions that we often ask in various ways is: are the rows independent of the columns? For example, is a certain cancer type more prevalent in males or females? If the proportions are about the same in males and females, then cancer type is independent of gender; if not, then cancer type is dependent on gender.

        F   M  Row Sum
ALL    20  21       41
AML     3   5        8
Total  23  26       49

One measurement of dependence in tables is called the odds ratio. It can be computed for any 2 rows and 2 columns of a table. The odds ratio is a measure of dependence between the rows and the columns in a 2 × 2 table; in a bigger table, there are multiple choices of the two rows and columns, giving a more complex picture of dependence. In the 4 selected cells, let \(p_1\) be the row proportion of the top left cell, and let \(p_2\) be the row proportion of the bottom left cell. Then the odds ratio O is:

\[O=\frac{p_1}{1-p_1}/ \frac{p_2}{1-p_2}.\]

Alternatively, \(p_1\) can be the column proportion of the top left cell, in which case \(p_2\) should be the column proportion of the top right cell - the odds ratio will be the same. If the rows and columns are exactly independent, the odds ratio will equal 1. However, due to variability in the data, we usually need to assess whether the odds ratio is significantly different from 1.

Here is the odds ratio for this table.

\[\begin{align}
O&=\frac{p_1}{1-p_1}/ \frac{p_2}{1-p_2}\\
&= \frac{20/41}{21/41} / \frac{3/8}{5/8} \quad \text{(row proportions)}\\
&= \frac{20/23}{3/23}/\frac{21/26}{5/26} \quad \text{(column proportions)}\\
&= \frac{20}{3} \times \frac{5}{21}\\
&\approx 1.59\\
\end{align}\]

The odds ratio is about 1.59. What does this mean? Does it mean that gender and cancer type are dependent, since the ratio is not 1.0? Or is 1.59 in the range of typical values that you would expect from natural variability, even when gender and cancer type are independent? The chi-squared test (when the cell counts are large) and Fisher's Exact test are both tests of the null hypothesis of independence of the rows and columns, or equivalently that the population odds ratio is 1.0. (R's fisher.test reports a slightly different estimate, 1.57, because it uses a conditional maximum likelihood estimate rather than the sample odds ratio.)
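The sample odds ratio can be checked in a couple of lines. A Python sketch (note that the sample cross-product ratio is 100/63 ≈ 1.59, while R's fisher.test reports the slightly different conditional estimate 1.57):

```python
# Sample odds ratio for the 2x2 gender-by-type table (columns F, M).
table = [[20, 21],   # ALL
         [3, 5]]     # AML

a, b = table[0]
c, d = table[1]

# Row-proportion form: p1 = a/(a+b), p2 = c/(c+d) ...
p1 = a / (a + b)
p2 = c / (c + d)
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))

# ... which simplifies to the cross-product ratio ad/(bc).
assert abs(odds_ratio - (a * d) / (b * c)) < 1e-12
print(round(odds_ratio, 2))  # 1.59
```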

Tests of Independence in Tabular Data - Chi-squared test

If you took a course in basic statistics you probably learned the chi-squared test. The chi-squared, \(\chi^2\), test of independence tests whether the row proportions depend on the columns, or the columns on the rows. If there is no dependence, the proportions in each column should be about the same. This test is based on the expected proportions in each cell of the table when the rows and columns are independent. These are converted to expected counts in each cell based on the total number of items in the table. Then the deviation between the expected and observed counts is computed. It is important to use the deviation of counts, rather than the deviation of proportions, because the larger the number of items in the cell, the larger the SD of the counts. The chi-squared test is based on whether the observed deviations are bigger than expected by chance.

Observed counts, with expected counts in parentheses:

        F           M           NA          Row Sum  Row Prop.
ALL    20 (15.01)  21 (16.97)   6 (15.01)       47       0.65
AML     3 (7.99)    5 (9.03)   17 (7.99)        25       0.35
Total  23          26          23               72

The best guess would be that 65% of all possible leukemia cases in these hospitals are ALL and 35% are AML. Therefore, if cancer type does not depend on gender, we would expect 65% of the 23 women to have ALL and the remainder to have AML, i.e. \(E(ALL, F)=23 \times 47/72=15.01\). Similarly, 65% of the men are expected to have ALL and 35% to have AML, and finally the same proportions should be in the NA column. These proportions are used to compute the expected values in each cell.

The \(\chi^2\) test statistic formula is then:

\[C^2=\sum_i\sum_j\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

where \(O_{ij}\) is the observed count and \(E_{ij}\) is the expected count in row i and column j. When all of the expected values are larger than five and the column proportions are independent of the rows (or the row proportions independent of the columns), the distribution of \(C^2\) is approximately \(\chi^2\) with (r - 1)(c - 1) d.f., where r is the number of rows and c is the number of columns.

In this case the chi-squared statistic, \(C^2\) = 23.1074 with 2 d.f. and a p-value = 9.6e-06. This is highly significant. But we can also see where the deviations are, and the largest deviations are in the NA column. In this column we do not know the gender.   So this is not the interesting column!
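The statistic can be reproduced with a few lines of Python using only the standard library (for 2 d.f. the chi-squared tail probability has the closed form \(e^{-x/2}\), so no statistics package is needed; the course itself uses R):

```python
import math

observed = [[20, 21, 6],   # ALL: F, M, NA
            [3, 5, 17]]    # AML: F, M, NA

row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
total = sum(row_sums)

# Expected count under independence: (row sum) x (column sum) / total.
expected = [[rs * cs / total for cs in col_sums] for rs in row_sums]

# C^2 = sum over cells of (O - E)^2 / E
c2 = sum((o - e) ** 2 / e
         for orow, erow in zip(observed, expected)
         for o, e in zip(orow, erow))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)(3-1) = 2
# For 2 d.f. the chi-squared survival function is exp(-x/2).
p_value = math.exp(-c2 / 2)
print(round(c2, 4), df, p_value)  # 23.1074, 2, about 9.6e-06
```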

One of the interesting things about a chi-squared statistic is that you can take out a whole row or a whole column and redo the test. When we do this and remove the NA column, we encounter two problems. Firstly, because of the small expected counts for AML, the distribution of the test statistic under the null hypothesis is not chi-squared, so our computed p-values are inaccurate. Secondly, because we want to make inference about the association between disease type and gender, and because we know about 1/3 of the data did not include gender, we have to acknowledge that there might be a problem with our statistical conclusions.

To deal with the problem of small expected counts, a correction to the test called the Yates continuity correction provides more accurate p-values. However, usually when there are some cells with small expected values we use another test, Fisher's exact test, also called the Hypergeometric test. It is explained in the next subsection.

Observed counts, with expected counts in parentheses:

        F           M           Row Sum  Row Prop.
ALL    20 (19.24)  21 (21.76)       41       0.84
AML     3 (3.76)    5 (4.24)         8       0.16
Total  23          26                49
Total 23 26 49

The Pearson's chi-squared test with Yates continuity correction gives:

\(C^2\)= 0.039 with 1 d.f. and  p-value = 0.8434.

This is a very small chi-squared statistic with a very large p-value, so we conclude there is no evidence of a gender effect.
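The corrected statistic is easy to verify in Python (stdlib only; for 1 d.f. the chi-squared tail probability equals \(\mathrm{erfc}(\sqrt{x/2})\)):

```python
import math

observed = [[20, 21],  # ALL
            [3, 5]]    # AML

row_sums = [sum(r) for r in observed]          # 41, 8
col_sums = [sum(c) for c in zip(*observed)]    # 23, 26
total = sum(row_sums)                          # 49

expected = [[rs * cs / total for cs in col_sums] for rs in row_sums]

# Yates continuity correction: shrink each |O - E| by 0.5 before squaring.
c2 = sum((abs(o - e) - 0.5) ** 2 / e
         for orow, erow in zip(observed, expected)
         for o, e in zip(orow, erow))

# For 1 d.f. the chi-squared survival function is erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(c2 / 2))
print(round(c2, 3), round(p_value, 4))  # 0.039, 0.8434
```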

The \(\chi^2\) distribution is a continuous distribution while tabular data are discrete. For example, in our case there are two rows, two columns and 49 patients, so there are only finitely many possible values of \(C^2\). The number of possible values is quite small because the test is actually conditional on the row and column sums. (That means that the only values of the test statistic that are considered come from tables with the same "Total" row and the same "Row Sum" column as the table above.) Because of this, if we put a number between 0 and 49 in any cell of the table, we know all of the other entries by subtraction. That is why we have 1 d.f.

The \(\chi^2\) test is what is called an asymptotic test. The distribution of the test statistic under the null distribution is an approximation which is accurate for very large samples. The Yates continuity correction allows accuracy for slightly smaller samples.  When the d.f. are larger, the rule of thumb that the expected values have to be greater than 5 can be relaxed as long as only a few cells violate the rule.

 

Tests of Independence in Tabular Data - Fisher's Exact Test

If the expected counts are small, an exact p-value can be computed using the Hypergeometric Distribution. This is called Fisher's exact test. It is computationally intensive. Since we have better computing power than Fisher did we often use Fisher's exact test even for large counts, especially in 2 × 2 tables.  However, the p-values from the Chi-squared test are a good approximation to the p-values from Fisher's Exact test when the counts are large.

Like the chi-squared test, Fisher's exact test assumes that the row and column totals are fixed -  to use our example, the total number of ALL and AML cases and the number of females and males in the study are fixed. The test considers all possible tables with these margins.   For example, suppose we let the number of AML female individuals = r.

        F     M          Row Sum
ALL    23-r  26-(8-r)         41
AML    r     8-r               8
Total  23    26               49

Because the row and column sums are fixed, all the other numbers are  known. The probability of this table occurring by chance can then be computed using the Hypergeometric probability:

\[\binom{8}{r}\binom{41}{23-r}/\binom{49}{23}\]

The rationale behind this probability computation is that when we have 49 patients, the number of ways of dividing the patients into a group of 23 (females) and a group of 26 (males) is \(\binom{49}{23}\). Once we choose our two groups, we can choose r of the 8 AML patients to be in the female group in \(\binom{8}{r}\) different ways and 23-r of the 41 ALL patients to be in the female group in \(\binom{41}{23-r}\) different ways. So the numerator is the number of different tables with the same margins that have r female AML patients, and the denominator is the total number of tables with 23 females and 49 patients. The remarkable thing is that \(\binom{8}{r}\binom{41}{23-r}/\binom{49}{23}=\binom{23}{r}\binom{26}{8-r}/\binom{49}{8}\), so that the same probability is reached counting males and females or AML and ALL cases.

To compute the p-value, we compute the probabilities of each of the possible tables. A table is "at least as extreme" as the observed table if its probability is as small or smaller than that of the observed table. In this case, the tables that are at least as extreme as the observed table are those with r ≠ 4. The p-value is the probability of observing a table at least as extreme as the observed table, which is the sum of the probabilities of all the extreme tables.
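With margins this small, the full enumeration fits in a few lines of Python (stdlib only; the course itself uses R's fisher.test):

```python
from math import comb

# Margins from the table above: 41 ALL, 8 AML; 23 F, 26 M; 49 total.
# r = number of female AML patients; the observed table has r = 3.
denom = comb(49, 23)
probs = {r: comb(8, r) * comb(41, 23 - r) / denom for r in range(9)}

assert abs(sum(probs.values()) - 1) < 1e-12   # the nine tables are exhaustive

observed = 3
# Two-sided p-value: sum over tables no more probable than the observed one
# (a tiny tolerance guards against floating-point ties).
p_value = sum(p for p in probs.values() if p <= probs[observed] * (1 + 1e-9))
print(round(p_value, 4))  # 0.7065, agreeing with R's fisher.test
```

Only the r = 4 table is more probable than the observed one, which is why the p-value is the total probability of all tables with r ≠ 4.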

The probability distribution used is called the hypergeometric distribution, so the test is sometimes called the HyperG test. The results of the HyperG test in R are shown below, along with the results from the Chi-squared test.

Fisher's Exact Test

p-value = 0.7065

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:
0.2647779, 11.4700534

sample estimates:
odds ratio = 1.572593

Pearson's Chi-squared test with Yates' continuity correction:

X-squared = 0.039, df = 1, p-value = 0.8434

(R warns that the chi-squared approximation may be incorrect because of the small expected values for AML.)

You can see that the p-values of these two different methods are not that close, although both lead to the conclusion that gender and leukemia type are independent. When all of the expected values in the table are over five, the p-values obtained from the two tests are very close.

There are only nine tables in this case, so it is easy to compute them all. We can see this because the smallest fixed margin is 8. The cells in this row must add up to 8, so the only possible values of r are 0 through 8. If the smallest margin instead was 20 million, as it might be in an RNA-Seq data set, you would have an awful lot of different tables to add up! That is why we use the chi-squared test. The software that we are going to use transitions smoothly between the two tests depending on the size of the smallest margin.

8.2 - Uses of Tabular Data in Bioinformatics

We will be using these types of tests many times throughout this course. We will be using them for

  • gene set enrichment analysis,
  • RNA-seq differential expression analysis, and
  • SNP frequency analysis.

There are some differences in how these tests are used.  However, in all cases, multiple tests are done, so that multiple testing methods must also be used.

In gene set enrichment analysis we look at whether a list of features (usually obtained from a literature search or an upstream analysis such as differential expression analysis) appears to be enriched or depleted compared to some known list (e.g. genes known to be involved in oxidative stress).  One margin of the table is whether the feature was or was not selected from a reference set of features. The other margin is whether the feature is or is not on the known list.  Often the known lists are in some type of nested arrangement - for example, in gene ontology analysis the ontology terms become more and more specific, so the list of features in each set are subsets of each other.  Multiple testing adjustments need to take the nesting into account.

In RNA-seq differential expression analysis we look at how many reads from the library are mapped to a specific feature, and how many are not, for each of our treatments. However, just as in microarray differential expression, to make biological conclusions we need biological replication.  Since the library sizes and proportion of reads from each feature varies from sample to sample, we cannot simply add across the samples to obtain a single table.  We will use a method that can be considered either an extension of t-tests or an extension of Fisher's exact test to handle this situation.

In SNP frequency analysis there is a 2 by 3 table for each locus with a genetic variant. The margin with three entries is the number of minor alleles (0, 1, or 2; alternatively aa, aA, or AA). The other margin is the treatment. In this case, Fisher's exact test or the chi-squared test is appropriate for each table. However, the tables are correlated due to linkage disequilibrium between nearby loci. This means that appropriate adjustments must be made for multiple testing with correlated data.


Source URL: https://onlinecourses.science.psu.edu/stat555/node/10