Lesson 8: Tables and Count Data
Count data are common in many contexts, such as:
- read counts in sequencing studies,
- the number of genes in selected GO categories or pathways,
- the number of genes expressed in certain conditions,
- the number of bound sites,
- the number of individuals with a genetic variant,
- and others.
In this lesson we will discuss using tables to show associations between count data and conditions, 'treatments', or other variables. We will look at some common effects in tables that must be taken into account in a statistical analysis. Most importantly, we will look at tests of whether row or column counts are proportional, specifically the chi-squared test and Fisher's exact test, which is also known as the hypergeometric test. These tests are used for a variety of purposes in bioinformatics, including sequencing studies and gene set enrichment studies.
Here's a typical two-way table of counts from a leukemia study. The table below shows the number of ALL and AML (two types of leukemia) samples from each hospital in the study.
Tabulated data are a way of showing relationships among counts and one or more variables. In this case, there are two factors, type of cancer and hospital in which the sample was collected. We can immediately see that there is confounding - most of the hospitals submitted only one cancer type. If we are interested in determining differences between the cancer types, no matter whether we are looking at genomics or phenotype, we cannot be sure that the differences are due to the cancer biology. The observed differences could be due to the sample handling by the different hospitals or different patient populations rather than because of biology. The table makes this immediately clear.
Confounding in two-way tables can lead to an interesting paradox called Simpson's paradox. Simpson's paradox occurs when there is a confounding variable that has not been included in the analysis. A similar paradox can also happen with intensity data, but the paradoxical nature of the outcome is less readily understood in that case.
Let's take a look at an example of Simpson's paradox taken from Wikipedia. The table is from a medical study that compared the success rates of two treatments for kidney stones. It shows each success rate along with the counts (successes/cases treated) on which it is based.
| | Treatment A | Treatment B |
|---|---|---|
| Small Stones | 93% (81/87) | 87% (234/270) |
| Large Stones | 73% (192/263) | 69% (55/80) |
| Both | 78% (273/350) | 83% (289/350) |
In the table above you can see that Treatment A is better for both small and large kidney stones. However, if stone size is not taken into account (the row labelled "Both"), then Treatment B is better overall.
Why does this happen? It has nothing to do with the number of stones in the sample; it is due to the confounding between stone size and treatment. Notice that the overall success rate for small stones is 88% ([81+234]/[87+270]) while for large stones it is only 72% ([192+55]/[263+80]). Small stones are much easier to treat, and most of the patients who got Treatment B had small stones (270 of 350), while most of the patients who got Treatment A had large stones (263 of 350). Because Treatment B was applied to a larger set of more readily cured patients, it comes out ahead overall.
In this example, size is a "lurking variable". If it had not been measured, we would have assumed that treatment B was the better treatment.
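The arithmetic behind the paradox can be checked directly. The following is a minimal Python sketch that uses only the counts from the table above; the names `counts`, `pooled`, and `better_treatment` are illustrative choices, not part of the original study.

```python
from fractions import Fraction

# (successes, cases treated) taken from the kidney stone table in the text
counts = {
    "Small Stones": {"A": (81, 87), "B": (234, 270)},
    "Large Stones": {"A": (192, 263), "B": (55, 80)},
}

def pooled(treatment):
    """Add up successes and cases over both stone sizes for one treatment."""
    successes = sum(counts[size][treatment][0] for size in counts)
    cases = sum(counts[size][treatment][1] for size in counts)
    return successes, cases

def better_treatment(a, b):
    """Return the label of the treatment with the higher success rate."""
    return "A" if Fraction(*a) > Fraction(*b) else "B"

# Comparison within each stone size, then ignoring stone size
within = {size: better_treatment(c["A"], c["B"]) for size, c in counts.items()}
overall = better_treatment(pooled("A"), pooled("B"))

# Share of each treatment's patients who had the easier-to-treat small stones
small_share = {
    t: Fraction(counts["Small Stones"][t][1], pooled(t)[1]) for t in ("A", "B")
}

for size, winner in within.items():
    print(f"{size}: treatment {winner} has the higher success rate")
print(f"Ignoring stone size: treatment {overall} has the higher success rate")
for t, share in small_share.items():
    print(f"Treatment {t}: {float(share):.0%} of patients had small stones")
```

The pooled comparison reverses because the two treatments were given to very different mixes of small and large stones, which is what the `small_share` values make visible.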
Of course, when you see an unexpected outcome, it is often tempting to assume that there is a lurking variable that provides an explanation. In a court case alleging gender bias in pay scales, the defendant successfully claimed that education level was a lurking variable - women were paid less by the defendant because they had less education and therefore were in lower ranked job categories. After the trial, it was shown that for the defendant's claim to be true, the male employees would need to have, on average, 12 years of additional education compared to the women, which is highly unlikely. Unfortunately for the women involved (or fortunately for the defendant), the decision was not overturned.
Some Other Problems with Tabulated Data
There are other kinds of problems that happen with count data in tables. The table below, also from the leukemia study, has the problem that led to the Challenger disaster!
If you recall, the Challenger space shuttle was due to be launched on a day with temperatures quite a bit lower than for any previous launch. Prior to the launch, an analysis was done to see whether or not this was a safe temperature for the launch. The problem that had been identified was that the O-rings sealing the joints of the solid rocket boosters were thought to fail at low temperatures. However, the data analysts (none of them statisticians, by the way) did not find any relationship between air temperature and O-ring failure. Unfortunately, they were wrong. The O-rings on the Challenger failed, and all aboard the craft perished. Even more tragically, the trend was apparent in the data that were available before the flight. Had the data been handled properly, the launch would have been delayed.
The problem with the Challenger data is also evident in the table below. Do you see it?
We see that ALL is more prevalent than AML and that there are about equal numbers of males and females for each cancer type. However, you probably weren't paying close attention to the numbers of samples in the hospital breakdown of these data. There are 23 samples not listed in this table because gender was not recorded! This is what happened in the Challenger disaster. The data analysts used only the data from the flights in which there was at least one O-ring failure. They did not include in the data table the launches in which there were no O-ring failures, all of which were high temperature launches. Had they included all of the launches, they would have noticed that O-ring failures become more frequent as the temperature drops.
From the table above, we cannot determine if ALL is the more prevalent cancer. It might just be that ALL was more prevalent in the samples for which gender was recorded. How could this happen? We have already seen that the cancer type was confounded with hospital. Perhaps some hospitals neglected to record gender when they submitted the samples.
Are these problems with statistics? Or with common sense? When people see numbers, common sense often goes out the window! In any case, as we saw with the Challenger example, we cannot just ignore missing information - we need to think about what is missing and why.
The table below shows all of the data. ALL is still more common than AML. However, it is notable that more of the AML data were submitted without gender. This could readily be explained if we had the 3-way table that includes gender and hospital, but even from the 2 tables given, it is apparent that hospital CALG did not record gender.
Often there are numerous covariates recorded along with the data, and we may not want or need to use them all in the analysis. Nevertheless, it is always important to see all of the data before you start throwing things out, so that you have a good sense of what you have. For example, with sequencing data an important variable that is often ignored is the number of reads that could not be mapped to a unique feature. In many samples this number is small (although this may not be true for data collected when the technology was new and less accurate). However, in some samples, such as cancer samples, we might expect a fairly large percentage of unusual fragments that may not map to a reference. If we use only the mapped reads, we might not notice the differences among tumors or between tumor and normal cells.
Statisticians differentiate between data that are missing completely at random, data that are missing at random, and informative missingness. If data are missing completely at random, the only effect of the missing data on the statistical analysis is a reduced sample size. If the missing data are informative, then there is a lurking variable and the missing data may have a huge effect on the analysis. Missing at random (as opposed to completely at random) means that the pattern of missing data can be attributed to variables that are observed (such as the hospital which sent the sample), which means that the effects on the analysis can be mitigated. The missing Challenger O-ring data were clearly informative - the data were missing when there were no failures, and this happened for high temperature launches.
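The difference between the first and last of these cases can be illustrated with a small simulation. This is only a sketch on simulated data: the event rate of 0.3, the drop probabilities, and the names `observed_rate`, `keep_mcar`, and `keep_informative` are all assumptions made for illustration and do not come from the leukemia or Challenger data.

```python
import random

def observed_rate(values, keep):
    """Proportion of 1s among the records that were not dropped."""
    kept = [v for v, k in zip(values, keep) if k]
    return sum(kept) / len(kept)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 100_000
true_rate = 0.3          # hypothetical probability that an event is recorded as 1

# Simulated 0/1 outcomes (1 = event occurred)
values = [1 if rng.random() < true_rate else 0 for _ in range(n)]

# Missing completely at random: every record has the same 40% chance of being lost
keep_mcar = [rng.random() >= 0.4 for _ in range(n)]

# Informative missingness: records without an event (0) are lost far more often
# than records with an event (1), so being missing depends on the outcome itself
keep_informative = [rng.random() >= (0.1 if v == 1 else 0.7) for v in values]

full = sum(values) / n
mcar = observed_rate(values, keep_mcar)
informative = observed_rate(values, keep_informative)

print(f"all records:                    {full:.3f}")
print(f"missing completely at random:   {mcar:.3f}")
print(f"informative missingness:        {informative:.3f}")
```

With these assumed settings, the estimate from the completely-at-random case should stay close to the full-data rate while using fewer records, whereas the informative case overstates the event rate because records without an event are preferentially lost.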
Even if we cannot deal statistically with missing data, we should at least note that it is missing so that we can take this into account when we interpret the results. There is a difference between coming up with a p-value and coming up with an explanation for the results. To fully understand our data, we need to understand both what has been recorded and what would have been recorded under ideal conditions.