1.12 - Final Thoughts


The Internet

Although we tend to take the internet for granted these days, the "omics" revolution could not have taken place without it.  The internet has made it possible for scientists to develop and share software and to store massive amounts of data, including the reference sequences, annotation, and scientific journal articles we all use to help us design studies and interpret the outcomes.  What is more, much of the software and information is freely available, including:

  • documentation
  • documentation tools
  • sequence matching tools
  • statistical analysis tools
  • visualization tools
  • tools to organize the tools!

However, just like other information on the internet, the quality of what you can find is uneven.  It is best to use data and software from trusted sites.

In the US, data from any "omics" project funded by federal sources such as the National Science Foundation (NSF), National Institutes of Health (NIH), US Dept. of Agriculture (USDA), etc. should be freely available via the internet. Many of the higher-quality journals also insist that the data be available, either from curated repositories or from the journal website.  However, although the data are available, the quality of the documentation is often very poor.  For example, you might find microarray data with no information about the probes or with missing information about the samples.  Sequencing data may lack information about the total library sizes, and so on.  Even when the data are available and documented, it might not be possible to replicate the analyses from the associated papers because the settings for the mapping software, the preprocessing steps, etc. may not be recorded.  Documentation of what we have done seems to be the very last thing to happen, after the paper is published or the contract is fulfilled.

One of the objectives of this course is to understand what makes an experiment replicable, and how to do our analyses in a way that is replicable as well.

The Challenges Ahead

There are a lot of challenges in working with genomics data. The scientific questions are very deep. We have masses of information, but we still don't know how the bits and pieces fit together.  An analogy might be walking into a hardware store where you have all of the screws, nails, and tools but no instructions and very little information about which pieces should be used together.  There are potentially huge impacts on human life from tinkering with diseases, crops, embryos, and so on.  Both the scientific and ethical issues are enormous and growing, as the tools available to the lab scientists become more and more sophisticated.

One of the challenges for statisticians is trying to keep up with the rapidly changing technologies.  Important distributional characteristics of the data change as the technology changes, and the technology is changing extremely rapidly.  Although our top-level analyses such as clustering and t-tests will continue to be valid, the data cleaning and preprocessing steps change with the technology.  The developers of the technology often provide data analysis tools as well, but these are not always correct or efficient.  In addition, the analysis of the data requires at least some knowledge of the rapidly evolving web-based knowledge repositories, some of which contain self-replicating errors.  And, partly because we are using so many tools, we don't always know where these errors are creeping in.

There is also the p > n problem. In statistics, p is the number of features (genes, exons, SNPs, or whatever it is that you are measuring) and n is the number of samples. It is simply a fact of life that if you have more features than samples, then on your current data you can always achieve perfect classification and perfect prediction, although the resulting classification or prediction rule would likely not work correctly on a new sample.  Even if the data are just noise, when p > n there is some combination of that noise that would, for instance, separate the tumors from the normals.  In some medical studies we now have quite large sample sizes as well, in the tens of thousands of patients.  However, since we often have millions of features such as SNPs, we may still have p > n and even more massive data to deal with.  In the core scientific disciplines, we are lucky if we have tens of samples.  So we cannot rely on statistics alone - we need to bring other information into our analyses.
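To make this concrete, here is a minimal sketch in Python (the sample size, feature count, and the least-squares classifier are illustrative choices, not anything from the course materials): it fits a linear rule to pure noise with p much larger than n, classifies the training data perfectly, and then performs near chance on an independent noise sample.

    # Sketch of the p > n problem: pure noise can be classified perfectly
    # on the data in hand, yet the rule fails on a fresh sample.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 1000                      # 20 samples, 1000 features (p >> n)

    X = rng.normal(size=(n, p))          # noise "expression" matrix, no real signal
    y = np.repeat([-1.0, 1.0], n // 2)   # arbitrary "normal" vs "tumor" labels

    # Least-squares linear rule: with p > n there is a w with Xw = y exactly,
    # so the training data are separated perfectly.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("training accuracy:", np.mean(np.sign(X @ w) == y))            # 1.0

    # The same rule applied to an independent noise sample is near chance (~0.5).
    X_new = rng.normal(size=(n, p))
    y_new = np.repeat([-1.0, 1.0], n // 2)
    print("new-sample accuracy:", np.mean(np.sign(X_new @ w) == y_new))

The point of the sketch is not the particular classifier; any sufficiently flexible rule will fit noise perfectly when p > n, which is why validation on independent samples and outside biological information are so important.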

Another challenge is the sheer amount of data that we need to handle. For example, an RNA-seq sample consists of 25 million or more short sequences, which can amount to 10 GB or more of raw data per sample.  This means that the data are not readily uploaded and downloaded, and are difficult to handle on a laptop or desktop computer. The data are also very noisy, and as we try to understand this noise we find that much of it depends on the measurement technology that is being used.
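A rough back-of-envelope calculation shows how a single sample reaches that size (the read length and per-record overhead below are assumed, illustrative values, not figures from any particular platform):

    # Back-of-envelope FASTQ size estimate for one RNA-seq sample.
    reads = 25_000_000        # short sequences in one sample
    read_length = 100         # bases per read (assumed)
    overhead = 50             # rough bytes per record for the name, '+' line, and newlines (assumed)

    # Each FASTQ record stores the sequence plus a quality string of equal length.
    bytes_per_read = 2 * read_length + overhead
    total_gb = reads * bytes_per_read / 1e9
    print(f"approx. uncompressed size: {total_gb:.1f} GB per end")   # about 6 GB

    # Paired-end sequencing doubles this, so 10 GB or more per sample is easy to reach.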

The most successful researchers have either trained themselves to be very cross-disciplinary, with knowledge of biology, computer science, and statistics, or have assembled cross-disciplinary teams of collaborators.  A barrier to collaboration is the impenetrable jargon coming from biology, computer science, and statistics.  Even scientists in the same discipline may not use the jargon in the same way.  Personally, I deal with this by asking lots and lots of questions.  This often helps open dialogues that greatly improve communication.