Abstract

For the first several hundred years of research in cellular biology, the main bottleneck to scientific progress was data collection. Our newfound data-richness, however, has shifted this bottleneck from collection to analysis [1]. While a variety of options exists for examining any one experimental dataset, we are still discovering what new biological questions can be answered by mining thousands of genomic datasets in tandem, potentially spanning different molecular activities, technological platforms, and model organisms. As an analogy, consider the difference between searching one document for a keyword and executing an online search. While the two tasks are conceptually similar, they require vastly different underlying methodologies, and they have correspondingly large differences in their potential for knowledge discovery.

Large-scale genomic data mining is thus the process of using many (potentially diverse) datasets, often from public repositories, to address a specific biological question. Statistical meta-analyses are an excellent example, in which many experimental results are examined together to lend statistical power to a hypothesis test (e.g., for differential expression) [2], [3]. As the amount of available genomic data grows, however, exploratory methods that allow hypothesis generation are also becoming more prevalent. The ArrayExpress Gene Expression Atlas, for example, allows users to examine hundreds of experimental factors across thousands of independent experimental results [4]. In most cases, though, an investigator with a specific question in mind must collect relevant data to bring to bear on that question. Some examples might be: If you have obtained a gene set of interest, in which tissues or cell lines are its members coexpressed? If you assay a particular cellular environment, are there other experimental conditions that elicit a similar genomic response? If you have high-specificity, low-throughput data for a few genes, with which other genes do they interact or coexpress in high-throughput data repositories, and under what experimental conditions or in which tissues?

Bringing large quantities of genomic data to bear on such questions involves three main tasks: establishing methodology for efficiently querying large data collections; assembling data from appropriate repositories; and integrating information from a variety of experimental data types. Since the technical [5]–[7] and methodological [8]–[10] challenges of heterogeneous data integration have been discussed elsewhere, this introduction focuses mainly on the first two points. As discussed below, the computational requirements for processing thousands of whole-genome datasets in a reasonable amount of time must be addressed, either algorithmically or with cloud or distributed computing [11], [12]. Data collection, the second task, is sometimes easy: as is increasingly the case for high-throughput sequencing, individual experiments can themselves be the sources of large data repositories. In other cases, a biological investigation benefits from the inclusion of substantial external or public data.
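
To make the meta-analysis idea concrete, here is a minimal sketch of pooling differential-expression evidence across independent experiments with Fisher's method. It assumes SciPy is available; the per-dataset p-values are hypothetical and stand in for results retrieved from a public repository.

    from scipy.stats import combine_pvalues

    # Hypothetical p-values for one gene's differential expression, each from
    # an independent experiment in a public repository.
    per_dataset_pvalues = [0.04, 0.11, 0.008, 0.20, 0.03]

    # Fisher's method pools the evidence as -2 * sum(ln p_i), which follows a
    # chi-squared distribution with 2k degrees of freedom under the null
    # hypothesis that the gene is not differentially expressed in any dataset.
    statistic, combined_p = combine_pvalues(per_dataset_pvalues, method="fisher")

    print(f"chi-squared statistic: {statistic:.2f}")
    print(f"combined p-value over {len(per_dataset_pvalues)} datasets: {combined_p:.3g}")

No single dataset in this toy example is decisive on its own, but the pooled test can be; this is the sense in which meta-analysis lends statistical power to a hypothesis test.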

Highlights

  • For the first several hundred years of research in cellular biology, the main bottleneck to scientific progress was data collection

  • Integrated results and data portals are available for many model organisms, including HEFalMp [16], Endeavour [17], and Prioritizer [18] for human data, integrated within-species [19] and across-species [20] results for Caenorhabditis elegans, bioPIXIE [21] and SPELL [22] for Saccharomyces cerevisiae, and a variety of tools for other systems [23]–[25]

  • If you’re interested in identifying potential targets of yeast cell cycle kinases under a variety of culture growth conditions, even a relatively complex large-scale computational screen will likely be simpler than running new corresponding high-throughput assays. Step 1: by examining the S. cerevisiae Gene Ontology (GO) [34] annotations at the Saccharomyces Genome Database [35], we find that the intersection between the cell cycle process (669 genes) and the protein kinase activity function (135 genes, both terms downloadable at AmiGO [36]) yields a list of 51 genes (see the sketch after this list)
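
A minimal sketch of that first step, assuming the two GO term gene lists have already been exported from AmiGO as plain-text files with one gene name per line (the file names below are hypothetical, labeled with the usual GO identifiers for cell cycle and protein kinase activity):

    # Intersect two GO-derived gene lists to find candidate cell cycle kinases.
    # The file names and one-gene-per-line format are assumptions about the
    # AmiGO export, not a prescribed interface.

    def read_gene_list(path):
        """Return the set of gene names listed one per line in a text file."""
        with open(path) as handle:
            return {line.strip() for line in handle if line.strip()}

    cell_cycle = read_gene_list("go_0007049_cell_cycle.txt")            # hypothetical export
    kinase_activity = read_gene_list("go_0004672_protein_kinase.txt")   # hypothetical export

    candidates = sorted(cell_cycle & kinase_activity)
    print(f"{len(candidates)} genes annotated to both terms")

With current annotations the exact counts may differ from the 669, 135, and 51 quoted above, but the screen itself amounts to a simple set intersection over downloaded annotation files.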


Summary

Introduction

For the first several hundred years of research in cellular biology, the main bottleneck to scientific progress was data collection. While a variety of options exists for examining any one experimental dataset, we are still discovering what new biological questions can be answered by mining thousands of genomic datasets in tandem, potentially spanning different molecular activities, technological platforms, and model organisms. Large-scale genomic data mining is the process of using many (potentially diverse) datasets, often from public repositories, to address a specific biological question. In most cases, though, an investigator with a specific question in mind must collect relevant data to bring to bear on that question. Bringing large quantities of genomic data to bear on such questions involves three main tasks: establishing methodology for efficiently querying large data collections; assembling data from appropriate repositories; and integrating information from a variety of experimental data types. A biological investigation might also benefit from the inclusion of substantial external or public data.

Methods and Pitfalls in Manipulating Genomic Data
Genomic Data Resources
Coordinated activity and regulatory hubs
Other Genomic Data Types and Sources
