Abstract

For the first several hundred years of research in cellular biology, the main bottleneck to scientific progress was data collection. Our newfound data-richness, however, has shifted this bottleneck from collection to analysis [1]. While a variety of options exists for examining any one experimental dataset, we are still discovering what new biological questions can be answered by mining thousands of genomic datasets in tandem, potentially spanning different molecular activities, technological platforms, and model organisms. As an analogy, consider the difference between searching one document for a keyword and executing an online search. While the two tasks are conceptually similar, they require vastly different underlying methodologies, and they have correspondingly large differences in their potential for knowledge discovery.

Large-scale genomic data mining is thus the process of using many (potentially diverse) datasets, often from public repositories, to address a specific biological question. Statistical meta-analyses are an excellent example, in which many experimental results are examined together to lend statistical power to a hypothesis test (e.g., for differential expression) [2], [3]. As the amount of available genomic data grows, however, exploratory methods that allow hypothesis generation are also becoming more prevalent. The ArrayExpress Gene Expression Atlas, for example, allows users to examine hundreds of experimental factors across thousands of independent experimental results [4]. In most cases, though, an investigator with a specific question in mind must collect relevant data to bring to bear on that question. Some examples might be: If you have obtained a gene set of interest, in which tissues or cell lines are its members coexpressed? If you assay a particular cellular environment, are there other experimental conditions that elicit a similar genomic response? If you have high-specificity, low-throughput data for a few genes, with which other genes do they interact or coexpress in high-throughput data repositories, and under what experimental conditions or in which tissues?

Bringing large quantities of genomic data to bear on such questions involves three main tasks: establishing methodology for efficiently querying large data collections; assembling data from appropriate repositories; and integrating information from a variety of experimental data types. Since the technical [5]–[7] and methodological [8]–[10] challenges of heterogeneous data integration have been discussed elsewhere, this introduction focuses mainly on the first two points. As discussed below, the computational requirements for processing thousands of whole-genome datasets in a reasonable amount of time must be addressed, either algorithmically or with cloud or distributed computing [11], [12]. Data collection, the second task, is sometimes easy: as is increasingly the case for high-throughput sequencing, individual experiments can themselves be the sources of large data repositories. In other cases, a biological investigation benefits from the inclusion of substantial external or public data.
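
To make the meta-analysis idea concrete, here is a minimal sketch of pooling differential-expression evidence across independent experiments with Fisher's method. It assumes SciPy is available; the per-dataset p-values are hypothetical and stand in for results retrieved from a public repository.

    from scipy.stats import combine_pvalues

    # Hypothetical p-values for one gene's differential expression, each from
    # an independent experiment in a public repository.
    per_dataset_pvalues = [0.04, 0.11, 0.008, 0.20, 0.03]

    # Fisher's method pools the evidence as -2 * sum(ln p_i), which follows a
    # chi-squared distribution with 2k degrees of freedom under the null
    # hypothesis that the gene is not differentially expressed in any dataset.
    statistic, combined_p = combine_pvalues(per_dataset_pvalues, method="fisher")

    print(f"chi-squared statistic: {statistic:.2f}")
    print(f"combined p-value over {len(per_dataset_pvalues)} datasets: {combined_p:.3g}")

No single dataset in this toy example is decisive on its own, but the pooled test can be; this is the sense in which meta-analysis lends statistical power to a hypothesis test.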

Highlights

  • For the first several hundred years of research in cellular biology, the main bottleneck to scientific progress was data collection

  • Integrated results and data portals are available for many model organisms, including HEFalMp [16], Endeavour [17], and Prioritizer [18] for human data, integrated within-species [19] and across-species [20] results for Caenorhabditis elegans, bioPIXIE [21] and SPELL [22] for Saccharomyces cerevisiae, and a variety of tools for other systems [23]–[25]

  • If you’re interested in identifying potential targets of yeast cell cycle kinases under a variety of culture growth conditions, even a relatively complex large-scale computational screen will likely be simpler than running new corresponding high-throughput assays. Step 1: by examining the S. cerevisiae Gene Ontology (GO) [34] annotations at the Saccharomyces Genome Database [35], we find that the intersection between the cell cycle process (669 genes) and the protein kinase activity function (135 genes, both terms downloadable at AmiGO [36]) yields a list of 51 genes (see the sketch after this list)
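
A minimal sketch of that first step, assuming the two GO term gene lists have already been exported from AmiGO as plain-text files with one gene name per line (the file names below are hypothetical, labeled with the usual GO identifiers for cell cycle and protein kinase activity):

    # Intersect two GO-derived gene lists to find candidate cell cycle kinases.
    # The file names and one-gene-per-line format are assumptions about the
    # AmiGO export, not a prescribed interface.

    def read_gene_list(path):
        """Return the set of gene names listed one per line in a text file."""
        with open(path) as handle:
            return {line.strip() for line in handle if line.strip()}

    cell_cycle = read_gene_list("go_0007049_cell_cycle.txt")            # hypothetical export
    kinase_activity = read_gene_list("go_0004672_protein_kinase.txt")   # hypothetical export

    candidates = sorted(cell_cycle & kinase_activity)
    print(f"{len(candidates)} genes annotated to both terms")

With current annotations the exact counts may differ from the 669, 135, and 51 quoted above, but the screen itself amounts to a simple set intersection over downloaded annotation files.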


Summary

Introduction

For the first several hundred years of research in cellular biology, the main bottleneck to scientific progress was data collection. While a variety of options exists for examining any one experimental dataset, we are still discovering what new biological questions can be answered by mining thousands of genomic datasets in tandem, potentially spanning different molecular activities, technological platforms, and model organisms. Large-scale genomic data mining is the process of using many (potentially diverse) datasets, often from public repositories, to address a specific biological question. In most cases, though, an investigator with a specific question in mind must collect relevant data to bring to bear on that question. Bringing large quantities of genomic data to bear on such questions involves three main tasks: establishing methodology for efficiently querying large data collections; assembling data from appropriate repositories; and integrating information from a variety of experimental data types. A biological investigation might also benefit from the inclusion of substantial external or public data.

Methods and Pitfalls in Manipulating Genomic Data
Genomic Data Resources
Coordinated activity and regulatory hubs
Other Genomic Data Types and Sources
