SRAdb: query and use public next-generation sequencing data from within R

Yuelin Zhu, Robert M Stephens, Sean R Davis, Paul S Meltzer

doi:10.1186/1471-2105-14-19

Abstract

Background: The Sequence Read Archive (SRA) is the largest public repository of sequencing data from next-generation sequencing platforms, including Illumina (Genome Analyzer, HiSeq, MiSeq, etc.), Roche 454 GS System, Applied Biosystems SOLiD System, Helicos Heliscope, PacBio RS, and others.

Results: SRAdb is an attempt to make queries of the metadata associated with SRA submission, study, sample, experiment and run more robust and precise, and to make access to sequencing data in the SRA easier. We have parsed all the SRA metadata into a SQLite database that is routinely updated and can be easily distributed. The SRAdb R/Bioconductor package then uses this SQLite database for querying and accessing metadata. Full-text search makes querying metadata flexible and powerful. Fastq files associated with query results can be downloaded easily for local analysis. The package also includes an interface from R to a popular genome browser, the Integrative Genomics Viewer (IGV).

Conclusions: The SRAdb Bioconductor package provides a convenient and integrated framework to query and access SRA metadata quickly and powerfully from within R.

Highlights

The Sequence Read Archive (SRA) is the largest public repository of sequencing data from next-generation sequencing platforms, including Illumina (Genome Analyzer, HiSeq, MiSeq, etc.), Roche 454 GS System, Applied Biosystems SOLiD System, Helicos Heliscope, PacBio RS, and others.
In the Results and discussion, we give an overview of the functionality of the SRAdb package, starting with installation, querying of SRA metadata, retrieval of SRA data based on query results, and an example of how to control the IGV browser from within R.
We aim to find all combined run and study records in which any field contains both of the words “breast” and “cancer”: rs = getSRA(search_terms = "breast cancer", out_types = c("run", "study"), sra_con = sra_con)
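To make the semantics of such a query concrete, the following is a minimal sketch on a toy in-memory SQLite table, not the SRAdb implementation itself. It assumes the DBI and RSQLite packages are installed and that the SQLite build bundled with RSQLite includes the FTS3 full-text module. The table name sra_toy, its three columns, and the rows are invented for illustration; the real SRA metadata database has many more fields.

```r
library(DBI)

# In-memory database standing in for the SRA metadata SQLite file (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table with a full-text index over all of its columns.
dbExecute(con, "CREATE VIRTUAL TABLE sra_toy USING fts3(run, study_title, sample_attribute)")

# Invented example rows; accessions and titles are placeholders.
dbExecute(con, "INSERT INTO sra_toy (run, study_title, sample_attribute) VALUES
  ('RUN001', 'Breast cancer exome sequencing', 'tissue: tumor'),
  ('RUN002', 'Lung cancer RNA-seq',            'tissue: breast'),
  ('RUN003', 'Mouse liver ChIP-seq',           'tissue: liver')")

# Matching against the table name searches every column; two bare terms
# are combined with an implicit AND, so a row is returned only if both
# words occur somewhere in it (not necessarily in the same field).
hits <- dbGetQuery(con, "SELECT run, study_title FROM sra_toy
                         WHERE sra_toy MATCH 'breast cancer'
                         ORDER BY run")
print(hits)

dbDisconnect(con)
```

In this toy data the first row matches because both words appear in its title, and the second matches because "cancer" appears in the title and "breast" in the sample attribute. This shows why a search over any field can return broader results than a search restricted to one column.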

Summary

Introduction

The Sequence Read Archive (SRA, [1]) is the largest public repository of sequencing data from next-generation sequencing platforms, including Illumina (Genome Analyzer, HiSeq, MiSeq, etc.), Roche 454 GS System, Applied Biosystems SOLiD System, Helicos Heliscope, PacBio RS, and others. It has been set up at NCBI in the United States, EMBL in Europe, and DDBJ in Japan to capture these data in public repositories in much the same spirit as MIAME-compliant microarray databases such as NCBI Gene Expression Omnibus (GEO) and EBI ArrayExpress. As these public data resources continue to grow, so will the opportunities to leverage them for comparison to private data or to generate novel hypotheses. Because of our own need to visualize processed data in bulk, in the context of experimental metadata, the SRAdb package provides functions that let R interact with a powerful and feature-rich genome browser, the Integrative Genomics Viewer (IGV, [4]), for data visualization and exploration.

Objectives

Methods

Results

Conclusion