Abstract

The NCBI Sequence Read Archive (SRA) is the primary archive of next-generation sequencing datasets. SRA makes metadata and raw sequencing data available to the research community to encourage reproducibility and to provide avenues for testing novel hypotheses on publicly available data. However, methods to programmatically access this data are limited. We introduce the Python package, pysradb, which provides a collection of command line methods to query and download metadata and data from SRA, utilizing the curated metadata database available through the SRAdb project. We demonstrate the utility of pysradb on multiple use cases for searching and downloading SRA datasets. It is available freely at https://github.com/saketkc/pysradb.

Highlights

  • Several projects have made efforts to analyze and publish summaries of DNA-seq[1] and RNA-seq[2,3] datasets

  • Obtaining metadata and raw data from the NCBI Sequence Read Archive (SRA)[4] is often the first step towards reanalyzing public next-generation sequencing datasets in order to compare them to private data or test a novel hypothesis

  • The NCBI SRA toolkit[5] provides utility methods to download raw sequencing data, while the metadata can be obtained by querying the website or through the Entrez efetch command line utility[6]

Summary

Introduction

Several projects have made efforts to analyze and publish summaries of DNA-seq[1] and RNA-seq[2,3] datasets. SRAdb tracks the five main data objects in SRA’s metadata: submission, study, sample, experiment, and run. These are mapped to five relational database tables that are made available in an SQLite file. The pysradb package[10] builds upon the principles of SRAdb, providing a simple and user-friendly command-line interface for querying metadata and downloading datasets from SRA. It also provides utility functions that help a user perform more granular queries, which are often required when dealing with multiple datasets on a large scale. By enabling both metadata search and download operations from the command line, pysradb aims to bridge the gap between retrieving public sequencing datasets and retrieving their associated metadata. To simplify installation for the end user, it is available through PyPI and bioconda[12].
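To make the five-table layout concrete, the sketch below builds a toy in-memory SQLite database with one table per SRA data object and joins them to list the runs belonging to a study. It is an illustration only: the column names are a simplified assumption rather than the actual SRAdb schema, all accessions are made up, and the helper names `build_toy_db` and `runs_for_study` are hypothetical and not part of pysradb.

```python
import sqlite3

# Simplified, hypothetical schema: one table per SRA data object.
# The real SRAdb SQLite file has many more columns per table.
SCHEMA = """
CREATE TABLE submission (submission_accession TEXT PRIMARY KEY);
CREATE TABLE study (
    study_accession TEXT PRIMARY KEY,
    submission_accession TEXT,
    study_title TEXT
);
CREATE TABLE sample (sample_accession TEXT PRIMARY KEY, organism TEXT);
CREATE TABLE experiment (
    experiment_accession TEXT PRIMARY KEY,
    study_accession TEXT,
    sample_accession TEXT,
    library_strategy TEXT
);
CREATE TABLE run (run_accession TEXT PRIMARY KEY, experiment_accession TEXT);
"""


def build_toy_db():
    """Create an in-memory database filled with made-up accessions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO submission VALUES ('SRA000001')")
    conn.executemany(
        "INSERT INTO study VALUES (?, ?, ?)",
        [
            ("SRP000001", "SRA000001", "Toy RNA-seq study"),
            ("SRP000002", "SRA000001", "Toy DNA-seq study"),
        ],
    )
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?)",
        [
            ("SRS000001", "Homo sapiens"),
            ("SRS000002", "Mus musculus"),
        ],
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?, ?)",
        [
            ("SRX000001", "SRP000001", "SRS000001", "RNA-Seq"),
            ("SRX000002", "SRP000001", "SRS000002", "RNA-Seq"),
            ("SRX000003", "SRP000002", "SRS000001", "WGS"),
        ],
    )
    conn.executemany(
        "INSERT INTO run VALUES (?, ?)",
        [
            ("SRR000001", "SRX000001"),
            ("SRR000002", "SRX000002"),
            ("SRR000003", "SRX000003"),
        ],
    )
    conn.commit()
    return conn


def runs_for_study(conn, study_accession):
    """Return (run, experiment, organism, strategy) rows for one study."""
    query = """
        SELECT run.run_accession,
               experiment.experiment_accession,
               sample.organism,
               experiment.library_strategy
        FROM run
        JOIN experiment
          ON run.experiment_accession = experiment.experiment_accession
        JOIN sample
          ON experiment.sample_accession = sample.sample_accession
        WHERE experiment.study_accession = ?
        ORDER BY run.run_accession
    """
    return conn.execute(query, (study_accession,)).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    for row in runs_for_study(db, "SRP000001"):
        print(row)
```

A study-to-run lookup like this is the kind of query a metadata tool has to answer before any raw data can be downloaded; the parameterized `?` placeholder keeps the accession out of the SQL string itself.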

Methods
