Abstract
The NCBI Sequence Read Archive (SRA) is the primary archive of next-generation sequencing datasets. SRA makes metadata and raw sequencing data available to the research community to encourage reproducibility and to provide avenues for testing novel hypotheses on publicly available data. However, methods to programmatically access this data are limited. We introduce the Python package, pysradb, which provides a collection of command line methods to query and download metadata and data from SRA, utilizing the curated metadata database available through the SRAdb project. We demonstrate the utility of pysradb on multiple use cases for searching and downloading SRA datasets. It is available freely at https://github.com/saketkc/pysradb.
Highlights
Several projects have made efforts to analyze and publish summaries of DNA-[1] and RNA-seq[2,3] datasets
Obtaining metadata and raw data from the NCBI Sequence Read Archive (SRA)[4] is often the first step towards reanalyzing public next-generation sequencing datasets in order to compare them to private data or test a novel hypothesis
The NCBI SRA toolkit[5] provides utility methods to download raw sequencing data, while the metadata can be obtained by querying the website or through the Entrez efetch command line utility[6]
Summary
Several projects have made efforts to analyze and publish summaries of DNA-[1] and RNA-seq[2,3] datasets. SRAdb tracks the five main data objects in SRA's metadata: submission, study, sample, experiment and run. These are mapped to five relational database tables that are made available in a SQLite file. The pysradb package[10] builds upon the principles of SRAdb, providing a simple and user-friendly command-line interface for querying metadata and downloading datasets from SRA. It also provides utility functions that help a user perform more granular queries, which are often required when dealing with multiple datasets on a large scale. By enabling both metadata search and download operations at the command line, pysradb aims to bridge the gap in seamlessly retrieving public sequencing datasets and the associated metadata. To simplify installation for the end user, it is available for download through PyPI and bioconda[12].
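The five-table layout described above can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The schema below is a heavily simplified assumption made for illustration (the real SRAdb SQLite file has many more columns, and its exact column names are not given here), the accessions are placeholders, and the helper names `build_demo_db` and `runs_for_study` are hypothetical and not part of pysradb. The sketch only shows how a run can be traced back to its experiment, sample and study through joins.

```python
import sqlite3

# Assumed, simplified schema: one table per SRA data object.
# Column names are illustrative and not taken from the actual SRAdb file.
SCHEMA = """
CREATE TABLE submission (submission_accession TEXT PRIMARY KEY);
CREATE TABLE study (
    study_accession TEXT PRIMARY KEY,
    submission_accession TEXT,
    study_title TEXT
);
CREATE TABLE sample (sample_accession TEXT PRIMARY KEY, organism TEXT);
CREATE TABLE experiment (
    experiment_accession TEXT PRIMARY KEY,
    study_accession TEXT,
    sample_accession TEXT,
    library_strategy TEXT
);
CREATE TABLE run (run_accession TEXT PRIMARY KEY, experiment_accession TEXT);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO submission VALUES ('SRA000001')")
    conn.execute(
        "INSERT INTO study VALUES ('SRP000001', 'SRA000001', 'Demo study')"
    )
    conn.execute("INSERT INTO sample VALUES ('SRS000001', 'Homo sapiens')")
    conn.execute(
        "INSERT INTO experiment VALUES "
        "('SRX000001', 'SRP000001', 'SRS000001', 'RNA-Seq')"
    )
    conn.executemany(
        "INSERT INTO run VALUES (?, ?)",
        [("SRR000001", "SRX000001"), ("SRR000002", "SRX000001")],
    )
    conn.commit()
    return conn


def runs_for_study(conn, study_accession):
    """Return (run, experiment, organism, library strategy) rows for a study."""
    query = """
        SELECT r.run_accession, e.experiment_accession,
               s.organism, e.library_strategy
        FROM run AS r
        JOIN experiment AS e
          ON r.experiment_accession = e.experiment_accession
        JOIN sample AS s
          ON e.sample_accession = s.sample_accession
        WHERE e.study_accession = ?
        ORDER BY r.run_accession
    """
    return conn.execute(query, (study_accession,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in runs_for_study(demo, "SRP000001"):
        print(row)
```

Going from a study accession down to its run accessions in this way is the kind of granular lookup that is typically needed before any raw data can be downloaded.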