Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive.

Matthew N Bernstein, Ariella Gladstein, Allissa Dillman, Emily Clough, Ben Busby, Khun Zaw Latt

doi:10.12688/f1000research.23180.1

Abstract

The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hampered by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA and thereby facilitate secondary analyses of the SRA's human RNA-seq data. The first tool, the Case-Control Finder, finds suitable case and control samples for a given disease or condition, with cases and controls matched by tissue or cell type. The second tool, the Series Finder, finds ordered sets of samples for addressing biological questions about changes over a numerical property such as time. These tools were developed during a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.
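
The matching idea behind the Case-Control Finder can be illustrated with a minimal Python sketch. The sample records, the field names (`tissue`, `diseases`), and the function `match_cases_controls` are hypothetical; they are not the tool's actual code or the MetaSRA schema. The sketch also assumes that a control is simply a sample with no disease annotation.

```python
from collections import defaultdict

# Hypothetical records; real MetaSRA annotations use ontology term IDs.
SAMPLES = [
    {"id": "S1", "tissue": "liver", "diseases": {"hepatitis"}},
    {"id": "S2", "tissue": "liver", "diseases": set()},
    {"id": "S3", "tissue": "blood", "diseases": {"hepatitis"}},
    {"id": "S4", "tissue": "lung", "diseases": set()},
    {"id": "S5", "tissue": "liver", "diseases": {"asthma"}},
]


def match_cases_controls(samples, condition):
    """Group sample IDs by tissue and keep tissues that have both groups.

    Assumption: a case is annotated with `condition`; a control has no
    disease annotation at all. Samples with other diseases are skipped.
    """
    by_tissue = defaultdict(lambda: {"case": [], "control": []})
    for sample in samples:
        if condition in sample["diseases"]:
            by_tissue[sample["tissue"]]["case"].append(sample["id"])
        elif not sample["diseases"]:
            by_tissue[sample["tissue"]]["control"].append(sample["id"])
    # Drop tissues that lack either cases or controls.
    return {
        tissue: groups
        for tissue, groups in by_tissue.items()
        if groups["case"] and groups["control"]
    }


matched = match_cases_controls(SAMPLES, "hepatitis")
```

In this toy input only liver has both a hepatitis case and an unannotated control, so blood and lung are dropped from the result.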

Highlights

The Sequence Read Archive (SRA; Leinonen et al, 2011) is a large public repository that stores next-generation sequencing data from thousands of diverse scientific investigations.
Reuse and re-analysis of SRA data have been hampered by the heterogeneity and poor quality of the metadata that describe its biological samples (Gonçalves & Musen, 2019).
The MetaSRA cannot search for samples associated with a particular condition and/or tissue type that are ordered according to a numeric property.

Summary

Introduction

The Sequence Read Archive (SRA; Leinonen et al, 2011) is a large public repository that stores next-generation sequencing data from thousands of diverse scientific investigations. Reuse and re-analysis of SRA data have been hampered by the heterogeneity and poor quality of the metadata that describe its biological samples (Gonçalves & Musen, 2019). The MetaSRA project (Bernstein et al, 2017) standardized these metadata by annotating each sample with terms from biomedical ontologies, including the Cell Ontology (Bard et al, 2005), Uberon (Mungall et al, 2012), the Disease Ontology (Schriml et al, 2019), Cellosaurus (Bairoch, 2018), and the Experimental Factor Ontology (Malone et al, 2010). However, the MetaSRA web interface cannot produce structured datasets, such as those that match case samples associated with a target condition or disease with healthy control samples. The MetaSRA also cannot search for samples associated with a particular condition and/or tissue type that are ordered according to a numeric property (e.g., age).
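
The second kind of query, ordering samples by a numeric property such as age, might look like the following Python sketch. The records, the `properties` field, and the function `find_series` are illustrative assumptions, not the Series Finder's implementation or the MetaSRA data model.

```python
# Hypothetical records: each sample has a tissue label and numeric properties.
AGE_SAMPLES = [
    {"id": "A", "tissue": "brain", "properties": {"age": 40.0}},
    {"id": "B", "tissue": "brain", "properties": {"age": 12.0}},
    {"id": "C", "tissue": "brain", "properties": {}},  # no age recorded
    {"id": "D", "tissue": "liver", "properties": {"age": 5.0}},
    {"id": "E", "tissue": "brain", "properties": {"age": 25.0}},
]


def find_series(samples, tissue, prop):
    """Return (sample ID, value) pairs for one tissue, ordered by `prop`.

    Samples that lack the numeric property are left out of the series.
    """
    selected = [
        s for s in samples
        if s["tissue"] == tissue and prop in s["properties"]
    ]
    selected.sort(key=lambda s: s["properties"][prop])
    return [(s["id"], s["properties"][prop]) for s in selected]


series = find_series(AGE_SAMPLES, "brain", "age")
```

Sample C is excluded because it has no age value, and sample D because it comes from a different tissue; the remaining brain samples are returned in order of increasing age.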

Journal: F1000Research	Publication Date: May 19, 2020
Citations: 3	License type: CC BY 4.0
