Experimental Design-Based Functional Mining and Characterization of High-Throughput Sequencing Data in the Sequence Read Archive

Takeru Nakazato, Tazro Ohta, Hidemasa Bono

doi:10.1371/journal.pone.0077910


Open Access

https://doi.org/10.1371/journal.pone.0077910


Journal: PLoS ONE	Publication Date: Oct 22, 2013
Citations: 47	License type: CC BY 4.0

Affiliation: Research Organization of Information and Systems

Abstract

High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data are captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects had been submitted to SRA, double the number of the previous year. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interest in SRA because the data structure is complicated, and experimental conditions along with raw sequences are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since one of the main themes of -omics analyses is clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) extracted from articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs, and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called “Gendoo”. We generated hyperlinks between the diseases extracted from SRA and their feature profiles. The project, publication, and disease lists resulting from this study are available at our web service, called “DBCLS SRA” (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA.
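
The ID extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their matching rules, so the accession pattern below (a source letter S, E, or D, then R, then an object-type letter, then six or more digits) is an assumption, and the function name, PMID, and article text are hypothetical.

```python
import re

# Assumed accession pattern (not taken from the paper): S/E/D for the
# submitting archive, then "R", then an object-type letter, then 6+ digits.
SRA_ID_PATTERN = re.compile(r"\b[SED]R[APXSRZ]\d{6,}\b")


def extract_sra_pmid_pairs(pmid, full_text):
    """Return sorted, de-duplicated (SRA ID, PMID) pairs found in one article."""
    ids = set(SRA_ID_PATTERN.findall(full_text))
    return sorted((sra_id, pmid) for sra_id in ids)


# Hypothetical article text and PMID, for illustration only.
example_text = (
    "Reads were deposited in the Sequence Read Archive under SRP000001 "
    "(runs SRR000002 and SRR000003); see also DRA000010 and SRP000001."
)
pairs = extract_sra_pmid_pairs("00000000", example_text)
for sra_id, pmid in pairs:
    print(sra_id, pmid)
```

Collecting matches into a set before pairing keeps an accession that is mentioned several times in one article from producing duplicate SRA ID-PMID pairs.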

Highlights

High-throughput sequencing technology is a powerful technique for determination of an entire genome sequence and quantification of the transcriptome at base-pair resolution with a large dynamic range
The Sequence Read Archive (SRA) database is a primary archive of public high-throughput sequencing data, and provides experimental designs such as project titles and sequencers along with raw sequences as six objects of metadata XML files
In SRA, additional experiments are often assigned to a previous project and deposited as a new submission, so the number of submissions exceeds that of projects; many submissions contain only a partial set of metadata files, even when the “analysis” files, which are optional for submission, are excluded from consideration (Figure 1)

Summary

Introduction

High-throughput sequencing technology is a powerful technique for determination of an entire genome sequence and quantification of the transcriptome at base-pair resolution with a large dynamic range. Sequencers using massively parallel sequencing technology, called next-generation sequencers (NGS), drastically reduce the cost and time of sequencing compared with previous methods, and are rapidly becoming the technology of choice for such purposes [1]. This type of sequencer yields a vast quantity of captured images, in-process files, and numerous sequence reads, requiring an extensive amount of disk space [2]. Such data are important for researchers and should be shared, as are the nucleotide sequences in GenBank and microarray data in the Gene Expression Omnibus (GEO).



Similar Papers

Don't just dump your data and run: Authors should submit as much experimental information as possible when uploading sequence data.
Matheus Sanitá Lima ... David Roy Smith
EMBO reports | VOL. 18
27 Oct 2017

A Challenge to Integrate Bioinformatics and Biodiversity Informatics Data as Museomics
Takeru Nakazato
Biodiversity Information Science and Standards | VOL. 2
22 May 2018

Post-archival genomics and the bulk logistics of DNA sequences
Adrian Mackenzie ... Stuart Sharples
BioSocieties | VOL. 11
29 Jun 2015

Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive.
Tazro Ohta ... Hidemasa Bono
GigaScience | VOL. 6
25 Apr 2017
