BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Jinxiang Chen, Shuqin Li, Tatiana T Marquez-Lago, Miao Wang, André Leier, Quanzhong Liu, Junlong Li, Jiangning Song, Fuyi Li, Jerico Revote

doi:10.3389/fdata.2021.727216

Abstract

Background: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, a sequenced genome often misses several highly repetitive regions, and many non-model species lack a complete genome altogether. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data therefore poses a big data analysis challenge for traditional stand-alone tools.

Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read pair merging and SSR mining on very large-scale DNA sequence data.

Conclusions: The performance of BigFiRSt is mainly attributable to the Hadoop big data technology, which lets it merge read pairs and mine SSRs in parallel through distributed computing on clusters. We anticipate that BigFiRSt will be a valuable tool in the coming biological Big Data era.
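To make the SSR mining task concrete, the following minimal Python sketch scans a single read for perfect tandem repeats. It is an illustration only and not the PERF or BigPERF algorithm; the names `SSR`, `is_primitive`, and `find_ssrs`, and the defaults of a 1 to 6 bp motif and a 12 bp minimum length, are assumptions chosen for the example.

```python
from typing import List, NamedTuple


class SSR(NamedTuple):
    start: int    # 0-based offset of the repeat in the sequence
    motif: str    # repeating unit, e.g. "AC"
    repeats: int  # number of complete tandem copies


def is_primitive(motif: str) -> bool:
    """Return True if the motif is not itself a repetition of a shorter unit."""
    # A string is a repetition of a shorter unit exactly when it occurs
    # inside its own doubling with the first and last characters trimmed.
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq: str, max_motif: int = 6, min_length: int = 12) -> List[SSR]:
    """Report maximal perfect tandem repeats (hypothetical defaults)."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for k in range(1, max_motif + 1):
        run_start = 0  # start of the current stretch with period k
        for j in range(k, n + 1):
            # The stretch ends at a period break or at the end of the read.
            if j == n or seq[j] != seq[j - k]:
                length = j - run_start
                motif = seq[run_start:run_start + k]
                if (length >= max(min_length, 2 * k)
                        and set(motif) <= set("ACGT")
                        and is_primitive(motif)):
                    hits.append(SSR(run_start, motif, length // k))
                # A new stretch may overlap the last k - 1 characters.
                run_start = j - k + 1
    return sorted(hits)


if __name__ == "__main__":
    read = "GGT" + "AC" * 7 + "TTG" + "A" * 12 + "C"
    for hit in find_ssrs(read):
        print(hit)
```

The scan is independent for each read, which is the property that makes this kind of search straightforward to distribute across the nodes of a cluster.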

Highlights

Simple Sequence Repeats (SSRs), also known as short tandem repeats (STRs) or microsatellites (Fan and Chu, 2007; Madesis et al., 2013), are highly mutable nucleotide sequences (Vargas Jentzsch et al., 2013).
We propose BigFiRSt (Big data-based Flash and peRf algorithm for mining Ssrs), a novel Hadoop-based program suite designed to integrate paired-end read merging and SSR search into an effective computational pipeline.
To handle massive datasets and to facilitate data processing on local computers, we provide the source code of BigFiRSt for download at https://github.com/JinxiangChenHome/BigFiRSt, so that users can configure and run BigFiRSt on a Hadoop cluster.

Summary

Introduction

Simple Sequence Repeats (SSRs), also known as short tandem repeats (STRs) or microsatellites (Fan and Chu, 2007; Madesis et al., 2013), are highly mutable nucleotide sequences (Vargas Jentzsch et al., 2013). With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data therefore poses a big data analysis challenge for traditional stand-alone tools.
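The merging step mentioned above can be sketched as follows. This is a deliberately naive illustration, assuming the second read is sequenced from the opposite strand; it ignores base qualities and is not the FLASH or BigFLASH algorithm. The function names and the `min_overlap` and `max_mismatch_ratio` parameters are hypothetical.

```python
from typing import Optional

# Complement table for the four bases plus the ambiguity code N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_ratio: float = 0.1) -> Optional[str]:
    """Merge a read pair at the longest acceptable overlap, or return None.

    Assumption for this sketch: the suffix of read1 overlaps the prefix of
    the reverse-complemented read2. Where the two reads disagree inside the
    overlap, the base from read1 is kept.
    """
    mate = reverse_complement(read2)
    # Try the longest candidate overlap first, then shorter ones.
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / olen <= max_mismatch_ratio:
            return read1 + mate[olen:]
    return None  # no overlap of at least min_overlap bases was found


if __name__ == "__main__":
    fragment = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACG"
    r1 = fragment[:28]
    r2 = reverse_complement(fragment[12:])
    print(merge_pair(r1, r2) == fragment)
```

Because each pair is merged independently of every other pair, the work can be split across many workers, and the merged reads can then be passed on to the SSR search.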



Journal: Frontiers in big data	Publication Date: Jan 18, 2022
Citations: 3	License type: CC BY 4.0

