SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision.

Marek S Wiewiórka, Alicja Pacholewska, Michał J Okoniewski, Sergio Maffioletti, Antonio Messina, Piotr Gawrysiak

doi:10.1093/bioinformatics/btu343


Open Access

https://doi.org/10.1093/bioinformatics/btu343


Journal: Bioinformatics (Oxford, England)
Publication Date: May 19, 2014
Citations: 102
License type: other-oa

Affiliation: University of Bern

Abstract

Many time-consuming analyses of next-generation sequencing data can be addressed with modern cloud computing. Apache Hadoop-based solutions have become popular in genomics because of their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying. The SparkSeq software has been created to take advantage of a new MapReduce framework, Apache Spark, for next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them in an interactive way. SparkSeq opens up the possibility of customized ad hoc secondary analyses and iterative machine learning algorithms. This article demonstrates its scalability and overall fast performance by running analyses of sequencing datasets. Tests of SparkSeq also show that cache usage and HDFS block size can be tuned for optimal performance on multiple worker nodes. SparkSeq is available under the open source Apache 2.0 license: https://bitbucket.org/mwiewiorka/sparkseq/.
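
To make the idea of a MapReduce-style analysis "with nucleotide precision" concrete, the following is a minimal sketch of per-base coverage counting in plain Python. It is not SparkSeq code and does not use the SparkSeq or Apache Spark APIs. The read tuples, the function names (`map_read`, `merge_counts`, `base_coverage`) and the simplification that every read covers a gapless interval are assumptions made only for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical aligned reads: (chromosome, 1-based start, read length).
# Real alignments carry CIGAR strings, strands and qualities; all of that
# is ignored here (assumption: each read covers one gapless interval).
reads = [
    ("chr1", 100, 5),
    ("chr1", 102, 5),
    ("chr1", 104, 3),
    ("chr2", 10, 2),
]

def map_read(read):
    """Map step: emit a count of 1 for every base position the read covers."""
    chrom, start, length = read
    return Counter({(chrom, pos): 1 for pos in range(start, start + length)})

def merge_counts(left, right):
    """Reduce step: sum the per-base counts of two partial results."""
    return left + right

def base_coverage(all_reads):
    """Combine the map and reduce steps into a per-base coverage table."""
    return reduce(merge_counts, map(map_read, all_reads), Counter())

coverage = base_coverage(reads)
```

In a distributed framework the map step would run in parallel on partitions of the alignment file and the reduce step would merge partial counts across worker nodes. Keeping the merged result in memory is what would let follow-up queries run interactively instead of as separate batch jobs.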



