The Seal suite of distributed software for high-throughput sequencing

Luca Pireddu, Simone Leo, Gianluigi Zanetti

doi:10.14806/ej.17.b.267

Abstract

Modern DNA sequencing machines have opened the floodgates of whole-genome data, and current processing techniques are being washed away. Medium-sized sequencing laboratories can produce terabytes of data per week that need processing. Unfortunately, most programs available for sequence processing are not designed to scale easily to such high data rates, nor are the typical bioinformatics workflow designs. As a consequence, many sequencing operations are left struggling to cope with the high data loads, often hoping that acquiring additional hardware will solve their problems. In contrast, we believe that a change in paradigm is required to solve this problem: a shift to highly parallelized software is required to handle the parallelization that has taken place in sequencing.

In response to the growing processing requirements of the CRS4 Sequencing and Genotyping Platform (CSGP), which now houses 4 Illumina HiSeq 2000 sequencers for a total capacity of about 7000 Gbases/month, we began the development of Seal [3], a new suite of sequence processing tools based on the MapReduce [1] programming model that run on the Hadoop framework. Seal aims to replace many of the tools that are customarily used in sequencing workflows with Hadoop-based, scalable alternatives. Currently, Seal provides distributed MapReduce tools for demultiplexing tagged reads, mapping reads to a reference (it includes a distributed version of the BWA aligner [2]), and sorting reads by alignment position. In the near future we will also be adding tools for read quality recalibration.

Seal tools have been shown to scale well with both the amount of input data and the number of computational nodes available [4]; therefore, with Seal one can increase processing throughput simply by adding more computing nodes. Moreover, thanks to the robust platform provided by Hadoop, the effort required by operators to run the analyses on a large cluster is generally reduced, since Hadoop transparently handles most hardware and transient network problems, and provides a friendly web interface to monitor job progress and logs. Finally, the Hadoop Distributed File System (HDFS) provides a storage system whose total throughput scales hand in hand with the number of processing nodes. Thus, it avoids creating a bottleneck at the shared storage volume and removes the need for an expensive high-performance parallel storage device.

Seal is currently in production use at the CRS4 Sequencing and Genotyping Platform and is being evaluated at various other sequencing centers.
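To make the MapReduce decomposition mentioned above more concrete, here is a minimal in-memory sketch of how demultiplexing tagged reads could be split into map, shuffle, and reduce steps. It is not Seal's code or API and does not use Hadoop. The barcode table, the 4-base tag length, the assumption that the tag is a prefix of the read, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical barcode-to-sample table (an assumption, not from the paper).
BARCODES: Dict[str, str] = {"ACGT": "sample_A", "TTAG": "sample_B"}
TAG_LEN = 4  # assumed tag length; the tag is assumed to prefix each read

Read = Tuple[str, str]  # (read id, bases)


def map_read(read: Read) -> Iterator[Tuple[str, Read]]:
    """Map step: key each read by the sample its tag belongs to."""
    read_id, bases = read
    sample = BARCODES.get(bases[:TAG_LEN], "unknown")
    # Emit the read with its tag trimmed off.
    yield sample, (read_id, bases[TAG_LEN:])


def shuffle(pairs: Iterable[Tuple[str, Read]]) -> Dict[str, List[Read]]:
    """Shuffle step: group values by key (a framework would do this across nodes)."""
    groups: Dict[str, List[Read]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sample(sample: str, reads: List[Read]) -> Tuple[str, List[Read]]:
    """Reduce step: collect one sample's reads, ordered by read id."""
    return sample, sorted(reads)


def demultiplex(reads: Iterable[Read]) -> Dict[str, List[Read]]:
    """Run the three steps sequentially in a single process."""
    pairs = (pair for read in reads for pair in map_read(read))
    return dict(reduce_sample(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    example = [("r2", "TTAGGGCC"), ("r1", "ACGTAAAA"), ("r4", "NNNNTTTT")]
    for sample, sample_reads in demultiplex(example).items():
        print(sample, sample_reads)
```

Because each map call looks at a single read and each reduce call at a single sample, both steps can in principle run on many nodes at once, which is the property that lets throughput grow as nodes are added.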

Journal: EMBnet.journal
Publication Date: Feb 28, 2012
License type: cc-by-nc-sa
