FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy

Umberto Ferraro Petrillo, Francesco Palini, Giuseppe Cattaneo, Raffaele Giancarlo

doi:10.1186/s12859-021-04063-1

Open Access

Abstract

Background: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic.

Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.

Conclusions: Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future.

Availability: The software and the datasets are available at https://github.com/fpalini/fastdoopc

Highlights

Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods
Disk space and reading time savings apply to the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System
The intent of Experiments 1-3 is to provide evidence of the space and time performance advantages deriving from the adoption of specialized FASTA/Q compressors within MapReduce-Hadoop

Summary

Introduction

Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods.

The relation between Big Data Technologies and FASTA/Q data compression in bioinformatics

For the same reasons of massive data production, the development and use of Big Data technologies for Genomics and the Life Sciences have been indicated as directions to be actively pursued [9], with MapReduce [10], Hadoop [11] and Spark [12] being the preferred ones [13]. Processing files compressed in a non-splittable format is still possible under Hadoop, but at the cost of very long decompression times (data not shown but available upon request). Further discussion of these topics is in the section “Preliminary”. Data compressors whose output can instead be split are referred to as splittable Codecs.
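The difference between splittable and non-splittable compressed formats can be illustrated with a minimal sketch. It is not the authors' method and does not use the Hadoop codec API: it only shows, in plain Python with the standard zlib module, why compressing data in independent blocks and keeping an index of block offsets lets each block be decompressed on its own, for example by a separate worker. The function names compress_blocks and decompress_block, the block size and the toy FASTA-like records are all assumptions made for illustration.

```python
import zlib


def compress_blocks(data: bytes, block_size: int):
    """Compress `data` as independent zlib blocks.

    Returns the concatenated compressed payload and an index holding
    (offset, compressed_length) for every block. Hypothetical helper,
    not part of Hadoop or of the paper's software.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        comp = zlib.compress(data[start:start + block_size])
        index.append((len(payload), len(comp)))
        payload += comp
    return bytes(payload), index


def decompress_block(payload: bytes, index, i: int) -> bytes:
    """Decompress block `i` alone, without touching any other block."""
    offset, length = index[i]
    return zlib.decompress(payload[offset:offset + length])


# Toy FASTA-like input (made-up records, for illustration only).
records = b"".join(b">seq%d\nACGTACGTACGT\n" % i for i in range(1000))
payload, index = compress_blocks(records, block_size=4096)

# A worker assigned to the middle of the file needs only its own block.
middle = decompress_block(payload, index, len(index) // 2)

# Concatenating all independently decompressed blocks restores the input.
restored = b"".join(
    decompress_block(payload, index, i) for i in range(len(index))
)
assert restored == records
```

With a single monolithic compressed stream, by contrast, a worker interested in the middle of the file would generally have to decompress everything before it, which is consistent with the long decompression times reported above for non-splittable formats.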

Journal: BMC Bioinformatics
Publication Date: Mar 22, 2021
Citations: 6
License type: open-access
