CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.

Md Ashiqur Rahman, Abdullah Aman Tutul, Sifat Muhammad Abdullah, Md Shamsuzzoha Bayzid

doi:10.1371/journal.pone.0265360


Open Access

https://doi.org/10.1371/journal.pone.0265360


Abstract

High-throughput experimental technologies are generating tremendous amounts of genomic data, offering valuable resources for answering important questions and extracting biological insights. Storing this volume of genomic data has become a major concern in bioinformatics. General-purpose compression techniques (e.g., gzip, bzip2, 7-zip) are widely used because they are pervasive and relatively fast. However, they are not customized for genomic data and may fail to exploit the special characteristics and redundancy of biomolecular sequences. We present a new lossless compression method, CHAPAO (COmpressing Alignments using Hierarchical and Probabilistic Approach), which is specifically designed for multiple sequence alignments (MSAs) of biomolecular data and offers very good compression gain. We introduce a novel hierarchical referencing technique for representing biomolecular sequences, which combines likelihood-based analyses of sequence similarities with graph-theoretic algorithms. We performed an extensive evaluation using a collection of real biological data from the avian phylogenomics project, the 1000 plants project (1KP), and 16S and 23S rRNA datasets. We report the performance of CHAPAO in comparison with general-purpose compression techniques as well as with MFCompress and Nucleotide Archival Format (NAF), two of the best-known methods specifically designed for FASTA files. Experimental results suggest that CHAPAO offers significant improvements in compression gain over most of the alternative methods. CHAPAO is freely available as open-source software at https://github.com/ashiq24/CHAPAO. CHAPAO advances the state of the art in compression algorithms and represents a potential alternative to general-purpose compression techniques as well as to existing specialized compression techniques for biomolecular sequences.
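The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea of hierarchical reference-based representation, not CHAPAO's actual method. As stated assumptions: plain Hamming distance stands in for the likelihood-based similarity analysis, a minimum spanning tree (Prim's algorithm) stands in for the graph-theoretic step, all rows of the alignment have equal length, and the function names `build_reference_tree`, `encode`, and `decode` are hypothetical.

```python
from typing import List, Tuple, Union

Diff = Union[str, List[Tuple[int, str]]]


def hamming(a: str, b: str) -> int:
    """Number of aligned columns where two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def build_reference_tree(seqs: List[str]) -> List[int]:
    """Prim's algorithm over pairwise Hamming distances.

    Returns, for each row, the index of the row it is encoded against
    (-1 for the root). ASSUMPTION: Hamming distance replaces the
    likelihood-based similarity described in the abstract.
    """
    n = len(seqs)
    parent = [-1] * n
    in_tree = [False] * n
    best = [float("inf")] * n
    best[0] = 0  # row 0 is arbitrarily chosen as the root
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v]:
                d = hamming(seqs[u], seqs[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return parent


def encode(seqs: List[str]) -> Tuple[List[int], List[Diff]]:
    """Store the root verbatim and every other row as (position, char) edits
    relative to its parent row in the reference tree."""
    parent = build_reference_tree(seqs)
    diffs: List[Diff] = []
    for i, s in enumerate(seqs):
        if parent[i] == -1:
            diffs.append(s)
        else:
            ref = seqs[parent[i]]
            diffs.append([(p, c) for p, (c, r) in enumerate(zip(s, ref)) if c != r])
    return parent, diffs


def decode(parent: List[int], diffs: List[Diff]) -> List[str]:
    """Rebuild every row by applying its edits on top of its decoded parent."""
    out: List[Union[str, None]] = [None] * len(parent)

    def resolve(i: int) -> str:
        if out[i] is None:
            if parent[i] == -1:
                out[i] = diffs[i]
            else:
                chars = list(resolve(parent[i]))
                for pos, c in diffs[i]:
                    chars[pos] = c
                out[i] = "".join(chars)
        return out[i]

    return [resolve(i) for i in range(len(parent))]


# Toy alignment (made-up rows, not data from the paper).
msa = ["ACGT-ACGT", "ACGT-ACGA", "ACGTTACGA", "TCGT-ACGT"]
parent, diffs = encode(msa)
restored = decode(parent, diffs)
```

In this toy case each non-root row is reduced to a single edit against a closely related row, and decoding restores the alignment exactly, which is the sense in which such a representation is lossless. A real compressor would additionally serialize and entropy-code the edits; that step is omitted here.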

Journal: PLOS ONE
Publication Date: Apr 18, 2022
License type: CC BY 4.0
