CoMSA: compression of protein multiple sequence alignment files.

Sebastian Deorowicz, Joanna Walczyszyn, Agnieszka Debudaj-Grabysz

doi:10.1093/bioinformatics/bty619

Abstract

Bioinformatics databases are growing rapidly and have reached sizes that were hard to imagine a decade ago. Among the many bioinformatics processes that generate hundreds of GB of data is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, occupies 40-230 GB, depending on the variant. Storing and transferring such massive data has become a challenge. We propose a novel compression algorithm, CoMSA, designed specifically for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform for non-binary alphabets. CoMSA handles both FASTA and Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, such as gzip. Our experiments also yielded an analysis of how protein family size influences the compression ratio. CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa. Supplementary data are available at Bioinformatics online.
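To make the idea of a positional Burrows-Wheeler transform over a non-binary alphabet more concrete, the following is a minimal Python sketch. It is not the CoMSA implementation; the function names `pbwt_columns` and `inverse_pbwt_columns` are hypothetical, and the sketch assumes the alignment is given as equal-length strings. Columns are processed left to right, and each column is emitted in an order determined by a stable sort of the sequences on the columns already seen, which tends to group identical symbols into long runs that a later entropy coder could exploit.

```python
def pbwt_columns(rows):
    """Permute each alignment column by the ordering induced by earlier columns.

    Illustrative sketch only (assumed interface, not CoMSA's code).
    `rows` is a list of aligned sequences of equal length.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned rows must have the same length")

    order = list(range(len(rows)))  # current ordering of the sequences
    out = []
    for col in range(width):
        # Emit the symbols of this column in the current sequence order.
        out.append("".join(rows[i][col] for i in order))
        # Stable sort by the current symbol; works for any alphabet size.
        order.sort(key=lambda i: rows[i][col])
    return out


def inverse_pbwt_columns(cols):
    """Rebuild the original rows from the permuted columns."""
    if not cols:
        return []
    n = len(cols[0])
    order = list(range(n))
    rows = [[] for _ in range(n)]
    for col in cols:
        # Position `pos` in the permuted column belongs to sequence order[pos].
        for pos, i in enumerate(order):
            rows[i].append(col[pos])
        # Repeat the same stable sort the forward transform performed.
        order.sort(key=lambda i: rows[i][-1])
    return ["".join(r) for r in rows]


if __name__ == "__main__":
    alignment = ["ACD-", "ACDE", "GCD-", "ACDE"]
    transformed = pbwt_columns(alignment)
    print(transformed)
    print(inverse_pbwt_columns(transformed) == alignment)
```

The transform is lossless because the decoder can replay exactly the same sequence of stable sorts, using only the symbols it has already decoded.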

Journal: Bioinformatics	Publication Date: Jul 13, 2018
Citations: 10

