Abstract
Over the last decade, the cost of genomic sequencing has decreased so much that researchers all over the world are accumulating huge amounts of data for present and future use. These genomic data need to be stored efficiently, because storage costs are not decreasing as fast as the cost of sequencing. To mitigate this problem, general-purpose compression tools, most commonly gzip, are usually used. However, these tools were not specifically designed to compress this kind of data, and they often fall short when the goal is to reduce the data size as much as possible. Several compression algorithms are available, some of them for genomic data, but very few have been designed to deal with Whole Genome Alignments, which contain alignments between the entire genomes of several species. In this paper, we present a lossless compression tool, MAFCO, specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain of 34% to 57%, depending on the data set. Compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source code and binaries for several operating systems are freely available for non-commercial use at: http://bioinformatics.ua.pt/software/mafco.
Highlights
Motivated by the dramatic drop in the price of genomic sequencing, the research community is continuously increasing the volume of sequenced data
Our compression tool relies on probabilistic models, known as finite-context models, that are quite effective for DNA data compression [5]
We present the compression results obtained with several popular general-purpose compression methods, such as gzip, bzip2, ppmd, and lzma, as well as with the method of [24], the maf-bgzip tool [25,26,27], and the compression tool described in this paper
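To make the idea of a finite-context model mentioned above more concrete, the following is a minimal sketch, not MAFCO's actual implementation. It assumes an adaptive order-k model over the four-letter alphabet ACGT with additive smoothing, and it only estimates the number of bits an ideal arithmetic coder would need. The function name `fcm_bits` and the parameters `order` and `alpha` are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


def fcm_bits(sequence, order=2, alphabet="ACGT", alpha=1.0):
    """Estimate the bits needed to code `sequence` with an adaptive
    order-`order` finite-context model (illustrative sketch only).

    Assumption: every symbol of `sequence` belongs to `alphabet`.
    """
    # For each context (the previous `order` symbols), keep a count of
    # how often each symbol has followed it so far.
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        # The first few symbols have a shorter context (a prefix).
        context = sequence[max(0, i - order):i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Additive (Laplace-style) smoothing keeps every probability > 0.
        prob = (ctx_counts[symbol] + alpha) / (total + alpha * len(alphabet))
        # Ideal code length for this symbol, in bits.
        total_bits += -math.log2(prob)
        # Update the model only after coding, so a decoder could
        # reproduce exactly the same probabilities.
        ctx_counts[symbol] += 1
    return total_bits


if __name__ == "__main__":
    seq = "ACGT" * 50
    print(f"estimated: {fcm_bits(seq):.1f} bits")
    print(f"baseline at 2 bits per base: {2 * len(seq)} bits")
```

On a highly repetitive input such as the one in the usage example, the estimate falls well below the 2 bits per base of a fixed-length code, because the counts quickly concentrate on the symbol that follows each context.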
Summary
Motivated by the dramatic drop in the price of genomic sequencing, the research community is continuously increasing the volume of sequenced data. Whole Genome Alignments tend to be very large, occupying several hundreds of gigabytes of disk space and containing millions of alignment blocks. Handling data at this scale presents several challenges in download speed and storage space. In the last two decades, several specialized algorithms for compressing DNA sequences and several other forms of genomic data were developed and used by the research community (e.g., [3,4,5,6,7,8,9,10,11,12,13,14,15]). Most of these algorithms take into account only the four-letter alphabet, ACGT.