Abstract
An efficient DNA compressor furnishes an approximation with which to measure and compare the information contained in, between, and across DNA sequences, regardless of the characteristics of the sources. In this paper, we directly compare two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions: the NCD measures how similar two strings are in terms of information content, while the NRC (which is, in general, nonsymmetric) indicates the fraction of one string that cannot be constructed using information from the other. This raises the problem of deciding which measure (or question) is suitable for the answer we need. To compute both, we use a state-of-the-art DNA sequence compressor, which we benchmark against several top compressors in different compression modes. We then apply the compressor to DNA sequences of different scales and natures, first synthetic sequences and then real DNA sequences. The latter include mitochondrial DNA (mtDNA), messenger RNA (mRNA), and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely the observation, confirmed across whole genomes, of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.
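To make the two measures concrete, the sketch below computes rough versions of both. It is not the paper's method: the paper relies on a specialized DNA compressor, whereas this sketch uses zlib as a stand-in for C(.), and it approximates the relative compression C(x||y) with zlib's preset dictionary, which retains only the last 32 KiB of y and, unlike a pure relative compressor, still exploits repetitions inside x itself. The function names and the toy sequences are illustrative assumptions, not artifacts from the paper.

    import zlib

    def c(x: bytes) -> int:
        # Compressed size of x in bytes; zlib stands in for the DNA compressor.
        return len(zlib.compress(x, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
        # Close to 0 for very similar strings, close to 1 for unrelated ones.
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    def c_rel(x: bytes, y: bytes) -> int:
        # Rough stand-in for relative compression C(x||y): compress x with a
        # model primed on y (here, zlib's preset dictionary).
        co = zlib.compressobj(level=9, zdict=y)
        return len(co.compress(x) + co.flush())

    def nrc(x: bytes, y: bytes) -> float:
        # NRC(x||y) = C(x||y) / (|x| * log2 |A|); for the DNA alphabet
        # {A, C, G, T}, log2 |A| = 2 bits per symbol.
        return (8 * c_rel(x, y)) / (2 * len(x))

    # Toy sequences: seq2 shares half of its content with seq1.
    seq1 = b"ACGT" * 2000
    seq2 = b"ACGT" * 1000 + b"GGCA" * 1000
    print(ncd(seq1, seq2))                   # approximately symmetric
    print(nrc(seq1, seq2), nrc(seq2, seq1))  # generally asymmetric

With the true (uncomputable) Kolmogorov complexity in place of C, the NCD would be a universal similarity metric; in practice, the quality of the compressor bounds the quality of both approximations, which is why the choice and benchmarking of the compressor matter.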
Highlights
We describe the primate dataset and the parameters used in the compressor, benchmark the compressor in different compression modes, make predictions by applying alterations to the datasets, and present the empirical results.
All the results presented in this paper can be reproduced, under a Linux OS, using the scripts runNC.sh, runNCD.sh, runNRC.sh, runReferenceFreeComparison.sh, runReferenceFreeConjoint.sh, and runRelativeCompressors provided in the repository https://github.com/pratas/APE.
Summary
In 1965 [1], Kolmogorov described three ways to measure the information contained in strings: combinatorial [2,3], probabilistic [4], and algorithmic. The algorithmic approach, known as Kolmogorov complexity or algorithmic entropy, enables the measurement and comparison of the information (or complexity) contained in different natural processes that can be expressed as sequences of symbols (strings) from a finite alphabet [1,5,6,7,8,9,10,11,12,13]. Kolmogorov complexity differs from Shannon entropy [4] in that it regards the source not as generating symbols from a probability function, but as creating structures that follow algorithmic schemes [14,15]. Successful approaches based on small Turing machines [16] have been proposed and implemented (for example, [17,18]).
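For reference, the two underlying notions can be written down explicitly; these are the standard textbook definitions, not formulas taken from this paper:

    % Shannon entropy of a source emitting symbols x with probability p(x)
    H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)

    % Kolmogorov complexity of a string s with respect to a universal
    % Turing machine U: the length of the shortest program p that outputs s
    K(s) = \min \{\, |p| : U(p) = s \,\}

The first depends only on the symbol statistics of the source; the second depends on the shortest algorithmic description of the individual string, which is why it can capture structure that probabilistic models miss, but is uncomputable and must be approximated, for instance by compressors.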