This paper continues the authors' previous work. We consider various algorithms for calculating distances between genomes of similar species (primarily using mitochondrial DNA, mtDNA) and the various distance matrices between the same genomes obtained from these algorithms. Simplifying slightly, all our publications on DNA analysis deal with applications of metrics defined on such matrices. The paper also has a second subject: the study of the obtained distance matrices using special statistical characteristics. We consider two matrices obtained for the mtDNA of 32 species of monkeys; the species were selected so that they all belong to different genera. The two matrices of distances between genomes correspond to the Jaro–Winkler and Needleman–Wunsch algorithms. Next, we considered all the triangles formed in these matrices, and for each of them we computed a specially defined badness, which is a measure of the deviation of the triangle from an acute-angled isosceles one. For the two resulting sequences of badness values, we considered variants of paired correlation. In addition to the two standard pair correlation algorithms (Spearman's and Kendall's), we also considered a new algorithm that we propose. The reason for considering this new algorithm is as follows. In the usual way of calculating correlation, we consider only the set of pairwise values of two random variables, without taking into account the pairs themselves. Conversely, both of the mentioned pair correlation algorithms, despite their slight differences, consider only the order of the elements in these pairs and ignore the values themselves; we note specifically that this also applies to Spearman's criterion, which is usually described as more suitable for measurements made on an ordinal scale.
In our proposed algorithm, we tried to take into account both the values of the two random variables and their order in pairs. The results obtained are of interest. The “pole” variants (i.e., the usual correlation formula and the standard pair correlation algorithms) show some, though very small, correlation between the two sequences of 4960 pairs of triangles: from 0.1 to 0.4, depending on the specific algorithm, on whether preliminary normalization was carried out, etc. The “intermediate” variant (taking into account both the order of pairs and the values of the random variables) showed a complete lack of connection: the absolute value of the correlation coefficient did not exceed 0.006. Even more interesting is another result of the work, which can be described as a weak connection between two well-known algorithms for determining the distances between genomes, namely the Jaro–Winkler and Needleman–Wunsch algorithms.
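To make the pipeline described above concrete, the following is a minimal sketch in Python, not the authors' implementation. It enumerates all triangles of a distance matrix (for 32 species this gives C(32, 3) = 4960 triangles), assigns each a badness value, and compares two badness sequences with the Spearman and Kendall coefficients. The abstract does not give the badness formula, so `triangle_badness` is a hypothetical stand-in: it is zero exactly when the two longest sides are equal, which corresponds to an acute-angled isosceles triangle. All function names are assumptions, and the new correlation algorithm proposed in the paper is not reproduced here because the abstract does not specify it.

```python
from itertools import combinations
from math import sqrt


def triangle_badness(dist, i, j, k):
    """Hypothetical stand-in for the paper's badness (formula not given there).

    Returns 0 when the two longest sides are equal (acute isosceles case).
    """
    a, b, c = sorted((dist[i][j], dist[i][k], dist[j][k]))
    return (c - b) / c if c > 0 else 0.0


def badness_sequence(dist):
    """Badness of every triangle (i < j < k) in a symmetric distance matrix."""
    n = len(dist)
    return [triangle_badness(dist, i, j, k)
            for i, j, k in combinations(range(n), 3)]


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Usual (value-based) correlation; undefined for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)
```

Given two 32×32 distance matrices `jw` and `nw` (hypothetical names), one would call `spearman(badness_sequence(jw), badness_sequence(nw))` and likewise `kendall_tau_a`. The sketch shows the contrast drawn in the text: `pearson` uses only the values, while `spearman` and `kendall_tau_a` depend only on their order. The quadratic loop in `kendall_tau_a` visits about 12.3 million pairs for 4960 values, which is slow but feasible in pure Python.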