Two Dimensional Yau-Hausdorff Distance with Applications on Comparison of DNA and Protein Sequences.

Kun Tian, Changchuan Yin, Rong L He, Xiaoqian Yang, Stephen S.-T Yau, Qin Kong, Yang Zhang

doi:10.1371/journal.pone.0136577

Abstract

Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Although many methods are available for sequence comparison, few retain the information content of the sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match between the graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two-dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method measures the similarity of DNA sequences using two tools: the Yau-Hausdorff distance and the graphical representation of DNA sequences. The graphical representations conserve all sequence information, and the Yau-Hausdorff distance is mathematically proved to be a true metric. Therefore, the proposed distance can precisely measure the similarity of DNA sequences. Phylogenetic analyses of DNA sequences using the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that the Yau-Hausdorff distance is a natural metric for DNA and protein sequences with a high level of stability. The approach can also be applied to similarity analysis of protein sequences through graphical representations, as well as to general two-dimensional shape matching.

Highlights

Comparison of DNA sequences or protein sequences is a problem that has been studied in the biological sciences for years.
We apply the Yau-Hausdorff method to the DNA sequences of the cytochrome c oxidase I (COI) genes, barcoding, H1N1, and the influenza virus neuraminidase (NA) genes to verify how accurately the method clusters genomes.
The distance provided by the natural vector method is the Euclidean distance between the vectors that represent DNA sequences in the 12-dimensional space R^12, whereas the Yau-Hausdorff method is based on calculating the minimum Hausdorff distance between point sets derived from the graphical representations of the sequences.
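
To make the term "Hausdorff distance of point sets" concrete, the following minimal sketch computes the classical Hausdorff distance between two finite sets of 2D points. It is only the fixed-position building block: it does not perform the minimization over translations and rotations that the Yau-Hausdorff method describes, and the function names and sample points are illustrative assumptions, not taken from the paper.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical point sets standing in for two graphical curves.
curve_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
curve_b = [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]

print(hausdorff(curve_a, curve_b))
```

This brute-force version compares every pair of points, so it runs in time proportional to the product of the two set sizes. A minimum Hausdorff distance would additionally search over rigid motions of one set, which is the part this sketch leaves out.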

Summary

Introduction

Comparison of DNA sequences or protein sequences is a problem that has been studied in the biological sciences for years. Many approaches have been proposed for measuring the similarity between DNA sequences and protein sequences, including multiple sequence alignment [1], moment vectors [2] and feature vectors [3]. In the vector-based approaches, the distance between sequences is defined to be the Euclidean distance between their corresponding vectors. This approach is effective and operates in linear time. Liu et al. developed a Python package for generating various modes of feature vectors for sequences [4]. This method depends on fifteen types of sequence feature vectors, which can become extremely large when computed for long DNA sequences. It offers new tools to address large-scale data for multiple sequence alignment.
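
As a rough illustration of the vector-based idea described above, the sketch below maps each sequence to a fixed-length vector and takes the Euclidean distance between the vectors. The nucleotide-frequency vector used here is a toy stand-in chosen for brevity; it is an assumption for illustration and is not the moment vector, the feature vectors, or the natural vector from the cited work.

```python
import math


def composition_vector(seq, alphabet="ACGT"):
    """Toy feature vector: relative frequency of each nucleotide in `seq`."""
    seq = seq.upper()
    return [seq.count(base) / len(seq) for base in alphabet]


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.dist(u, v)


# Hypothetical example sequences.
x = composition_vector("ACGTACGT")
y = composition_vector("AAAACCGT")

print(euclidean_distance(x, y))
```

Building each vector takes a fixed number of passes over the sequence, so the cost grows linearly with sequence length. The trade-off is that such a small vector discards positional information, so different sequences can share the same vector.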

Methods

Results

Discussion

Conclusion