Sequence similarity using composition method

Deepti Gaur, Geetika Munjal, Pooja Sharma

doi:10.1504/ijds.2018.10011822

Abstract

Deoxyribonucleic acid (DNA) has an enormous capacity to carry important information in the form of character strings. Sequence analysis is the process of applying a wide range of methods to DNA sequences to understand the structure, features or evolution of these nucleotide strings. The analysis uses mathematical methods to convert the character strings to numerical values, which are then used to measure similarity between sequences. DNA sequences contain only four nucleotides, A, C, G and T, but extracting information from these sequences requires sequence comparison. In this paper, various methods for analysing DNA sequences are explored, including the use of entropy, divergence, LZ complexity and the role of hybridisation. A hybrid model based on the composition vector and distance methods is proposed to measure dissimilarity between sequences, and this hybrid model is tested on sequences of species downloaded from the National Center for Biotechnology Information (NCBI).
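To illustrate the general idea of converting nucleotide strings to numerical values and comparing them, the following is a minimal sketch and not the hybrid model proposed in the paper. It assumes a composition vector made of relative k-mer frequencies and a plain Euclidean distance; the function names, the choice of k and the distance measure are all illustrative assumptions.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # the four nucleotides named in the abstract


def composition_vector(seq, k=2):
    """Relative frequencies of all 4**k k-mers (assumed representation).

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dissimilarity(seq_a, seq_b, k=2):
    """Distance between composition vectors (illustrative choice only)."""
    return euclidean_distance(composition_vector(seq_a, k),
                              composition_vector(seq_b, k))
```

Under these assumptions, identical sequences have dissimilarity zero, and larger values indicate more different k-mer compositions. Other distance measures could be substituted for `euclidean_distance` without changing the rest of the sketch.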

Full Text