DNA Sequence Homology Recognition based on Similarity Measurement

Chenhui Yang, Junyan Zhang, Xiaodan Chen

doi:10.2991/isrme-15.2015.24

Abstract

DNA sequence homology recognition is a key problem in bioinformatics. In this paper, we address it with a probabilistic method instead of traditional sequence alignment, because a DNA character sequence satisfies the Markov property. A second-order Markov model is therefore used as the characteristic of a DNA sequence, and the similarity measurement is defined on the basis of the two-step transition probability. On this basis, we put forward our SHR algorithm. Comparative experiments show that the SHR algorithm recognizes DNA sequence homology correctly and at a higher processing speed.

Introduction

Generally, a DNA sequence is treated as a long string of characters over the four-character set ∑={A, C, G, T}. Thus, any DNA sequence S∈∑* [1]. DNA sequence homology recognition is an important problem in bioinformatics: two or more DNA sequences are compared by mathematical algorithms so that homology can be determined on the basis of their similarity [2]. There are many ways to solve this problem, such as:

(1) 2-D or 3-D graphics are employed to represent DNA sequences so that the relationships among them can be analyzed [3]; this gives a better visual effect but lower speed.
(2) The four letters of ∑ are mapped to numerical sequences whose features are compared by numerical analysis [4]; this gives better predictive results but lacks a unified measurement.
(3) DNA sequences are regarded as character strings or texts, and relative distances are used to analyze them [5]; text compression methods can then be introduced to improve speed, but some redundant sequences still remain.
(4) Non-alignment methods are used to analyze the features of DNA sequences in order to improve efficiency, although segmentation and positioning are difficult to achieve [6].

The number of DNA sequences is usually very large and their structures are very complicated.
Therefore, each of the existing algorithms has its own advantages and disadvantages. In this paper, we concentrate on DNA sequence homology recognition based on similarity measurement, using a second-order Markov model [7]. The remainder of this paper is organized as follows. First, the relevant concepts and definitions are presented. Then, the problem-solving idea and our SHR algorithm are described. After that, the comparative experiments and their results are reported. Finally, we conclude the paper.

Concepts and Definitions

Second Order Markov Model. The Markov model is one of the most important stochastic processes, and it is widely applied in modern biology, physics, business, geology, atmospheric science, and so on.

Definition 1. Let $\rho(n) = \{x_n, n \in T\}$ be a discrete-state stochastic process with state space $I$ and non-negative integer parameter $n$. If $\rho(n)$ satisfies the condition $P\{x_{n+1} = i_{n+1} \mid x_0 = i_0, x_1 = i_1, \ldots, x_n = i_n\} = P\{x_{n+1} = i_{n+1} \mid x_n = i_n\}$, then $\rho(n)$ is called a Markov chain, where $i_0, i_1, \ldots, i_{n+1} \in I$.

Definition 2. The conditional probability $p_{ij}(n) = p\{x_{n+1} = j \mid x_n = i\}$ ($i, j \in I$, $n \ge 0$) is called the one-step transition probability of a Markov chain with state space $I$. The matrix $P = (p_{ij})$ formed from the $p_{ij}$ is the one-step transition probability matrix.

International Conference on Intelligent Systems Research and Mechatronics Engineering (ISRME 2015). © 2015. The authors. Published by Atlantis Press.

Definition 3. The conditional probability $p_{ij}^{(2)} = p\{x_{m+2} = j \mid x_m = i\}$ ($i, j \in I$, $m \ge 0$) is called the two-step transition probability of a Markov chain with state space $I$. The matrix $P^{(2)} = (p_{ij}^{(2)})$ is called the two-step transition probability matrix, where $p_{ij}^{(2)} \ge 0$ and $\sum_{j \in I} p_{ij}^{(2)} = 1$.

Definition 4. A second-order Markov model can be denoted by $\lambda = (\pi, P, P^{(2)})$, where $\pi = (\pi_1, \pi_2, \ldots, \pi_k)$ is the initial state distribution and $k$ is the number of possible states of the sequences.

Characteristic Matrix of DNA Sequence. For all DNA sequences, we have I = ∑. The next base does not depend on the bases before the current one.
Therefore, a DNA sequence can be treated as a Markov chain, and it can also be described by its two-step transition probabilities. Hence, each DNA sequence corresponds to one two-step transition probability matrix, which is regarded as its characteristic matrix. So we have:
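
As a rough illustration of what a characteristic matrix of this kind might look like in practice, the following Python sketch estimates a 4×4 transition matrix from a single sequence. It rests on assumptions that are not taken from the paper: the probabilities are estimated by counting pairs of bases that lie `step` positions apart and normalizing each row, symbols outside {A, C, G, T} are skipped, and the names `BASES` and `transition_matrix` are hypothetical. It is not the SHR algorithm itself.

```python
# Illustrative sketch only: an empirical estimate of a step-wise
# transition matrix, not the authors' SHR algorithm.
BASES = "ACGT"  # state space; row and column order of the matrix


def transition_matrix(seq, step=2):
    """Estimate P{x_(m+step) = j | x_m = i} from one sequence.

    Counts every pair of bases that are `step` positions apart and
    normalizes each row so that it sums to 1. step=1 gives a one-step
    estimate, step=2 a two-step estimate.
    """
    index = {base: k for k, base in enumerate(BASES)}
    counts = [[0] * len(BASES) for _ in BASES]

    # zip stops at the shorter argument, so this pairs seq[m] with seq[m + step].
    for first, second in zip(seq, seq[step:]):
        if first in index and second in index:  # skip symbols such as N
            counts[index[first]][index[second]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A base never observed as a starting state keeps an all-zero row
        # (an assumption of this sketch; smoothing would be another option).
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Usage: print the estimated two-step matrix of a short example sequence.
for base, row in zip(BASES, transition_matrix("ACGTACGTTGCA", step=2)):
    print(base, [round(p, 3) for p in row])
```

Counting pairs two positions apart is only one possible estimator. For a time-homogeneous Markov chain, the two-step matrix can also be obtained by squaring the one-step matrix, and the two estimates need not coincide on a finite sequence.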
