Abstract

Similarity analysis of DNA sequences can clarify the homology between sequences and help predict their structure and the relationships between them. Frequent patterns in biological sequences not only explain the genetic characteristics of an organism but also serve as markers for certain biological events. However, most existing methods for biological sequence similarity analysis target the entire sequential pattern and ignore missing gene fragments that may induce disease. Similarity analysis of sequences containing a missing gene item remains unexplored; consequently, sequences with missing bases are ignored or not analyzed effectively. This paper therefore presents a new method for DNA sequence similarity analysis. The method first mines not only positive sequential patterns but also sequential patterns in which some base items are missing (collectively referred to as negative sequential patterns). It then uses these frequent patterns for similarity analysis on a two-dimensional plane. Several experiments were conducted to verify the effectiveness of the algorithm. The results show that the algorithm can produce different results depending on which frequent sequential patterns are selected, and that accuracy and time efficiency were improved.

Highlights

  • In recent years, a large volume of biological sequence data has been generated

  • Because the DNA sequence corresponds to its time series one to one, the similarity of the DNA

  • We compared the frequent pattern mining results for the first exon of the β-protein gene of 10 different species, based on our proposed graphical representation

Summary

Introduction

When a new DNA sequence is obtained, similarity analysis is used to determine whether it resembles a known sequence. If the two are homologous, the function of the new sequence need not be determined from scratch, which saves time and effort. Similarity analysis of biological sequences is by no means a straightforward mechanical comparison. Alignment and other classical methods are the most common approaches. Two factors directly affect the similarity score: the substitution matrix and the gap penalty. The gap penalty compensates for the influence of insertions and deletions on sequence similarity, but no suitable theoretical model exists to describe the gap problem. Gap penalty scores lack a sound theoretical basis and are subjective.
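
To make the dependence on the gap penalty concrete, the following is a minimal sketch of a classical global alignment score (Needleman-Wunsch dynamic programming). It is not the method proposed in this paper. The function name `global_alignment_score` and the parameters `match`, `mismatch`, and `gap` are assumptions chosen for illustration, and the fixed match/mismatch scores stand in for a full substitution matrix.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Illustrative assumptions: a linear gap penalty (each gap column costs
    `gap`) and a two-valued match/mismatch score instead of a substitution
    matrix.
    """
    # prev[j] is the best score for aligning the first i-1 bases of a
    # with the first j bases of b; row 0 aligns the empty prefix of a.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: the first i bases of a aligned against gaps only.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] aligned with a gap
                curr[j - 1] + gap,  # b[j-1] aligned with a gap
            )
        prev = curr
    return prev[-1]


# The same pair of sequences scores differently under different gap penalties.
for penalty in (-1, -2, -5):
    print(penalty, global_alignment_score("ACGT", "AGT", gap=penalty))
```

Because the two sequences differ by one deleted base, every alignment needs at least one gap, so the reported similarity moves directly with the chosen penalty. That is the subjectivity described above.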
