Abstract

Background

Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth in sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important problem in bioinformatics, and it requires the prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, their implementation requires estimating a large number of parameters, which is computationally expensive.

Results

The proposed method for splice site detection consists of two stages: a first-order Markov model (MM1) is used in the first stage, and a support vector machine (SVM) with a polynomial kernel is used in the second stage. The MM1 serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides around splice site regions in terms of probabilistic parameters. These probabilistic parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. When the proposed MM1-SVM model is compared with other existing standard splice site detection methods, it shows superior performance in all cases.

Conclusion

We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This is a simple yet effective splice site detection method, which shows better classification accuracy and computational speed than some other, more complex methods.
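The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`train_mm1`, `mm1_features`, `poly_kernel`), the pseudocount smoothing, the kernel settings, and the toy sequences are all assumptions made for illustration. The sketch estimates position-specific first-order transition probabilities from aligned sequences, turns a candidate sequence into a vector of those probabilities, and evaluates a polynomial kernel of the kind an SVM would apply to such vectors.

```python
ALPHABET = "ACGT"


def train_mm1(sequences, pseudocount=1.0):
    """Estimate a position-specific first-order Markov model from aligned,
    equal-length sequences. The pseudocount is an assumed smoothing choice."""
    length = len(sequences[0])
    # first[b] approximates P(s_0 = b)
    first = {b: pseudocount for b in ALPHABET}
    # trans[i][a][b] approximates P(s_i = b | s_{i-1} = a); index 0 is unused
    trans = [{a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
             for _ in range(length)]
    for seq in sequences:
        first[seq[0]] += 1
        for i in range(1, length):
            trans[i][seq[i - 1]][seq[i]] += 1
    # Normalize counts into probabilities
    total = sum(first.values())
    first = {b: c / total for b, c in first.items()}
    for i in range(1, length):
        for a in ALPHABET:
            row_total = sum(trans[i][a].values())
            for b in ALPHABET:
                trans[i][a][b] /= row_total
    return first, trans


def mm1_features(seq, model):
    """Map a sequence to one probability per position (the SVM input vector)."""
    first, trans = model
    feats = [first[seq[0]]]
    feats += [trans[i][seq[i - 1]][seq[i]] for i in range(1, len(seq))]
    return feats


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree; degree and coef0 are assumed."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Toy aligned windows, invented for illustration only
train = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT"]
model = train_mm1(train)
x = mm1_features("AGGTAAGT", model)
y = mm1_features("CGGTGAGT", model)
print(len(x), poly_kernel(x, y))
```

In a full pipeline, feature vectors like `x` and `y` would be computed for labeled true and false splice sites and passed to an SVM trainer with a polynomial kernel; the kernel function above only shows how such a classifier combines the per-position probabilities nonlinearly.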

Highlights

  • Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data

  • As MM0 and WMM0 imply the same model, we refer to the integration of these two models with a support vector machine (SVM) as WMM0/MM0-SVM

  • We observed that MM1-SVM and WMM1-SVM are the best predictive models in the identification of both acceptor and donor splice sites, and the performance of WMM0/MM0-SVM is the worst


Summary

Introduction

Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth in sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important problem in bioinformatics, and it requires the prediction of the complete gene structure. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, their implementation requires estimating a large number of parameters, which is computationally expensive. It was statistically estimated that the number of genes in the human genome should be around 100,000 [2]. This difference shows that either a large number of genes are yet to be identified or there are many alternative splicing events yet to be detected [3,4]. Despite many years of intensive research in this area, the overall performance of gene prediction algorithms is still not satisfactory [5,6].
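To make the parameter-growth point concrete, the short Python sketch below counts the conditional probabilities that a Markov model of a given order needs at a single sequence position over the four-letter DNA alphabet. The function names are hypothetical, and the count is per position; a position-specific model over a window multiplies it by the number of modeled positions.

```python
def conditional_table_size(order, alphabet_size=4):
    """Entries in the table P(next | previous `order` symbols) at one position."""
    contexts = alphabet_size ** order
    return contexts * alphabet_size


def free_parameters(order, alphabet_size=4):
    """Independent parameters: each context's row must sum to 1."""
    return alphabet_size ** order * (alphabet_size - 1)


# Table size grows by a factor of 4 with each additional order
for k in range(6):
    print(k, conditional_table_size(k), free_parameters(k))
```

A first-order model needs only 16 table entries per position, whereas a fifth-order model needs 4096, which is why reliable estimation of higher-order models demands far more training data and computation.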
