Abstract

Background: Identification of splice sites is essential for annotation of genes. Although existing approaches have achieved an acceptable level of accuracy, there is still a need for further improvement. Moreover, most approaches are species-specific, so approaches that work across species are required.

Results: Each splice site sequence was transformed into a numeric vector of length 49, comprising four positional, four dependency and 41 compositional features. The transformed vectors were used as input to a support vector machine for prediction. With balanced training sets, the proposed approach achieved an area under the ROC curve (AUC-ROC) of 96.05, 96.96, 96.95 and 96.24 % and an area under the PR curve (AUC-PR) of 97.64, 97.89, 97.91 and 97.90 % when tested on the human, cattle, fish and worm datasets, respectively. With imbalanced training datasets, it achieved AUC-ROC of 97.21, 97.45, 97.41 and 98.06 % and AUC-PR of 93.24, 93.34, 93.38 and 92.29 %. The proposed approach was found comparable with state-of-the-art splice site prediction approaches when compared on the benchmark NN269 dataset and other datasets.

Conclusions: The proposed approach achieved consistent accuracy across different species and was found comparable with the existing approaches. Thus, we believe that it can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites and is freely available at http://cabgrid.res.in:8080/HSplice.
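The two metrics reported above respond differently to class imbalance, which may help in reading the balanced and imbalanced results side by side. The following is a minimal sketch, not the authors' evaluation code: it computes AUC-ROC as a rank statistic and AUC-PR as step-wise average precision on small made-up score lists. The function names, the toy scores, and the choice of average precision as the AUC-PR estimator are all assumptions made for illustration.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Ties between a positive and a negative score are not handled specially;
    the toy data below contains no such ties.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / n_pos


# Hypothetical classifier scores (illustrative only, not from the paper).
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

# Balanced test set: 3 true sites, 3 false sites.
bal_labels = [1] * 3 + [0] * 3
bal_scores = pos_scores + neg_scores

# Imbalanced test set: same positives, three times as many negatives.
imb_labels = [1] * 3 + [0] * 9
imb_scores = pos_scores + neg_scores * 3

roc_bal = auc_roc(bal_labels, bal_scores)
roc_imb = auc_roc(imb_labels, imb_scores)
ap_bal = average_precision(bal_labels, bal_scores)
ap_imb = average_precision(imb_labels, imb_scores)

print(f"AUC-ROC balanced={roc_bal:.4f} imbalanced={roc_imb:.4f}")
print(f"AUC-PR  balanced={ap_bal:.4f} imbalanced={ap_imb:.4f}")
```

In this toy setting, replicating the negatives leaves AUC-ROC unchanged but lowers AUC-PR, because precision depends on how many negatives outrank each positive in absolute terms.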

Highlights

  • Identification of splice sites is essential for annotation of genes

  • Several computational methods have been proposed for the prediction of splice sites, and those can be broadly categorized into two classes, namely, probabilistic approach and machine learning based approach [6]

  • The positional features were similar to the scores of weighted matrix model (WMM) and Shapiro-Senapathy score, whereas the dependency features were similar to the scores of earlier developed probabilistic approaches i.e., weighted array model (WAM) and SAE
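To make the comparison with WMM scores concrete, here is a minimal sketch of a generic position weight matrix score, assuming aligned, equal-length sequence windows. It is not the paper's positional feature definition: the helper names `fit_wmm` and `wmm_score`, the pseudocount, and the toy donor-site windows are hypothetical and chosen only for illustration.

```python
import math

ALPHABET = "ACGT"


def fit_wmm(sequences, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    model = []
    for pos in range(length):
        # Start every count at the pseudocount so unseen bases get nonzero probability.
        counts = {base: pseudocount for base in ALPHABET}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        model.append({base: counts[base] / total for base in ALPHABET})
    return model


def wmm_score(model, sequence):
    """Sum of log-probabilities of the observed base at each position."""
    return sum(math.log(model[pos][base]) for pos, base in enumerate(sequence))


# Made-up aligned windows with GT at positions 3-4 (illustrative only).
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
model = fit_wmm(toy_sites)

consensus_like = wmm_score(model, "CAGGTAAGT")
unlike = wmm_score(model, "TTTCCCTTT")
print(consensus_like, unlike)
```

A WMM treats positions as independent; a weighted array model (WAM) would instead condition each position on the preceding base, which is the kind of dependency the second group of features is compared to.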


Summary

Introduction

Identification of splice sites is essential for annotation of genes. Although existing approaches have achieved an acceptable level of accuracy, there is still a need for further improvement. Several computational methods have been proposed for the prediction of splice sites, and these can be broadly categorized into two classes: probabilistic approaches and machine learning based approaches [6]. Among the machine learning approaches, the support vector machine (SVM) has been used more successfully for the prediction of splice sites [4]. Baten et al. [7] generated features based on a first-order Markov model and used them as input to an SVM with a polynomial kernel for splice site prediction. Besides SVM, the naïve Bayes classifier has been successfully used by Kamath et al. [2] for the prediction of splice sites, for which an automated feature generation program was developed.

