Abstract
BackgroundDetection of splice sites plays a key role for predicting the gene structure and thus development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on the adjacent di-nucleotide dependencies in which the donor splice site motifs are encoded into numeric vectors. The encoded vectors are then used as input in Random Forest (RF), Support Vector Machines (SVM) and Artificial Neural Network (ANN), Bagging, Boosting, Logistic regression, kNN and Naïve Bayes classifiers for prediction of donor splice sites.ResultsThe performance of the proposed approach is evaluated on the donor splice site sequence data of Homo sapiens, collected from Homo Sapiens Splice Sites Dataset (HS3D). The results showed that RF outperformed all the considered classifiers. Besides, RF achieved higher prediction accuracy than the existing methods viz., MEM, MDD, WMM, MM1, NNSplice and SpliceView, while compared using an independent test dataset.ConclusionBased on the proposed approach, we have developed an online prediction server (MaLDoSS) to help the biological community in predicting the donor splice sites. The server is made freely available at http://cabgrid.res.in:8080/maldoss. Due to computational feasibility and high prediction accuracy, the proposed approach is believed to help in predicting the eukaryotic gene structure.Electronic supplementary materialThe online version of this article (doi:10.1186/s13040-016-0086-4) contains supplementary material, which is available to authorized users.
Highlights
Detection of splice sites plays a key role for predicting the gene structure and development of efficient analytical methods for splice site prediction is vital
The downloaded dataset contains a total of 2796 True donor Splice Sites (TSS) and 90924 False donor Splice Site (FSS)
It is seen that none of the existing approaches achieved above 90 % True Positive Rate (TPR)
Summary
Detection of splice sites plays a key role for predicting the gene structure and development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on the adjacent di-nucleotide dependencies in which the donor splice site motifs are encoded into numeric vectors. Prediction of gene structures is one of the important tasks in genome sequencing projects, and the prediction of exon-intron boundaries or splice sites (ss) is crucial for predicting the structures of genes in eukaryotes. It has been established that accurate prediction of eukaryotic gene structure highly depends upon the ability to accurately identify the ss. The ss at the exon-intron boundaries are called the donor (5′) ss whereas intron-exon boundaries are called the acceptor (3′) ss. The donor and acceptor ss with consensus GT (at intron-start) and AG (at intron-end) respectively are known as canonical ss (GT-AG type; Fig. 1). As GT-and AG-are conserved in donor and acceptor ss respectively, every GT
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.