Abstract

Gene identification has become an increasingly important task since the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so developing new methods that detect splice sites accurately is crucial. Machine learning classifiers are used to detect splice sites, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods provide useful biological knowledge by removing redundant information. According to the literature, Markovian models are popular encoding methods, and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, we investigated the performance of RF for both feature selection and classification in the splice site domain. We propose three methods, MM1-RF, MM2-RF and MCM-RF, which combine RF with the first-order Markov model (MM1), the second-order Markov model (MM2), and the Markov chain model (MCM), respectively. We compared the performance of RF against the SVM on the HS3D and NN269 benchmark datasets. We also evaluated the proposed methods against current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM on both donor and acceptor datasets when the same Markovian encoding methods are used. Furthermore, the RF classifier is much faster than the SVM classifier at detecting splice sites.
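To make the idea of a Markovian encoding concrete, the following is a minimal sketch of one common formulation of MM1 features: each fixed-length sequence window is mapped to a vector of position-specific conditional probabilities P(s_i | s_{i-1}) estimated from training sequences. The abstract does not give the exact formulation, so the function names `fit_mm1` and `encode_mm1`, the pseudocount smoothing, and the toy sequences are all assumptions made for illustration, not the authors' implementation.

```python
ALPHABET = "ACGT"


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    Hypothetical helper: returns (first, trans), where
    first[b] approximates P(s_0 = b) and
    trans[i][(a, b)] approximates P(s_i = b | s_{i-1} = a) for i >= 1.
    All sequences are assumed to be aligned and of equal length.
    """
    length = len(sequences[0])
    # Start every count at the pseudocount so unseen events keep nonzero mass.
    first_counts = {b: pseudocount for b in ALPHABET}
    trans_counts = [
        {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for _ in range(length)
    ]
    for seq in sequences:
        first_counts[seq[0]] += 1
        for i in range(1, length):
            trans_counts[i][(seq[i - 1], seq[i])] += 1

    total_first = sum(first_counts.values())
    first = {b: c / total_first for b, c in first_counts.items()}

    # trans[0] is unused (position 0 has no predecessor) and stays None.
    trans = [None] * length
    for i in range(1, length):
        trans[i] = {}
        for a in ALPHABET:
            row_total = sum(trans_counts[i][(a, b)] for b in ALPHABET)
            for b in ALPHABET:
                trans[i][(a, b)] = trans_counts[i][(a, b)] / row_total
    return first, trans


def encode_mm1(seq, first, trans):
    """Map one sequence to a numeric feature vector, one value per position."""
    features = [first[seq[0]]]
    for i in range(1, len(seq)):
        features.append(trans[i][(seq[i - 1], seq[i])])
    return features


# Toy usage with made-up donor-like windows (not real HS3D or NN269 data).
training = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA"]
first, trans = fit_mm1(training)
vector = encode_mm1("AGGTAAGT", first, trans)
print(len(vector))
```

Vectors produced this way could then be passed to any classifier that accepts numeric features, for example a random forest or an SVM from a library such as scikit-learn; that pairing is what names like MM1-RF and MM1-SVM suggest, though the precise pipeline is not described in the abstract.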
