Abstract
In this paper we present preliminary results from a novel application of Markov Models and Support Vector Machines (SVMs) to the classification of exon-intron (5') and intron-exon (3') splice sites. We use Markov-based statistical methods, in a log-likelihood discriminator framework, to create a non-summed, fixed-length feature vector for SVM-based classification. We also explore Shannon-entropy-based analysis for the automated identification of minimal-size models (where smaller models have a known information loss according to the specified Shannon entropy representation). We evaluate a variety of kernels and kernel parameters in the classification effort. For comparison, we present results of the algorithms on splice-site datasets consisting of sequences from a variety of species.
Highlights
We are exploring hybrid methods where Markov-based statistical profiles, in a log-likelihood discriminator framework, are used to create a fixed-length feature vector for Support Vector Machine (SVM) based classification.
This analysis was critical to identifying information-rich sequence regions around the splice-site locations; these regions are used to define the positional range of the positionally defined Markov Models (pMMs) needed in the SVM classification that follows.
The positions identified in the low-entropy (lEnt) regions are found to carry information about the splice site that a trained SVM can classify with high accuracy.
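The idea of locating low-entropy positions around a splice site can be illustrated with a minimal sketch. It assumes pre-aligned, equal-length DNA sequences over the alphabet ACGT. The function names, the toy sequences, and the 1.0-bit threshold are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def positional_entropy(seqs, alphabet="ACGT"):
    """Shannon entropy (in bits) of the base distribution at each aligned position."""
    length = len(seqs[0])
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in alphabet)
        h = 0.0
        for b in alphabet:
            p = counts[b] / total if total else 0.0
            if p > 0:
                h -= p * math.log2(p)
        entropies.append(h)
    return entropies

def low_entropy_positions(entropies, threshold):
    """Indices whose entropy falls below the (assumed) threshold."""
    return [i for i, h in enumerate(entropies) if h < threshold]

# Toy alignment: position 0 is uniform, positions 1-3 are fully conserved.
toy = ["AGGT", "CGGT", "GGGT", "TGGT"]
ent = positional_entropy(toy)
selected = low_entropy_positions(ent, threshold=1.0)
```

In this toy case the conserved positions are the ones selected. A real analysis would need a principled threshold and far more sequences.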
Summary
Introduction and background
We are exploring hybrid methods where Markov-based statistical profiles, in a log-likelihood discriminator framework, are used to create a fixed-length feature vector for Support Vector Machine (SVM) based classification. The individual-observation log odds ratios are themselves constructed from positionally defined Markov Models (pMMs), so what results is a pMM/SVM sensor method. This method may have utility in a number of actively researched areas of stochastic sequential analysis, including splice-site recognition and other types of gene-structure identification, file recovery in computer forensics ('file carving'), and speech recognition. We test our pMM/SVM method on an interesting discrimination problem in gene-structure identification: splice-site recognition. In this situation the pMM/SVM approach evaluates the log odds ratio of an observed stochastic sequence, under the splice-site and non-splice-site hypotheses, by Chow expansion decomposition, keeping the log odds ratios of the conditional probabilities on individual observations as a vector rather than summing them.
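The difference between summing and vectorizing the per-position log odds ratios can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a first-order positional Markov model, add-one pseudocounts, and equal-length sequences over ACGT. The names `train_pmm` and `log_odds_vector` and the toy training sets are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_pmm(seqs, pseudocount=1.0):
    """First-order positional Markov model: one conditional table per position.

    Entry i maps (prev_base, base) -> P_i(base | prev_base).
    Position 0 has no predecessor, so prev_base is "" there.
    """
    length = len(seqs[0])
    model = []
    for i in range(length):
        counts = defaultdict(float)
        for s in seqs:
            prev = s[i - 1] if i > 0 else ""
            counts[(prev, s[i])] += 1.0
        contexts = [""] if i == 0 else list(ALPHABET)
        table = {}
        for prev in contexts:
            total = sum(counts[(prev, b)] for b in ALPHABET)
            total += pseudocount * len(ALPHABET)
            for b in ALPHABET:
                table[(prev, b)] = (counts[(prev, b)] + pseudocount) / total
        model.append(table)
    return model

def log_odds_vector(seq, pos_model, neg_model):
    """One log odds ratio per position, kept as a vector instead of summed."""
    vec = []
    for i, base in enumerate(seq):
        prev = seq[i - 1] if i > 0 else ""
        ratio = pos_model[i][(prev, base)] / neg_model[i][(prev, base)]
        vec.append(math.log(ratio))
    return vec

# Toy training sets (hypothetical): "true" sites versus background windows.
pos_model = train_pmm(["AGGT", "CGGT", "AGGT"])
neg_model = train_pmm(["ACCA", "TTGA", "GCAT"])

site_vec = log_odds_vector("AGGT", pos_model, neg_model)
background_vec = log_odds_vector("ACCA", pos_model, neg_model)
```

Summing `site_vec` gives the usual scalar log-likelihood-ratio score. Passing the unsummed vector to an SVM instead lets the classifier weight positions individually, which is the design choice the text describes.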