Abstract
Background: Splice site prediction is a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction achieve satisfactory accuracy, further improvement remains important because it contributes to more accurate gene structure prediction. A proper window size must be determined before prediction. An overly long window may introduce irrelevant features that reduce predictive accuracy, whereas a short window carrying maximum information may perform better in both predictive accuracy and time cost. Furthermore, the number of false splice sites following the GT–AG rule far exceeds the number of true splice sites, so accurate and rapid prediction of splice sites from large imbalanced samples has always been a challenge. Therefore, based on a short window size and large imbalanced samples, we developed a new computational method named chi-square decision table (χ2-DT) for donor splice site prediction.
Results: Using a short window size of 11 bp, χ2-DT extracts improved positional and compositional features based on the chi-square test, introduces features one by one based on information gain, and constructs a balanced decision table for imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ2-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and requires a short computation time (89 s). χ2-DT also exhibits good independent test accuracy (92.40%) when validated with BG-570 mutated sequences containing frameshift errors (nucleotide insertions and deletions). Moreover, χ2-DT is compared with long-window and short-window methods and is found to outperform all of them in predictive accuracy.
Conclusions: Based on a short window size and large imbalanced samples, the proposed method not only achieves higher predictive accuracy than some existing methods but also offers high computational speed and good robustness against nucleotide insertions and deletions.
Reviewers: This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.
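To make the idea of chi-square-based positional features concrete, the following is a minimal sketch, not the authors' implementation. It computes a Pearson chi-square statistic for the 2 x 4 table (class x nucleotide) at each position of an 11 bp window and ranks positions by that score. The function names, the toy sequences, and the placement of GT at window indices 3 and 4 are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def chi_square_at_position(true_windows, false_windows, pos):
    """Pearson chi-square statistic for the class x nucleotide table at one position."""
    observed = [
        Counter(w[pos] for w in true_windows),
        Counter(w[pos] for w in false_windows),
    ]
    row_totals = [sum(row[n] for n in NUCLEOTIDES) for row in observed]
    col_totals = {n: observed[0][n] + observed[1][n] for n in NUCLEOTIDES}
    grand_total = sum(row_totals)

    stat = 0.0
    for row, row_total in zip(observed, row_totals):
        for n in NUCLEOTIDES:
            expected = row_total * col_totals[n] / grand_total
            if expected > 0:  # skip nucleotides absent from both classes
                stat += (row[n] - expected) ** 2 / expected
    return stat


def rank_positions(true_windows, false_windows):
    """Return (score, position) pairs, most discriminative position first."""
    width = len(true_windows[0])
    scores = [
        (chi_square_at_position(true_windows, false_windows, p), p)
        for p in range(width)
    ]
    return sorted(scores, reverse=True)


# Toy 11 bp windows (hypothetical data); GT sits at indices 3-4 in every window.
true_windows = ["CAGGTAAGTAT", "AAGGTGAGTCC", "CAGGTAAGTGA", "GAGGTGAGTTT"]
false_windows = ["TCTGTCCTTAG", "ACAGTTTCCAG", "GTCGTCACATC", "TTAGTACCGTA"]

ranking = rank_positions(true_windows, false_windows)
for score, pos in ranking:
    print(f"position {pos:2d}: chi-square = {score:.2f}")
```

The invariant GT positions score zero because both classes share the same nucleotide there, while flanking positions that differ between true and decoy sites score higher. A real pipeline would also need a significance threshold and the compositional features the abstract mentions, which this sketch omits.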
Highlights
Splice site prediction has been a long-standing problem in bioinformatics
Advantage of the short window size of 11 bp: independent tests based on the Homo Sapiens Splice Sites Dataset (HS3D)-train1:1 and HS3D-test1:1 were performed to compare the performance of the chi-square decision table (χ2-DT) using various window sizes
χ2-DT is clearly superior to the methods it was compared with
Summary
Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches developed for splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy remains important because it contributes to more accurate gene structure prediction. Based on a short window size and large imbalanced samples, we developed a new computational method named chi-square decision table (χ2-DT) for donor splice site prediction. Results: Using a short window size of 11 bp, χ2-DT extracts improved positional and compositional features based on the chi-square test, introduces features one by one based on information gain, and constructs a balanced decision table for imbalanced pattern classification.
If splice sites can be detected accurately, the coding regions of DNA sequences can be located, so splice site prediction plays a key role in gene identification. Almost 99% of splice sites are canonical GT–AG pairs [2], that is, the dinucleotides GT and AG mark donor and acceptor splice sites, respectively. The result is an extremely imbalanced classification task: discriminating a small number of true splice sites from a much larger volume of decoy positions that carry the same GT and AG dinucleotides [3].
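The source of the imbalance can be seen by enumerating candidates: every GT in a sequence is a potential donor site, and almost all of them are decoys. The sketch below collects an 11 bp window around each GT. The function name, the example sequence, and the split of 3 upstream and 6 downstream bases are hypothetical choices for illustration, since this excerpt does not state where GT sits within the window.

```python
def candidate_donor_windows(sequence, upstream=3, downstream=6):
    """Collect (index_of_G, window) for every GT that has complete flanks.

    The window length is upstream + 2 + downstream, which is 11 bp with
    the default (assumed) flank sizes.
    """
    seq = sequence.upper()
    windows = []
    # The last valid GT start leaves room for the T and the downstream flank.
    for i in range(upstream, len(seq) - 1 - downstream):
        if seq[i:i + 2] == "GT":
            windows.append((i, seq[i - upstream:i + 2 + downstream]))
    return windows


# Hypothetical sequence; the GT near the 3' end is skipped for lack of flank.
example = "ACGGTAAGTCCTAGGTTTGACGTACCA"
candidates = candidate_donor_windows(example)
for index, window in candidates:
    print(index, window)
```

Each returned window would then be labeled true or false against an annotation and passed to a classifier. On genomic sequence, the false label dominates heavily.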