A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data.

Prabina Kumar Meher, Tanmaya Kumar Sahu, Sant Dass Wahi, Atmakuri Ramakrishna Rao

doi:10.1186/s12859-014-0362-6


Open Access

https://doi.org/10.1186/s12859-014-0362-6


Abstract

Background: Most approaches for splice site prediction are based on machine learning techniques. Although these approaches provide high prediction accuracy, they use long window lengths. Hence, they may not be suitable for predicting novel splice variants from the short sequence reads generated by next generation sequencing technologies. Further, machine learning techniques require numerically encoded data and produce different accuracies with different encoding procedures. Therefore, splice site prediction with short sequence motifs and without encoding sequence data was the motivation for the present study.

Results: An approach for finding associations among nucleotide bases in splice site motifs was developed and then used to determine the appropriate window size. In addition, an approach for predicting donor splice sites using a sum of absolute error criterion is proposed. The proposed approach was compared with commonly used approaches, i.e., Maximum Entropy Modeling (MEM), Maximal Dependency Decomposition (MDD), Weighted Matrix Method (WMM) and Markov Model of first order (MM1), and was found to perform on par with MEM and MDD and better than WMM and MM1 in terms of prediction accuracy.

Conclusions: The proposed prediction approach can be used to predict donor splice sites with higher accuracy using short sequence motifs, and hence can serve as a complementary method to the existing approaches. Based on the proposed methodology, a web server was also developed for easy prediction of donor splice sites by users and is available at http://cabgrid.res.in:8080/sspred.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0362-6) contains supplementary material, which is available to authorized users.

Highlights

Most of the approaches for splice site prediction are based on machine learning techniques
Associations among nucleotides: we propose an approach for finding associations among nucleotides in the splice site motifs, explained as follows. Consider a sequence dataset having $N$ sequences of equal length $P$, and let $S_k = (x_{1k}, x_{2k}, \ldots, x_{Pk})$, with $x_{ik} \in \{A, T, G, C\}$ for all $i = 1, 2, \ldots, P$, be the $k$th sequence
Most of the associations are found between 29–64 units, which corresponds to positions 8–16 of the 20 positions considered in the motif
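
The notation in the highlights (N sequences of equal length P over the alphabet {A, T, G, C}) can be made concrete with a small sketch. The code below is only an illustration of that data layout: it tabulates base counts per position and, as a stand-in for the paper's own association measure (which is not reproduced here), computes a plain Pearson chi-square statistic between two positions. The function names `position_counts` and `chi_square` and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ATGC"  # alphabet assumed from the highlight: {A, T, G, C}


def position_counts(seqs):
    """Count each base at each of the P positions across N equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length P")
    return [Counter(s[i] for s in seqs) for i in range(length)]


def chi_square(seqs, i, j):
    """Pearson chi-square statistic for the 4x4 table of bases at positions i and j.

    Positions are 0-based. This is a generic stand-in, NOT the association
    measure defined in the paper.
    """
    n = len(seqs)
    pair = Counter((s[i], s[j]) for s in seqs)  # observed joint counts
    row = Counter(s[i] for s in seqs)           # marginal counts at position i
    col = Counter(s[j] for s in seqs)           # marginal counts at position j
    stat = 0.0
    for a, b in product(BASES, repeat=2):
        expected = row[a] * col[b] / n
        if expected > 0:  # skip cells whose marginal count is zero
            stat += (pair[(a, b)] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    toy = ["AT", "AT", "GC", "GC"]  # hypothetical data: N = 4, P = 2
    print(position_counts(toy)[0])
    print(chi_square(toy, 0, 1))
```

A larger statistic suggests stronger dependence between the two positions; in the toy data the second base is fully determined by the first, whereas a set such as `["AT", "AC", "GT", "GC"]` gives a statistic of zero.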

Summary

Introduction

Most approaches for splice site prediction are based on machine learning techniques. Although these approaches provide high prediction accuracy, they use long window lengths and may therefore not be suitable for predicting novel splice variants from the short sequence reads generated by next generation sequencing technologies. To utilize short reads generated by next generation sequencing technology for transcriptome sequencing and gene structure identification, one needs to align the sequence reads accurately over intron boundaries, and splice site prediction helps to improve the alignment quality [3]. A methodology is therefore required to predict splice variants using short reads or sequences with a short window size.

Methods

Results

Discussion

Conclusion