iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder

Zhao-Chun Xu, Xuan Xiao, Peng Wang, Wang-Ren Qiu

doi:10.1038/s41598-017-08523-8

Abstract

Gene splicing is one of the most significant biological processes in eukaryotic gene expression; in RNA splicing, for example, a pre-mRNA gives rise to one or more mature messenger RNAs that carry coding information for multiple biological functions. Identifying splicing sites in DNA/RNA sequences is therefore important for both biomedical research and the discovery of new drugs. However, identifying them by experimental techniques alone is expensive and time-consuming, so new computational methods are needed. To identify splice donor sites and splice acceptor sites accurately and quickly, we constructed a deep sparse auto-encoder model with two hidden layers, called iSS-PC, based on the minimum error law. In this model, twelve physical-chemical properties of the dinucleotides within DNA are incorporated into PseDNC via a battery of cross-covariance and auto-covariance transformations to formulate the given sequence samples. Five-fold cross-validation results on the same benchmark datasets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. It is also expected that many other related problems can be studied with this approach. An easy-to-use web server for identifying splicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.

Highlights

The pre-messenger RNA (pre-mRNA), including exons and one or more introns, is transcribed from a eukaryotic gene’s DNA template.
Encouraged by the above successes of introducing this feature extraction approach into computational proteomics, we use twelve physical-chemical properties of the dinucleotides within DNA, via a battery of cross-covariance and auto-covariance transformations, to obtain a mode of pseudo dinucleotide composition (PseDNC) to formulate given sequence samples.
Feature extraction is the key problem in bioinformatics research.

Summary

Introduction

The pre-mRNA, including exons and one or more introns, is transcribed from a eukaryotic gene’s DNA template. In 2016, M. Iqbal et al. [7] used the PseTNC and PseTetraNC methods to propose a hybrid prediction model, called iSS-Hyb-mRMR, for identifying splice sites, and Prabina Kumar Meher [8] used a hybrid feature extraction approach containing positional, dependency and compositional features to develop a predictor called HSplice for predicting donor splice sites in eukaryotic genes. Encouraged by the above successes of introducing this feature extraction approach into computational proteomics, we use twelve physical-chemical properties of the dinucleotides within DNA, via a battery of cross-covariance and auto-covariance transformations, to obtain a mode of PseDNC to formulate given sequence samples.
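
The cross-covariance and auto-covariance transformations mentioned above can be illustrated with a minimal Python sketch. It assumes the common lagged-covariance formulation over the overlapping dinucleotides of a sequence. The property tables (`prop_a`, `prop_b`), the function names, and the maximum lag are hypothetical placeholders chosen for illustration, not the twelve physical-chemical properties or the settings used in iSS-PC.

```python
from itertools import product

# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Hypothetical placeholder property tables (NOT the values used in the paper).
TOY_PROPS = {
    "prop_a": {d: float(i) for i, d in enumerate(DINUCS)},
    "prop_b": {d: float((i * 7) % 16) for i, d in enumerate(DINUCS)},
}


def property_series(seq, table):
    """Map each overlapping dinucleotide of seq to its property value."""
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def covariance(x, y, lag):
    """Lagged covariance of two equal-length series.

    Passing the same series twice gives the auto-covariance;
    passing two different series gives the cross-covariance.
    """
    n = len(x) - lag
    if n <= 0:
        raise ValueError("lag is too large for this sequence")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((x[i] - mean_x) * (y[i + lag] - mean_y) for i in range(n)) / n


def acc_features(seq, props, max_lag):
    """Concatenate auto-covariance and cross-covariance features."""
    names = sorted(props)
    series = {name: property_series(seq, props[name]) for name in names}
    feats = []
    # Auto-covariance: one value per property and lag.
    for lag in range(1, max_lag + 1):
        for u in names:
            feats.append(covariance(series[u], series[u], lag))
    # Cross-covariance: one value per ordered pair of distinct properties and lag.
    for lag in range(1, max_lag + 1):
        for u in names:
            for v in names:
                if u != v:
                    feats.append(covariance(series[u], series[v], lag))
    return feats


features = acc_features("ACGTACGTAC", TOY_PROPS, max_lag=2)
print(len(features))
```

With N properties and a maximum lag of G, this sketch produces N*G auto-covariance terms plus N*(N-1)*G cross-covariance terms. The input is assumed to be an uppercase sequence containing only A, C, G and T.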

Methods

Results

Conclusion