Abstract
The long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial role in a variety of biological processes and complex diseases. Emerging evidences suggest that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNA. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new sequence distributed representation learning based method for potential lncRNA-Protein Interactions Prediction, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segmentation, which can be regard as “word” in natural language processing. Then, we trained out the RNA2vec and Pro2vec model using word2vec and human genome-wide lncRNA and protein sequences to mine distribution representation of RNA and protein. Then, the dimension of complex features is reduced by using feature selection based on Gini information impurity measure. Finally, these discriminative features are used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets, including RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
Highlights
The emerging recognition of RNA is that any transcripts, regardless of protein coding potential, can have intrinsic functions [1]
This paper aims to develop a new sequence distributed representation learning based method for novel long non-coding RNA (lncRNA)-Protein Interactions Prediction, named LPI-Pred, which is inspired by the similarity between biological sequences and natural languages [22]
We proposed a novel lncRNA-protein interactions prediction model LPI-Pred, based on sequence distributed representation learning and Gini information impurity measure
Summary
The emerging recognition of RNA is that any transcripts, regardless of protein coding potential, can have intrinsic functions [1]. One kind of this transcripts that are no less than 200 nucleotides, known as long non-coding RNA (lncRNA). Existing studies demonstrate that only less than 2% of the human genome can be translated into proteins, whereas more than 80% of it has biochemical functions [2,3]. The lncRNA often act through functions by binding to partner proteins, and play critical roles in gene regulation, splicing, translation, chromatin modification and poly-adenylation [5,6,7,8]. Emerging evidences have revealed that various complex diseases have strong correlation with lncRNAs, such as Alzheimer [9], lung cancer
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: Computational and Structural Biotechnology Journal
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.