Abstract

Recent studies show that the subcellular location of long non-coding RNAs (lncRNAs) provides significant information about their function. Because experimental data are scarce, the number of lncRNAs with experimentally verified subcellular localization is very limited, and the numbers of lncRNAs located in different organelles are highly imbalanced. Predicting the subcellular location of lncRNAs is therefore a multi-class, small-sample, imbalanced classification problem. This imbalance causes machine learning models to recognize small classes poorly, which remains a challenging problem in existing research. In this study, we integrate multi-source features to construct a sequence-based computational tool, lncLocation, to predict the subcellular location of lncRNAs. An autoencoder is used to enhance part of the features, and a binomial distribution-based filtering method and recursive feature elimination (RFE) are used to filter some of the features. Together, these steps improve the representation ability of the data and mitigate the problem of imbalanced multi-class data. Through comprehensive experiments on different feature combinations and machine learning models, we select the optimal feature set and classifier for lncLocation. On the benchmark data, lncLocation achieves 87.78% accuracy under 5-fold cross-validation, which is higher than state-of-the-art tools, and classification performance improves significantly, especially for small classes.
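To illustrate why overall accuracy alone can hide poor recognition of small classes, the following is a minimal sketch of stratified 5-fold splitting and per-class recall in plain Python. The class names, the class counts, and the helper functions `stratified_folds`, `accuracy`, and `per_class_recall` are hypothetical and chosen only for illustration; they are not taken from the lncLocation benchmark or code.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so each class is spread across all folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall computed separately for every class present in y_true."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    return {c: hit[c] / total[c] for c in total}


# Toy, imbalanced label set (assumed counts, not the real benchmark).
labels = ["cytoplasm"] * 60 + ["nucleus"] * 30 + ["ribosome"] * 10
folds = stratified_folds(labels, n_folds=5)
ribosome_per_fold = [sum(labels[i] == "ribosome" for i in fold) for fold in folds]

# A trivial classifier that always predicts the majority class.
majority_pred = ["cytoplasm"] * len(labels)
overall = accuracy(labels, majority_pred)
recalls = per_class_recall(labels, majority_pred)
```

In this toy setting the majority-class predictor reaches a moderate overall accuracy while its recall on the two smaller classes is zero, which is the kind of failure that per-class metrics expose and that stratified folds make measurable in every split.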

Highlights

  • 2% of the transcriptional products are translated into proteins, and the remaining 98% are non-coding RNAs

  • The research on non-coding RNAs mainly focuses on micro RNAs, circular RNAs, small interfering RNAs, PIWI-interacting RNAs, and long non-coding RNAs

  • Cells are divided into different organelles, each with its own division of labor and responsible for different cellular activities; information on the subcellular localization of long non-coding RNAs (lncRNAs) can therefore help reveal their function


Summary

Introduction

2% of the transcriptional products are translated into proteins, and the remaining 98% are non-coding RNAs. Few computational methods exist for predicting the subcellular localization of lncRNAs: the multi-class predictors lncLocator and iLoc-lncRNA, which cover five and four subcellular localization regions, respectively, and the binary classifier DeepLncRNA, which covers two. We propose lncLocation, a novel computational tool based on multi-source heterogeneous feature fusion, to predict the subcellular location of lncRNAs. To further improve the representation and reduce the impact of data imbalance, the framework integrates deep feature learning based on an autoencoder with hybrid feature selection based on recursive feature elimination and binomial distribution filtering. For convenience, an online web server has been developed for researchers to use.
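The binomial distribution filtering mentioned above can be sketched as follows. This summary does not give the exact formula, so the code assumes one common formulation used in sequence-based predictors: for a k-tuple that occurs n_i times overall and n_ij times in class j, with class prior q_j, the confidence level is 1 - P(X >= n_ij) for X ~ Binomial(n_i, q_j). The k-tuple names, the counts, and the functions `binomial_confidence` and `rank_features` are hypothetical.

```python
from math import comb


def binomial_confidence(n_ij, n_i, q_j):
    """Confidence that a feature is enriched in class j beyond chance.

    Returns 1 - P(X >= n_ij) with X ~ Binomial(n_i, q_j).
    """
    p_tail = sum(
        comb(n_i, m) * q_j ** m * (1 - q_j) ** (n_i - m)
        for m in range(n_ij, n_i + 1)
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - p_tail)


def rank_features(counts):
    """Rank features by their best confidence level over all classes.

    counts[feature][cls] is the number of occurrences of the feature
    in sequences of class cls.
    """
    classes = sorted({c for per_class in counts.values() for c in per_class})
    class_totals = {
        c: sum(per_class.get(c, 0) for per_class in counts.values())
        for c in classes
    }
    grand_total = sum(class_totals.values())
    scores = {}
    for feature, per_class in counts.items():
        n_i = sum(per_class.values())
        scores[feature] = max(
            binomial_confidence(per_class.get(c, 0), n_i, class_totals[c] / grand_total)
            for c in classes
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Toy k-tuple counts per class (assumed values, for illustration only).
toy_counts = {
    "AAGT": {"nucleus": 18, "cytoplasm": 2},
    "CGCG": {"nucleus": 10, "cytoplasm": 10},
    "TTAC": {"nucleus": 3, "cytoplasm": 17},
}
ranked, scores = rank_features(toy_counts)
```

Features whose occurrences are concentrated in one class receive a confidence level close to 1, while evenly distributed features score low and would be dropped first. A wrapper method such as recursive feature elimination could then be applied to the retained features; that step is not shown here.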

The Effectiveness of Different Features
Method
Materials and Methods
Benchmark Dataset
K-Tuple Features
Basic lncRNA Features
Physicochemical Properties
Multi-Scale Secondary Structures
Feature Learning and Selection
Model Selection
Performance Evaluation