FexRNA: Exploratory Data Analysis and Feature Selection of Non-Coding RNA.

Noorul Amin,Yi-Ping Phoebe Chen,Annette Mcgrath

doi:10.1109/tcbb.2021.3057128

Abstract

Non-coding RNA (ncRNA) is involved in many biological processes and diseases in all species. Many ncRNA datasets exist that provide ncRNA data in FASTA format which is well suited for biomedical purposes. However, for ncRNA analysis and classification, statistical learning methods require hidden numerical features from the data. Furthermore, in the literature, a wealth of sequence intrinsic features has been proposed for ncRNA identification. The extraction of hidden features, their analysis, and usage of a suitable set of features is crucial for the performance of any statistical learning method. To alleviate the posed challenges, we generated 96 feature datasets from ncRNA widely used features. The feature datasets are based on RNACentral and consist of species, ncRNA types, and expert databases that are available on the FexRNA platform. Additionally, the feature datasets are explored and analysed to provide statistical information, univariate, and bivariate analysis. We sought to determine which of these 17 features would be most appropriate to use in developing ncRNA classification approaches. For feature selection (FS), a two-phase hierarchical FS framework based on correlation and majority voting is proposed and evaluated on 5 species. The FexRNA platform provides information about ncRNA feature analysis and selection.

Full Text