Machine Learning-Based Ensemble Recursive Feature Selection of Circulating miRNAs for Cancer Tumor Classification.

Alejandro Lopez-Rincon, Lucero Mendoza-Maldonado, Johan Garssen, Alexander Schönhuth, Aletta D Kraneveld, Marlet Martinez-Archundia, Alberto Tonda

doi:10.3390/cancers12071785

Abstract

Circulating microRNAs (miRNAs) are small noncoding RNA molecules that can be detected in bodily fluids without the need for major invasive procedures on patients. miRNAs have shown great promise as tumor biomarkers, both to assess the presence of a tumor and to predict its type and subtype. Recently, thanks to the availability of miRNA datasets, machine learning techniques have been successfully applied to tumor classification. The results, however, are difficult for medical experts to assess and interpret because the algorithms exploit information from thousands of miRNAs. In this work, we propose a novel technique that aims to reduce the necessary information to the smallest possible set of circulating miRNAs. The dimensionality reduction achieved reflects a very important first step in a potential, clinically actionable, circulating miRNA-based precision medicine pipeline. While it is currently under discussion whether this first step can be taken, we demonstrate here that it is possible to perform classification tasks by exploiting a recursive feature elimination procedure that integrates a heterogeneous ensemble of high-quality, state-of-the-art classifiers on circulating miRNAs. Heterogeneous ensembles can compensate for the inherent biases of individual classifiers by using different classification algorithms. Selecting features then further eliminates biases emerging from using data from different studies or batches, yielding more robust and reliable outcomes. The proposed approach is first tested on a tumor classification problem, separating 10 different types of cancer with samples collected over 10 different clinical trials, and is later assessed on a cancer subtype classification task, with the aim of distinguishing triple-negative breast cancer from other subtypes of breast cancer. Overall, the presented methodology proves to be effective and compares favorably to other state-of-the-art feature selection methods.

Highlights

MicroRNAs are noncoding RNA molecules of 18–25 nucleotides in length that regulate the expression of more than one third of human genes [1,2].
While a similar technique was presented in [21,22], the approach we propose features several improvements and important innovations that set it apart from previous contributions: (i) previous works did not select for circulating miRNAs, and the resulting signatures could not be measured in clinical practice; (ii) previous techniques needed extra parameters to be defined by the user, while the novel approach we propose does not require users to arbitrarily set values for thresholds; and (iii) the amount of data used in the experimental verification is greatly increased, reaching a total of 16 Gene Expression Omnibus (GEO) datasets.
Using the feature selection algorithm, we reduced the original 253 miRNAs to 5, while maintaining an average accuracy of 90% across the selected classifiers (Figure 2).
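
The idea of recursively discarding features according to the combined judgment of several different models can be illustrated with a minimal sketch. It is not the authors' implementation: the two scorers below (a class-mean gap and a one-threshold decision stump) are simple stand-ins for the full classifiers of the ensemble, the halving schedule and the vote-based tie-break are assumptions, the function names are hypothetical, and the data are synthetic.

```python
import random
from statistics import mean, pstdev

def centroid_gap_scores(X, y, features):
    """Linear-style score: |difference of class means| / summed class spread."""
    scores = {}
    for f in features:
        a = [row[f] for row, label in zip(X, y) if label == 0]
        b = [row[f] for row, label in zip(X, y) if label == 1]
        scores[f] = abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)
    return scores

def stump_scores(X, y, features):
    """Tree-style score: best accuracy of a single-threshold decision stump."""
    scores = {}
    n = len(y)
    for f in features:
        best = 0.0
        for t in sorted({row[f] for row in X}):
            hits = sum((row[f] > t) == (label == 1) for row, label in zip(X, y))
            best = max(best, hits / n, 1 - hits / n)  # allow either orientation
        scores[f] = best
    return scores

def ensemble_rfe(X, y, scorers, n_keep):
    """Halve the feature set each round, keeping the features that the most
    scorers place in their top half (assumed schedule, for illustration)."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        target = max(n_keep, len(features) // 2)
        votes = {f: 0 for f in features}
        rank_sum = {f: 0 for f in features}
        for scorer in scorers:
            s = scorer(X, y, features)
            ranked = sorted(features, key=lambda f: -s[f])
            for pos, f in enumerate(ranked):
                rank_sum[f] += pos
                if pos < target:
                    votes[f] += 1
        # Most votes first; ties broken by the better (lower) summed rank.
        features = sorted(features, key=lambda f: (-votes[f], rank_sum[f]))[:target]
    return sorted(features)

# Synthetic two-class data: 40 noise features, of which only 3 and 17 are shifted.
random.seed(0)
y = [i % 2 for i in range(80)]
X = []
for label in y:
    row = [random.gauss(0.0, 1.0) for _ in range(40)]
    if label == 1:
        row[3] += 4.0
        row[17] += 4.0
    X.append(row)

selected = ensemble_rfe(X, y, [centroid_gap_scores, stump_scores], n_keep=2)
print(selected)
```

In this toy setting the two shifted features dominate both scorers, so they survive every halving round. With real expression data, each scorer would be replaced by the feature ranking of a trained classifier, and the accuracy of the ensemble on the reduced set would be checked at every step.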

Summary

Introduction

MicroRNAs (miRNAs) are noncoding RNA molecules of 18–25 nucleotides in length that regulate the expression of more than one third of human genes [1,2]. Since the discovery of the first miRNA in Caenorhabditis elegans [3], these molecules have been found in many organisms and tissue types. The pre-miRNA is cleaved by the Dicer/TRBP complex to produce miRNA that represses or degrades its target mRNAs [7,8]. This machinery is altered in cancer cells, perturbing miRNA expression and accelerating the process of tumorigenesis. Because the histological examination of tissues is an invasive and comparatively risky procedure, studying miRNAs in biological fluids offers a useful alternative for the diagnosis, typing and management of cancer patients.

Objectives

Methods

Results

Discussion

Conclusion