Comprehensive machine-learning-based analysis of microRNA–target interactions reveals variable transferability of interaction rules across species

Gilad Ben Or, Isana Veksler-Lublinsky

doi:10.1186/s12859-021-04164-x

Abstract

Background: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally via base-pairing with complementary sequences on messenger RNAs (mRNAs). Due to the technical challenges involved in the application of high-throughput experimental methods, datasets of direct bona fide miRNA targets exist only for a few model organisms. Machine learning (ML)-based target prediction models were successfully trained and tested on some of these datasets. There is a need to further apply the trained models to organisms in which experimental training data are unavailable. However, it is largely unknown how the features of miRNA–target interactions evolve and whether some features have remained fixed during evolution, raising questions regarding the general, cross-species applicability of currently available ML methods.

Results: We examined the evolution of miRNA–target interaction rules and used data science and ML approaches to investigate whether these rules are transferable between species. We analyzed eight datasets of direct miRNA–target interactions in four species (human, mouse, worm, cattle). Using ML classifiers, we achieved high accuracy for intra-dataset classification and found that the most influential features of all datasets overlap significantly. To explore the relationships between datasets, we measured the divergence of their miRNA seed sequences and evaluated the performance of cross-dataset classification. We found that both measures coincide with the evolutionary distance between the compared species.

Conclusions: The transferability of miRNA–targeting rules between species depends on several factors, the most associated factors being the composition of seed families and evolutionary distance. Furthermore, our feature-importance results suggest that some miRNA–target features have evolved while others remained fixed during the evolution of the species. Our findings lay the foundation for the future development of target prediction tools that could be applied to “non-model” organisms for which minimal experimental data are available.

Availability and implementation: The code is freely available at https://github.com/gbenor/TPVOD.

Highlights

MicroRNAs are small non-coding RNAs that regulate gene expression post-transcriptionally via base-pairing with complementary sequences on messenger RNAs.
Mature, functional miRNAs associate with Argonaute proteins to form the core of the miRNA-induced silencing complex (miRISC). miRISC uses the sequence information in the miRNA as a guide to recognize and bind partially complementary sequences in the 3′ untranslated region (UTR) of target messenger RNAs (mRNAs). miRISC binding typically leads to translational inhibition and/or degradation of the targeted mRNAs [2]. miRNAs are evolutionarily conserved and are present in the genomes of animals, plants, and viruses [3]. They have diverse developmental and physiological functions and have been implicated in numerous human diseases [4].
The pipeline produced final datasets of various sizes: four small datasets (500–1200 interactions), two large datasets (2000–5000 interactions), and two massive datasets (~18,000 interactions each). As these final datasets were later used as input for machine learning (ML) tasks, we complemented them with synthetically generated negative interactions, as described in Methods: Generation of negative interactions.
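To make the idea of synthetically generated negative interactions concrete, the following is a minimal sketch of one common approach, not necessarily the procedure the authors used. It assumes that a negative example can be built by shuffling the miRNA sequence of a positive pair until neither seed window (taken here as positions 2–7 and 3–8) matches a seed of any real miRNA in the dataset, while the target site is kept unchanged. The function names, the seed definition, and the toy sequences are all illustrative assumptions.

```python
import random
from typing import List, Set, Tuple

Interaction = Tuple[str, str]  # (miRNA sequence, target-site sequence)


def seeds(mirna: str) -> Set[str]:
    """Return the two seed windows assumed here: positions 2-7 and 3-8 (1-based)."""
    return {mirna[1:7], mirna[2:8]}


def mock_mirna(mirna: str, real_seeds: Set[str], rng: random.Random,
               max_tries: int = 1000) -> str:
    """Shuffle the miRNA until none of its seed windows equals a real seed."""
    bases = list(mirna)
    for _ in range(max_tries):
        rng.shuffle(bases)
        candidate = "".join(bases)
        if seeds(candidate).isdisjoint(real_seeds):
            return candidate
    raise RuntimeError("no seed-free shuffle found; increase max_tries")


def make_negatives(positives: List[Interaction],
                   rng: random.Random) -> List[Interaction]:
    """Pair each original target site with a mock (shuffled) miRNA."""
    real_seeds: Set[str] = set()
    for mirna, _ in positives:
        real_seeds |= seeds(mirna)
    return [(mock_mirna(mirna, real_seeds, rng), site)
            for mirna, site in positives]


# Toy sequences for illustration only; they are not taken from the datasets.
positives = [
    ("UGAGGUAGUAGGUUGUAUAGUU", "AACUAUACAACCUACUACCUCA"),
    ("UAGCUUAUCAGACUGAUGUUGA", "UCAACAUCAGUCUGAUAAGCUA"),
]
negatives = make_negatives(positives, random.Random(0))
```

Shuffling preserves the nucleotide composition of each miRNA, so under this assumption a classifier cannot separate positives from negatives by base content alone. Producing one negative per positive also keeps the two classes balanced.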

Summary

Introduction

MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally via base-pairing with complementary sequences on messenger RNAs (mRNAs). Several experimental high-throughput methods for identifying miRNA targets have been developed in recent years [5, 6], of which the most common and straightforward approach is based on measuring changes in mRNA levels following miRNA over-expression or inhibition in tissue-cultured cells [7]. This approach has several major limitations [5, 6]. For direct regulation, the exact sequences of binding sites are unknown and must be predicted within the relevant mRNA sequence. Such experimental settings may represent a non-physiological context for miRNA activity, which does not reflect endogenous targeting rules. This approach may also miss signals of translation-efficiency inhibition, which affects gene expression but is not reflected in changes in mRNA levels [8].

Objectives

Methods

Results

Discussion

Conclusion