Using machine learning to detect the differential usage of novel gene isoforms

Xiaopu Zhang, James G D Prendergast, Musa A Hassan

doi:10.1186/s12859-022-04576-3

Open Access

Journal: BMC Bioinformatics
Publication Date: Jan 18, 2022
Citations: 1
License type: open-access

Affiliation: Roslin Institute, University of Edinburgh

Abstract

Background: Differential isoform usage is an important driver of inter-individual phenotypic diversity and is linked to various diseases and traits. However, accurately detecting the differential usage of gene transcripts between groups can be difficult, particularly in less well-annotated genomes where the spectrum of transcript isoforms is largely unknown.

Results: We investigated whether machine learning approaches can detect differential isoform usage based purely on the distribution of reads across a gene region. We show that gradient boosting and elastic net approaches can successfully identify large numbers of genes showing potential differential isoform usage between Europeans and Africans. These genes are enriched among relevant biological pathways and significantly overlap those identified by previous approaches. We demonstrate that diversity at the 3′ and 5′ ends of genes is a primary driver of these differences between populations.

Conclusion: Machine learning methods can effectively detect differential isoform usage from read fraction data, and can provide novel insights into the biological differences between groups.

Highlights

The number of unique mRNA isoforms encoded by the human genome is estimated to be 3–10 times higher than the number of genes [1, 2].
Population RNA-seq datasets: To investigate the ability of machine learning (ML) approaches to detect differential isoform usage from windowed read proportions, we used the GEUVADIS RNA sequencing (RNA-seq) dataset generated by Lappalainen et al. [13].
In this study we aimed to investigate whether ML approaches may have utility in detecting differential isoform usage based purely on the distribution of reads across a gene region, even when the precise isoforms are unknown.
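
The "windowed read proportions" mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `window_read_fractions`, the use of equal-width windows, the default of 20 windows, and the representation of each read by a single 0-based position are all assumptions made for illustration. The idea is only that each gene in each individual is summarised as the fraction of its reads falling in each window, so that a classifier sees where reads lie across the gene rather than how many there are.

```python
from typing import List, Sequence


def window_read_fractions(
    read_positions: Sequence[int],
    gene_start: int,
    gene_end: int,
    n_windows: int = 20,  # hypothetical default, not taken from the paper
) -> List[float]:
    """Fraction of a gene's reads that fall in each equal-width window.

    Positions are 0-based and the gene region is the half-open interval
    [gene_start, gene_end). Reads outside the region are ignored.
    """
    if gene_end <= gene_start:
        raise ValueError("gene_end must be greater than gene_start")
    if n_windows < 1:
        raise ValueError("n_windows must be at least 1")

    length = gene_end - gene_start
    counts = [0] * n_windows
    for pos in read_positions:
        if gene_start <= pos < gene_end:
            # Integer arithmetic maps the offset into a window index 0..n_windows-1.
            counts[(pos - gene_start) * n_windows // length] += 1

    total = sum(counts)
    if total == 0:
        # No reads in the region: return an all-zero profile.
        return [0.0] * n_windows
    # Dividing by the gene's total removes overall expression level,
    # leaving only the shape of the read distribution.
    return [c / total for c in counts]


# Toy example: two individuals, same gene, reads concentrated at opposite ends.
profile_a = window_read_fractions([100, 110, 120, 130, 180, 190], 100, 200, n_windows=4)
profile_b = window_read_fractions([105, 170, 175, 185, 195, 199], 100, 200, n_windows=4)
```

Profiles like `profile_a` and `profile_b`, one row per individual, would then serve as the feature matrix for a classifier such as gradient boosting or an elastic net model, with population as the label.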

Summary

Introduction

The number of unique mRNA isoforms encoded by the human genome is estimated to be 3–10 times higher than the number of genes [1, 2]. This transcript diversity enables increased downstream phenotypic complexity through the expansion of the set of proteins encoded by a comparatively small set of genes [3, 4].

Conclusion

Machine learning methods can effectively detect differential isoform usage from read fraction data, and can provide novel insights into the biological differences between groups.