A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation.

Adam McDermaid, Yiran Zhang, Shaopeng Gu, Xin Chen, Qin Ma, Juan Xie, Cankun Wang

doi:10.3389/fgene.2018.00313

Open Access

https://doi.org/10.3389/fgene.2018.00313


Abstract

One of the main benefits of modern RNA-Sequencing (RNA-Seq) technology is more accurate gene expression estimation compared with previous generations of expression data, such as microarrays. However, numerous issues can cause an RNA-Seq read to be mapped to multiple locations on the reference genome with the same alignment score, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). The impact of these MMRs is reflected in gene expression estimation and all downstream analyses, including differential gene expression and functional enrichment. Current analysis pipelines lack the tools to effectively test the reliability of gene expression estimations, and are thus incapable of ensuring the validity of downstream analyses. Our investigation into 95 RNA-Seq datasets from seven plant and animal species (totaling 1,951 GB) indicates that, on average, roughly 22% of all reads are MMRs. Here we present a machine learning-based tool called GeneQC (Gene expression Quality Control), which can accurately estimate the reliability of each gene's expression level derived from an RNA-Seq dataset. The underlying algorithm is based on extracted genomic and transcriptomic features, which are combined using elastic-net regularization and mixture model fitting to provide a clearer picture of mapping uncertainty for each gene. GeneQC allows researchers to identify reliable expression estimations and conduct further analysis on the gene expression that is of sufficient quality. The tool also enables researchers to investigate continued re-alignment methods to obtain more accurate expression estimates for genes with low reliability. Application of GeneQC reveals a high level of mapping uncertainty in plant samples and limited, severe mapping uncertainty in animal samples. GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html.
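
For readers unfamiliar with elastic-net regularization, the generic penalized least-squares objective is shown below. This is the standard textbook form, not necessarily the exact formulation used in GeneQC; the symbols y (response), X (feature matrix), β (coefficients), n (number of observations), λ (penalty strength), and α (mixing weight) are introduced here for illustration and do not appear in the abstract.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right),
\qquad \lambda \ge 0, \; 0 \le \alpha \le 1
```

Setting α = 1 recovers the lasso penalty and α = 0 the ridge penalty; intermediate values blend the two. In a setting like the one described, the columns of X would presumably hold the extracted genomic and transcriptomic features, but that mapping is an assumption here.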

Highlights

RNA-Seq is a revolutionary high-throughput process that allows researchers to observe the genetic makeup of a particular sample (Wang et al., 2009; Garber et al., 2011; Ozsolak and Milos, 2011) and can assist in determination of regulatory mechanisms and transcription unit prediction (Chou et al., 2015; Chen et al., 2017).
To address the issue of mapping uncertainty, we present the machine learning-based tool GeneQC (Figure 1), which uses extracted multi-level features combined with novel applications of regularized regression and mixture model fitting approaches to quantify the mapping uncertainty issue (McDermaid et al., 2018b).
GeneQC takes as inputs three pieces of information that are found in most RNA-Seq analysis pipelines: (1) the read mapping result SAM file; (2) the fasta reference genome corresponding to the to-be-analyzed species; and (3) the species-specific annotation general feature format file (Figure 1B).

Summary

Introduction

RNA-Seq is a revolutionary high-throughput process that allows researchers to observe the genetic makeup of a particular sample (Wang et al., 2009; Garber et al., 2011; Ozsolak and Milos, 2011) and can assist in determination of regulatory mechanisms and transcription unit prediction (Chou et al., 2015; Chen et al., 2017). The nature of DNA, long strands of millions of base pairs created by a reordering of the four nucleotides, makes it inevitable that some similarities and duplications will occur throughout the genome. This can lead to ambiguity during read mapping, with specific reads being aligned to multiple locations across the reference genome with the same alignment scores (Li et al., 2009; Oshlack et al., 2010; Swan, 2013; Trapnell et al., 2013; Baruzzo et al., 2017).
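
To make the notion of a multiple-mapping read concrete, the following is a minimal sketch that estimates the fraction of mapped reads with more than one alignment record in SAM-formatted text. It is not GeneQC's method: it simply counts records per read (and per mate, for paired-end data) using the standard SAM flag bits, so it assumes the aligner was configured to report secondary alignments, and it does not check whether the alignment scores are equal. The function name `mmr_fraction` and the toy records are hypothetical.

```python
from collections import Counter

# Standard SAM flag bits (from the SAM format specification).
UNMAPPED = 0x4         # segment unmapped
FIRST_MATE = 0x40      # first segment in the template
LAST_MATE = 0x80       # last segment in the template
SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment record


def mmr_fraction(sam_lines):
    """Return the fraction of mapped reads that have more than one alignment record."""
    hits = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        # Ignore unmapped reads and supplementary pieces of a chimeric alignment.
        if flag & (UNMAPPED | SUPPLEMENTARY):
            continue
        # Mates of a pair share a QNAME, so count them separately.
        mate = 1 if flag & FIRST_MATE else 2 if flag & LAST_MATE else 0
        hits[(qname, mate)] += 1
    if not hits:
        return 0.0
    multi = sum(1 for n in hits.values() if n > 1)
    return multi / len(hits)


# Hypothetical toy records (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, ...), not real data.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",   # uniquely mapped
    "r2\t0\tchr1\t500\t1\t50M\t*\t0\t0\t*\t*",     # primary alignment of r2
    "r2\t256\tchr2\t900\t1\t50M\t*\t0\t0\t*\t*",   # secondary alignment of r2
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",            # unmapped, ignored
]
print(mmr_fraction(toy_sam))
```

On a real file the same function could be fed an open file handle, for example `with open(path) as fh: mmr_fraction(fh)`, where `path` is a placeholder for a SAM file; results will depend heavily on how many alignments per read the aligner was asked to report.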

Journal: Frontiers in Genetics
Publication Date: Aug 14, 2018
Citations: 21
License type: CC BY 4.0
