Prediction and prioritization of autism-associated long non-coding RNAs using gene expression and sequence features

Jun Wang, Liangjiang Wang

doi:10.1186/s12859-020-03843-5

Open Access

Journal: BMC Bioinformatics
Publication Date: Nov 7, 2020
Citations: 14
License type: open-access

Affiliation: Clemson University, Center for Human Genetics

Abstract

Background: Autism spectrum disorders (ASD) refer to a range of neurodevelopmental conditions, which are genetically complex and heterogeneous with most of the genetic risk factors also found in the unaffected general population. Although all the currently known ASD risk genes code for proteins, long non-coding RNAs (lncRNAs) as essential regulators of gene expression have been implicated in ASD. Some lncRNAs show altered expression levels in autistic brains, but their roles in ASD pathogenesis are still unclear.

Results: In this study, we have developed a new machine learning approach to predict candidate lncRNAs associated with ASD. Particularly, the knowledge learnt from protein-coding ASD risk genes was transferred to the prediction and prioritization of ASD-associated lncRNAs. Both developmental brain gene expression data and transcript sequence were found to contain relevant information for ASD risk gene prediction. During the pre-training phase of model construction, an autoencoder network was implemented for a representation learning of the gene expression data, and a random-forest-based feature selection was applied to the transcript-sequence-derived k-mers. Our models, including logistic regression, support vector machine and random forest, showed robust performance based on tenfold cross-validations as well as candidate prioritization with hypothetical loci. We then utilized the models to predict and prioritize a list of candidate lncRNAs, including some reported to be cis-regulators of known ASD risk genes, for further investigation.

Conclusions: Our results suggest that ASD risk genes can be accurately predicted using developmental brain gene expression data and transcript sequence features, and the models may provide useful information for functional characterization of the candidate lncRNAs associated with ASD.

Highlights

Autism spectrum disorders (ASD) refer to a range of neurodevelopmental conditions, which are genetically complex and heterogeneous with most of the genetic risk factors found in the unaffected general population
To reduce the high dimensionality of input features, an autoencoder network was implemented for the representation learning of gene expression data, and a random forest (RF)-based method was used for the selection of sequence features important for classification
With the updated training dataset, we first examined the performance of logistic regression (LR), support vector machine (SVM) and RF models using the same set of gene expression features (BrainSpan_full)
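
The sequence features mentioned in the highlights are derived from k-mers of the transcript sequence. The following is a minimal sketch of how such k-mer frequency features could be computed, not the authors' implementation. The function name `kmer_frequencies`, the choice of k, the normalization by the number of valid windows, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the relative frequency of every possible k-mer over A, C, G, T.

    Hypothetical helper: the paper does not specify this function or its
    normalization; both are assumptions for this sketch.
    """
    # Treat RNA and DNA alphabets alike by mapping U to T.
    seq = sequence.upper().replace("U", "T")
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are not in the vocabulary,
    # so they are excluded from the total.
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


# Made-up example sequence; real input would be a transcript sequence.
features = kmer_frequencies("ACGTACGTNACG", k=2)
```

Each transcript then yields a fixed-length vector of 4^k values, which is the kind of high-dimensional input that a feature selection step, such as the RF-based one described above, would reduce.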

Summary

Introduction

Autism spectrum disorders (ASD) refer to a broad range of neurodevelopmental conditions characterized by difficulties in social interaction and in verbal and non-verbal communication, and by repetitive behaviors. ASD are genetically complex and heterogeneous, with most of the genetic risk factors also found in the unaffected general population. All the currently known ASD risk genes code for proteins, and some de novo mutations that likely disrupt protein-coding genes have been shown to cause ASD [3,4,5]. However, long non-coding RNAs (lncRNAs), as essential regulators of gene expression, have also been implicated in ASD. A recent analysis based on 1,790 ASD simplex families has revealed that the vast majority of de novo mutations are located in non-coding regions and linked with the IQ heterogeneity of ASD probands [3].

Methods

Results

Discussion

Conclusion