Abstract

Alzheimer’s disease (AD) is a complex neurodegenerative disorder that affects thinking, memory, and behavior. Limbic-predominant age-related TDP-43 encephalopathy (LATE) is a recently identified, common neurodegenerative disease that mimics the clinical symptoms of AD. The development of drugs to prevent or treat these diseases has been slow, partly because the genes associated with them are incompletely understood. A notable hindrance from a data analysis perspective is that clinical samples from patients and controls are usually highly imbalanced, which makes it challenging to apply most existing machine learning algorithms directly to such datasets. Meeting this challenge is critical, because identifying disease-associated genes more specifically may yield new insights into underlying disease-driving mechanisms, help find biomarkers, and, in turn, improve prospects for effective treatment strategies. To detect disease-associated genes from imbalanced transcriptome-wide data, we propose an integrated multiple random forests (IMRF) algorithm. IMRF is effective at identifying putative genes that differentiate subjects with LATE and/or AD from controls based on transcriptome-wide data, thereby enabling effective discrimination between these samples. Several forms of validation testify to the effectiveness of IMRF in identifying genes with altered expression in LATE and/or AD: cross-domain verification of our method on other datasets; improved and competitive classification performance when using the identified genes; effectiveness on testing data with a classifier that is completely independent of decision trees and random forests; and relationships to prior AD and LATE studies of genes linked to neurodegeneration.
We conclude that IMRF, as an effective feature selection algorithm for imbalanced data, shows promise for facilitating the development of new gene biomarkers as well as targets for effective strategies of disease prevention and treatment.

Highlights

  • Dementia represents a set of slowly progressing neurodegenerative disorders with enormous public health impact, caused by a number of different underlying diseases

  • We used integrated multiple random forests (IMRF) to identify 31 genes with disease-related differential expression. By ranking these genes, using ANOVA to calculate the p-value of each IMRF-selected gene, and relating them to prior neurodegeneration and aging studies in Table 4, we demonstrated that IMRF was effective at identifying informative genes potentially associated with neurodegenerative diseases

  • Although about half of the top-ranked genes were already implicated in neuropathologies such as Alzheimer’s disease (AD) by prior studies in the literature, to the best of our knowledge, the remaining genes have not previously been reported as associated with neurodegenerative diseases
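
The ANOVA-based ranking mentioned in the highlights can be illustrated with a minimal sketch. This is not the authors' implementation: the gene names, expression values, and the helper names `anova_f` and `rank_genes` are invented for illustration, and only the one-way ANOVA F statistic is computed. A p-value would additionally require the F distribution (for example, `scipy.stats.f_oneway` returns both the statistic and the p-value).

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        # Degenerate case: no variation inside any group.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(expression, labels):
    """Rank genes by F statistic, most class-discriminating first.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of class labels, aligned with the samples
    """
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, y in zip(values, labels) if y == c] for c in classes]
        scores[gene] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, deliberately imbalanced toy data: 6 controls, 2 cases.
labels = ["control"] * 6 + ["case"] * 2
expression = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 7.0, 7.2],  # shifted in cases
    "GENE_B": [3.0, 3.4, 2.6, 3.1, 2.9, 3.0, 3.1, 2.9],  # no class difference
}
ranked = rank_genes(expression, labels)
print(ranked)
```

In this toy example the gene whose expression differs between classes is ranked ahead of the one that does not. With real data, the class imbalance affects the degrees of freedom and the power of the test, so the ranking should be read alongside the other validations described in the abstract.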

Introduction

Dementia represents a set of slowly progressing neurodegenerative disorders with enormous public health impact, caused by a number of different underlying diseases. Limbic-predominant age-related TDP-43 encephalopathy (LATE) was recently defined [1]. LATE mimics the AD-type dementia syndrome; it may present in isolation or be comorbid with AD [2]. Existing research has revealed that AD, a chronic age-related neurodegenerative disease, usually starts slowly, and that cognitive deterioration in LATE alone is even slower than in AD alone; comorbid AD and LATE, however, typically causes a more rapid clinical decline than either disease alone. There are currently no effective techniques to confidently diagnose LATE, or to distinguish LATE from AD, using clinically available biomarkers, including disease-associated genes.
