Abstract

In recent years, much research has focused on using machine learning (ML) for disease prediction based on gene expression (GE) data. However, while many diseases have received considerable attention, others, including Alzheimer’s disease (AD), have not, perhaps due to a shortage of data. The present work is intended to fill this gap by introducing a symmetric framework to predict AD from GE data, with the aim of producing the most accurate prediction using the smallest number of genes. The framework works in four stages after it receives a training dataset: pre-processing, gene selection (GS), classification, and AD prediction. The symmetry of the model is manifested in all of its stages. In the pre-processing stage, gene columns in the training dataset are pre-processed identically. In the GS stage, the same user-defined filter metrics are invoked on every gene individually, and so are the same user-defined wrapper metrics. In the classification stage, a number of user-defined ML models are applied identically using the minimal set of genes selected in the preceding stage. The core of the proposed framework is a meticulous GS algorithm designed to nominate eight subsets of the original set of genes provided in the training dataset. Exploring the eight subsets, the algorithm selects the best one to describe AD, as well as the best ML model to predict the disease using this subset. For credible results, the framework calculates performance metrics using repeated stratified k-fold cross-validation. To evaluate the framework, we used an AD dataset of 1157 cases and 39,280 genes, obtained by combining a number of smaller public datasets. The cases were split into two partitions: 1000 for training/testing, using 10-fold cross-validation repeated 30 times, and 157 for validation. From the training/testing phase, the framework identified only 1058 genes as the most relevant and the support vector machine (SVM) model as the most accurate with these genes. In the final validation, we used the 157 cases that were never seen by the SVM classifier. We evaluated the classifier via six metrics, obtaining 0.97, 0.97, 0.98, 0.945, 0.972, and 0.975 for the sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy, respectively.
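As an illustration of the evaluation protocol described above, the following is a minimal sketch, assuming scikit-learn, of a hybrid filter/wrapper gene-selection step followed by an SVM, scored with repeated stratified k-fold cross-validation on the six reported metrics. The filter (ANOVA F-test), the wrapper (RFE), and all parameter values are illustrative stand-ins rather than the paper's actual GS algorithm, and the data are random placeholders.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, recall_score, precision_score, cohen_kappa_score

# Placeholder GE matrix (cases x genes) and labels (1 = AD, 0 = control);
# a real run would use the pre-processed training partition instead.
rng = np.random.default_rng(0)
X = rng.random((200, 2000))
y = rng.integers(0, 2, 200)

pipeline = Pipeline([
    ("filter", SelectKBest(f_classif, k=200)),                                  # per-gene filter metric
    ("wrapper", RFE(SVC(kernel="linear"), n_features_to_select=50, step=10)),   # wrapper refinement
    ("clf", SVC(kernel="linear")),                                              # final classifier
])

# Repeated stratified k-fold CV; the paper repeats 10-fold CV 30 times,
# n_repeats is kept small here only so the sketch runs quickly.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scoring = {
    "sensitivity": make_scorer(recall_score),               # recall on the AD class
    "specificity": make_scorer(recall_score, pos_label=0),  # recall on controls
    "precision": make_scorer(precision_score),
    "kappa": make_scorer(cohen_kappa_score),
    "auc": "roc_auc",
    "accuracy": "accuracy",
}
scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring, n_jobs=-1)
for name in scoring:
    print(name, round(scores[f"test_{name}"].mean(), 3))
```

Running the pipeline inside cross-validation ensures the filter and wrapper steps are refit on every training fold, so the reported metrics are not inflated by selecting genes on the test folds.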

Highlights

  • Alzheimer’s disease (AD) is the most common cause of dementia and memory loss, in addition to being a major cause of death

  • A comprehensive framework to diagnose AD from gene expression (GE) data

  • A novel gene selection (GS) methodology based on hybrid filter/wrapper selection methods

  • The use of six different performance metrics to evaluate the proposed framework

  • Performance that, as demonstrated by experimental results, exceeds state-of-the-art GE-based AD prediction frameworks

  • An enrichment of the literature on AD prediction based on GE data, which is admittedly sparse compared to the literature on other diseases

  • We have presented a framework for the GE-based prediction of Alzheimer’s disease (AD), a disease that has not received enough attention in the literature


Summary

Introduction

Alzheimer’s disease (AD) is the most common cause of dementia and memory loss, in addition to being a major cause of death. The diagnosis of various diseases is nowadays possible thanks to gene expression data, which are the basis of the present work. Such data are obtained through the powerful technology of DNA microarrays [19]. The authors in [22] use blood-derived gene expression biomarkers to distinguish AD cases from other sick and healthy cases. They use XGBoost classification models and succeed in detecting AD in a heterogeneous aging population. In [27], the authors use blood gene expression data obtained from the ANM and dementia case registry (DCR) cohorts. They employ recursive feature elimination (RFE) for GS and a random forest (RF) classifier for classifying AD cases.
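As a rough illustration of the approach attributed to [27], the sketch below pairs recursive feature elimination with a random forest classifier using scikit-learn. The data, gene counts, and parameter values are hypothetical placeholders, not the ANM/DCR cohort data or the settings used in that study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder blood GE matrix and AD/control labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((300, 2000))
y = rng.integers(0, 2, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Recursive feature elimination driven by random-forest feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=0)
selector = RFE(rf, n_features_to_select=50, step=0.1)  # drop 10% of genes per iteration
selector.fit(X_train, y_train)

# Refit the RF on the retained genes and evaluate on the held-out split
rf.fit(selector.transform(X_train), y_train)
pred = rf.predict(selector.transform(X_test))
print("genes kept:", int(selector.support_.sum()))
print("held-out accuracy:", accuracy_score(y_test, pred))
```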

Method
Materials and Methods
Integration of Datasets
Preprocessing
Gene Selection (GS)
Classification
Experimental Work
Findings
Conclusions
