Addressing the missing data challenge in multi-modal datasets for the diagnosis of Alzheimer’s disease

Maryamossadat Aghili, Solale Tabarestani, Malek Adjouadi

doi:10.1016/j.jneumeth.2022.109582


Open Access



Journal: Journal of Neuroscience Methods	Publication Date: Mar 26, 2022
Citations: 10	License type: publisher-specific-oa

Affiliation: Florida International University

Abstract

Background

One of the challenges facing accurate diagnosis and prognosis of Alzheimer’s disease, beyond identifying the subtle changes that define its early onset, is the scarcity of sufficient data, compounded by the missing data challenge. Although the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database includes many participants, many of the observations are missing a large number of features, which often leads to the exclusion of potentially valuable data points in ongoing experiments, especially in longitudinal studies.

New methods

Motivated by the necessity of examining all participants, even those with missing tests or imaging modalities, this study draws attention to the Gradient Boosting (GB) algorithm, which has an inherent capability of addressing missing values. The four groups considered are Cognitively Normal (CN), Early Mild Cognitive Impairment (EMCI), Late Mild Cognitive Impairment (LMCI) and Alzheimer’s Disease (AD). Before applying state-of-the-art classifiers such as Support Vector Machine (SVM) and Random Forest (RF), the impact of imputing (i.e., replacing) missing data in common datasets with numerical techniques was investigated and compared with the GB algorithm. Empirical evaluations show that GB performance is highly resilient to missing values in comparison with the SVM and RF algorithms. These latter algorithms can, however, be improved when coupled with more sophisticated imputation techniques such as soft-impute or the K-Nearest Neighbors (KNN) algorithm, assuming a low extent of data incompleteness.

Results

Classification accuracy improved by up to 3% in the multiclass classification of all four classes of subjects when all the samples, including the incomplete ones, are considered during the model generation and testing phases.

Comparison with existing methods

Unlike other methods, the proposed approach addresses the challenging multiclass classification of the ADNI dataset in the presence of different levels of missing data points. It also provides a comparative study of the effects of existing imputation techniques on block-wise missing data. Results of the proposed method are validated against gold standard methods used for AD classification.
