Abstract

Background
With the rising prevalence of Alzheimer’s disease (AD) worldwide, cost‐ and time‐efficient methods for identifying at‐risk individuals are needed. Survey, simple neuropsychological, and demographic data are low‐cost, but each has limited sensitivity or specificity when considered on its own. However, past work has demonstrated the usefulness of statistical machine learning tools for predicting clinical diagnosis from multimodal data. To better understand and quantify Alzheimer’s risk and progression at follow‐up using lower‐cost data, we apply statistical and machine learning tools to screen, cluster, and classify individuals from the National Alzheimer’s Coordinating Center (NACC) dataset.

Method
Previously, we developed hierarchical classification strategies that use lower‐cost data for 3‐class prediction (unimpaired, AD‐MCI, and AD‐dementia). Building on this idea, we use a Gaussian mixture model to infer the subgroup identity of each subject and use that identity to aid classification. Individuals belonging to homogeneous subgroups consisting primarily of a single diagnosis are more likely to also be given that diagnosis. Given a threshold for identifying the primary diagnosis within each cluster, this approach attaches a measure of confidence to screening. Additional classification is performed on subjects who belong to heterogeneous subgroups. The NACC database is funded by NIA/NIH Grant U24 AG072122. NACC data are contributed by the NIA‐funded ADRCs.

Result
The proposed method reveals clinical subgroup trends within the sample and allows for detailed longitudinal risk tracking. We identified five subgroups within the sample, representing five clinical clusters. These subgroups include predominantly unimpaired individuals, mixtures of unimpaired and MCI, mixtures of MCI and AD‐dementia, and individuals predominantly with AD‐dementia.
A large proportion of individuals can be confidently screened based on subgroup identity. Using only baseline visit information, post‐clustering predictions of clinical diagnosis closely reflect conversion to a worse diagnosis within 18 months. Changes in the rate of conversion to a more severe diagnosis depend on cluster identity.

Conclusion
Using cost‐ and time‐efficient data enables more effective screening and tracking of disease progression. The proposed approach of clustering, screening, and prediction yields a measure of confidence in predictions of clinical diagnosis and allows for greater understanding of clinical profiles and progression.
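
The screening step described in the Method can be illustrated with a minimal sketch. It assumes each subject already has a cluster label, for example from a fitted Gaussian mixture model, and a known diagnosis in a labeled reference sample. The function names, the 0.75 purity threshold, and the toy data are all hypothetical and are not taken from the study. Clusters whose dominant diagnosis meets the threshold are treated as homogeneous and screened directly; subjects in the remaining clusters are flagged for additional classification.

```python
from collections import Counter, defaultdict


def screen_by_cluster(cluster_ids, diagnoses, purity_threshold=0.75):
    """Summarize each cluster as (dominant diagnosis, purity, is_homogeneous).

    Purity is the fraction of the cluster's subjects that carry the
    dominant diagnosis. The default threshold is an illustrative choice.
    """
    members = defaultdict(list)
    for cid, dx in zip(cluster_ids, diagnoses):
        members[cid].append(dx)

    summary = {}
    for cid, dx_list in members.items():
        dominant, count = Counter(dx_list).most_common(1)[0]
        purity = count / len(dx_list)
        summary[cid] = (dominant, purity, purity >= purity_threshold)
    return summary


def route_subjects(cluster_ids, summary):
    """Return the screened diagnosis for subjects in homogeneous clusters.

    Subjects in heterogeneous clusters get None, meaning they should be
    passed to a further classifier (not shown here).
    """
    return [summary[cid][0] if summary[cid][2] else None for cid in cluster_ids]


# Toy, made-up data: three clusters with different diagnostic mixes.
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]
diagnoses = (
    ["unimpaired"] * 5
    + ["unimpaired", "AD-MCI", "AD-MCI", "unimpaired"]
    + ["AD-dementia"] * 4
    + ["AD-MCI"]
)

summary = screen_by_cluster(cluster_ids, diagnoses, purity_threshold=0.75)
routed = route_subjects(cluster_ids, summary)

for cid in sorted(summary):
    dominant, purity, homogeneous = summary[cid]
    status = "screen directly" if homogeneous else "classify further"
    print(f"cluster {cid}: {dominant} (purity {purity:.2f}) -> {status}")
```

In practice the purity of each cluster would be estimated on labeled subjects and then applied to new subjects assigned to the same cluster. Raising the threshold trades a smaller directly screened group for higher confidence in the screened diagnosis.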