Abstract
Background
Late‐onset Alzheimer disease (AD) is a heterogeneous disease, as demonstrated by its wide range of genetic and environmental risk factors and the diversity of its clinical manifestations. Cognizant of the futility of a one‐size‐fits‐all approach for heterogeneous diseases, precision medicine aims to treat each subtype appropriately, but methods for successfully diagnosing AD subtypes have been limited. Real‐valued biomarkers, such as omics measurements in plasma or CSF, are frequently evaluated based on fold change (FC) and/or the area under the receiver operating characteristic curve (AUC). We previously demonstrated the inability of FC and AUC to capture signals for subtypes that comprise less than 50% of the diseased individuals, due to inherently low true positive rates, and proposed an alternative approach based on the bimodality of the data [Smith and Climer, https://doi.org/10.1101/2022.02.14.22270972]. This approach is based on statistical characteristics of the data, including skewness, cardinality, and kurtosis. k‐medians clustering, a machine learning approach to identifying natural clusters in data, removes these statistical assumptions, but optimally solving k‐medians is generally NP‐hard, which makes it computationally intractable for datasets of interest. Lloyd’s approximation algorithm is commonly used in practice; however, its accuracy is highly sensitive to the random start state, so it is not suitable when accuracy is paramount.
Method
Owing to the nature of the problem, we present an evaluation metric based on the optimal k‐medians objective with k = 2.
Additionally, because the data are one‐dimensional, we introduce a dynamic programming algorithm to improve computational performance.
Result
Although optimally solving k‐medians is NP‐hard in general, our dynamic programming algorithm for the 2‐medians objective over 1‐dimensional data has an average time complexity of O(n log n) and a worst‐case complexity of O(n²), where n is the number of data values. This objective does not rely on false positive rate, skewness, or kurtosis, making it robust for real‐valued data drawn from heterogeneous AD subtypes.
Conclusion
Biomarker data drawn from heterogeneous AD subtypes will have inherently low true positive rates, resulting in low FC and AUC values. Our machine‐learning‐based approach for evaluating potential biomarkers efficiently produces an optimal clustering of the data and scales well to large datasets.
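To make the 2‐medians objective on one‐dimensional data concrete, the following is a minimal Python sketch. It is not the authors' dynamic programming algorithm, which the abstract does not specify. It substitutes a simpler sort‐and‐scan technique that relies on one known property: in one dimension, optimal clusters are contiguous in sorted order, so it is enough to test every split point of the sorted values. The function name `two_medians_1d` and its return layout are hypothetical choices made for illustration.

```python
from itertools import accumulate


def two_medians_1d(values):
    """Exact 2-medians clustering of 1-D data (illustrative sketch only).

    Returns (best_cost, split_index, sorted_values). The two clusters are
    sorted_values[:split_index] and sorted_values[split_index:].
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values to form two clusters")

    # prefix[i] holds the sum of xs[:i], so any range sum costs O(1).
    prefix = [0] + list(accumulate(xs))

    def segment_cost(lo, hi):
        # Sum of |x - median| over xs[lo:hi] (hi exclusive).
        mid = (lo + hi) // 2          # index of the (upper) median
        med = xs[mid]
        below = med * (mid - lo) - (prefix[mid] - prefix[lo])
        above = (prefix[hi] - prefix[mid]) - med * (hi - mid)
        return below + above

    best_cost, best_split = float("inf"), None
    # Try every way to cut the sorted values into two non-empty runs.
    for split in range(1, n):
        cost = segment_cost(0, split) + segment_cost(split, n)
        if cost < best_cost:
            best_cost, best_split = cost, split
    return best_cost, best_split, xs


if __name__ == "__main__":
    cost, split, ordered = two_medians_1d([12, 1, 11, 2, 10, 3])
    print(cost, ordered[:split], ordered[split:])
```

In this sketch the sort dominates the running time at O(n log n), and the scan over split points is linear because each segment cost is read off the prefix sums. For the toy input above, the best split separates [1, 2, 3] from [10, 11, 12] with a total absolute deviation of 4.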