Machine learning algorithms are a promising automated tool that could help meet the growing need for dementia experts. Despite substantial progress in MRI-based machine learning analyses, case misclassification is a universal finding, and the reasons behind it are poorly understood. We implemented a multi-class classification approach that uses relevance vector machine and logistic classification to classify research participants based on their whole-brain T1-weighted MRI scans. A total of 468 participants from seven diagnostic classes were included: 144 healthy controls, 84 Alzheimer's disease (AD), 108 behavioral variant frontotemporal dementia (bvFTD), 30 semantic variant primary progressive aphasia (svPPA), 30 non-fluent variant primary progressive aphasia (nfvPPA), 30 corticobasal syndrome (CBS), and 42 progressive supranuclear palsy syndrome (PSPS). We compared the algorithm's diagnostic predictions against clinical, pathological, genetic, and quantitative imaging data. The algorithm predicted the exact neurodegenerative syndrome in 71% of cases, predicted the neurodegenerative disease spectrum in 80% of cases, and distinguished controls from any dementia in 85% of cases. It showed high performance in diagnosing healthy controls, moderate performance in diagnosing AD, bvFTD, and svPPA, and low performance in diagnosing CBS, nfvPPA, and PSPS. Based on the quantitative imaging data, most of the misclassified neurodegenerative cases had minimal atrophy and brain volumes comparable to those of healthy controls. In AD, early-onset cases with minimal brain atrophy accounted for most of the misclassified cases. In bvFTD, carriers of FTD genetic mutations (predominantly the C9orf72 repeat expansion), FTD phenocopy cases, and patients meeting only possible bvFTD criteria accounted for most of the misclassified cases.
Case misclassification in machine learning studies of neurodegenerative diseases results from the heterogeneity of these diseases and from the limited ability of structural MRI to capture the full range of biological changes. Larger and more inclusive datasets that represent the biological heterogeneity of the population are needed to train better machine learning algorithms. A margin of error is to be expected and should be considered acceptable, much like the uncertainty inherent in a clinical diagnosis made by a dementia expert.
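
The three accuracy figures reported above (exact syndrome, disease spectrum, and control versus any dementia) are the same predictions scored at progressively coarser levels of granularity. The following is a minimal sketch of that scoring, not the study's code. The label strings, the function names, and in particular the `SPECTRUM` grouping (which places CBS and PSPS with the FTD-related syndromes) are assumptions made for illustration; the text above does not define how syndromes map to spectra.

```python
# Illustrative sketch only: scores one set of predictions at three granularities.
# ASSUMPTION: the spectrum grouping below is hypothetical; the source text does
# not state which syndromes belong to which disease spectrum.
SPECTRUM = {
    "HC": "control",
    "AD": "AD",
    "bvFTD": "FTLD",
    "svPPA": "FTLD",
    "nfvPPA": "FTLD",
    "CBS": "FTLD",
    "PSPS": "FTLD",
}


def to_spectrum(label):
    """Map a syndrome label to its (assumed) disease spectrum."""
    return SPECTRUM[label]


def to_binary(label):
    """Collapse a syndrome label to control versus any dementia."""
    return "control" if label == "HC" else "dementia"


def accuracy(true_labels, predicted_labels, mapper=lambda label: label):
    """Fraction of cases whose mapped prediction matches the mapped truth."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(
        mapper(t) == mapper(p) for t, p in zip(true_labels, predicted_labels)
    )
    return hits / len(true_labels)


def three_level_accuracy(true_labels, predicted_labels):
    """Return accuracy at the syndrome, spectrum, and binary levels."""
    return {
        "syndrome": accuracy(true_labels, predicted_labels),
        "spectrum": accuracy(true_labels, predicted_labels, to_spectrum),
        "binary": accuracy(true_labels, predicted_labels, to_binary),
    }


# Toy data (made up, not from the study).
toy_true = ["HC", "AD", "bvFTD", "svPPA", "CBS"]
toy_pred = ["HC", "bvFTD", "svPPA", "svPPA", "HC"]
toy_scores = three_level_accuracy(toy_true, toy_pred)
```

Because each coarser level merges classes, a prediction that is correct at the syndrome level is necessarily correct at the spectrum and binary levels, so the three accuracies can only stay equal or increase from fine to coarse, consistent with the 71%, 80%, and 85% pattern reported above.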