Dementia, which affects around 55 million people worldwide, arises from a variety of diseases and injuries that affect the brain, with Alzheimer's disease being its most prevalent cause. Current clinical diagnosis often relies on biomarkers of the disease's distinctive features. Electroencephalography (EEG) offers a cost-effective, user-friendly, and safe source of such biomarkers for early Alzheimer's detection. In this study, EEG signals are processed with the Short-Time Fourier Transform (STFT) to generate spectrograms that expose the time-frequency properties of the signals. Leveraging the BrainLat database, we propose SpectroCVT-Net, a novel convolutional vision transformer architecture incorporating channel attention mechanisms. SpectroCVT-Net combines convolutional and attention mechanisms to capture both local and global dependencies within the spectrograms. The model comprises a feature extraction stage and a classification stage, and it improves classification accuracy over transfer learning methods, reaching 92.59 ± 2.3% across three classes: Alzheimer's disease, behavioral variant frontotemporal dementia (bvFTD), and healthy controls.

This article introduces the new architecture and evaluates its efficacy on data not conventionally used for Alzheimer's diagnosis. Its contributions are: SpectroCVT-Net, tailored for EEG spectrogram classification without reliance on transfer learning; a convolutional vision transformer (CVT) module in the classification stage, integrating local feature extraction with attention heads for global context analysis; Grad-CAM analysis of the network's decisions, identifying the layers, frequencies, and electrodes that most influence classification; and enhanced interpretability through spectrograms, highlighting the brain wave contributions that drive the Alzheimer's, frontotemporal dementia, and healthy control classifications, which may aid clinical diagnosis and management.
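As a rough illustration of the preprocessing step described above, the sketch below converts a single EEG channel into a log-magnitude STFT spectrogram using scipy.signal.stft. The sampling rate, window length, overlap, and frequency cutoff are illustrative assumptions, not the parameters used in this work, and the synthetic signal merely stands in for a real EEG recording.

```python
# Minimal sketch: turning one EEG channel into a log-magnitude STFT spectrogram.
# All parameters below (sampling rate, window, overlap, 45 Hz cutoff) are assumptions
# chosen for illustration, not the settings used in the paper.
import numpy as np
from scipy.signal import stft

fs = 500                     # assumed sampling rate in Hz
duration = 10                # seconds of signal
rng = np.random.default_rng(0)
eeg_channel = rng.standard_normal(fs * duration)   # stand-in for a real EEG channel

# STFT with a 1-second Hann window and 50% overlap (assumed settings)
freqs, times, Z = stft(eeg_channel, fs=fs, window="hann",
                       nperseg=fs, noverlap=fs // 2)

# Keep a low-frequency band covering typical EEG rhythms and take log magnitude
band = freqs <= 45
spectrogram = 20 * np.log10(np.abs(Z[band]) + 1e-12)

print(spectrogram.shape)     # (frequency bins, time frames): the image fed to the classifier
```

In this sketch, one such spectrogram per channel would form the image-like input that a network such as SpectroCVT-Net classifies; the exact channel handling and normalization follow the paper, not this example.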