Abstract

Background
It is crucial to identify patients with Alzheimer's disease (AD). To this end, various AI and non-AI assessments have been developed, some of which take the form of acoustic- and linguistic-based classifiers. These classifiers detect impairments in patients' language and speech, which can manifest years before the other cognitive impairments associated with AD appear. However, most current vocal and language classifiers of AD have been trained on the Pitt corpus, a dataset that is imbalanced in both class and gender. These two characteristics could be enough to bias such classifiers, making them untrustworthy and unsuitable for integration into AD care settings. This paper presents a novel method for collecting vocal data that reduces the impact of potential sources of bias on acoustic and linguistic classifiers for AD.

Method
We will use a data collection approach that records the voices of participants (i.e., patients with AD and healthy controls) who are diverse in race, gender, and educational, socioeconomic, and cultural background. Participants will be asked to perform a range of language tasks, including word association, verbal fluency, and grammaticality judgment tasks. Following this approach can help ensure that the collected speech data are not affected by selection, recruitment, sociolinguistic, and gender biases.

Results
The main result of this study is a benchmark vocal dataset that is drawn from diverse gender and racial groups and is representative of the population with AD. We expect such data to be used for evaluation and validation, and to help AI developers build fair acoustic and linguistic classifiers of AD. This, in turn, can motivate healthcare professionals to use these systems as assistants for quickly identifying patients with AD from their voices.

Conclusion
The Pitt corpus is a biased dataset.
Thus, acoustic and linguistic classifiers trained on it cannot be considered trustworthy AI systems. AI developers who aim to deploy vocal systems in Alzheimer's disease care settings need unbiased vocal data. This study proposes a method to collect such data.