Multilingual text normalization for computer‐based detection of Alzheimer’s disease

Frederic Abiven, Sylvie Ratté

doi:10.1002/alz.037526

Abstract

Background

Understanding how Alzheimer’s disease affects linguistic functions could improve early detection of the disease, since many studies have demonstrated that language alterations can appear at an early stage. To measure those functions, we process and analyze transcripts from multiple corpora based on the Cookie‐Theft picture description task. Text normalization plays a major role in creating a solid dataset from a heterogeneous corpus of patients’ interviews. Current tools for this type of task are limited and tend to be language and context dependent.

Method

This paper presents a simple and efficient method to process textual data in a pipeline architecture that sequences the sub‐problems of cleaning and normalization. Since some of these sub‐problems are language and context dependent, they were made easily configurable, which increases scalability when dealing with new corpora. Multiple measures are then extracted while cleaning the transcripts, since the removed material also carries valuable information, such as the number of repetitions or incomplete words removed.

Result

Results show that, compared to previous studies, we are able to improve performance on this time-consuming task when working with a multilingual dataset. In fact, we were able to normalize a French and English Cookie‐Theft corpus easily. However, given the great diversity of languages and related structures, the method has certain limitations. Moreover, the measures extracted during cleaning and normalization showed great potential for detecting Alzheimer’s disease. For instance, the number of retracings removed from transcripts revealed a significant correlation (> 0.5) with the severity of cognitive impairment.

Conclusion

This paper contributes to the Alzheimer’s disease literature by presenting an efficient tool that speeds up the cleaning and normalization of transcripts. Furthermore, the measures extracted during this task could improve results when training a predictive model for AD detection, since they capture metrics highly correlated with the patient’s mental health status. Finally, this tool could eventually be used for other types of description tasks, since it is not dependent on the context.
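The pipeline idea described in the Method can be illustrated with a minimal Python sketch. It is not the authors' tool: the names `Pipeline`, `make_regex_step`, `CONFIG_EN` and `CONFIG_FR` are hypothetical, and the annotation markers (`[/]` for a repetition, `[//]` for a retracing, `&` for an incomplete word) and filler lists are assumptions chosen for illustration, loosely modeled on CHAT-style transcripts. Each step removes one kind of material and reports how many items it removed, so the counts become measures extracted while cleaning.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A step takes text and returns (cleaned text, number of items removed).
Step = Callable[[str], Tuple[str, int]]


def make_regex_step(pattern: str) -> Step:
    """Build a step that deletes every match of `pattern` and counts the deletions."""
    compiled = re.compile(pattern)

    def step(text: str) -> Tuple[str, int]:
        return compiled.subn("", text)

    return step


@dataclass
class Pipeline:
    """Runs named steps in sequence and records one count per step."""
    steps: List[Tuple[str, Step]] = field(default_factory=list)

    def run(self, text: str) -> Tuple[str, Dict[str, int]]:
        measures: Dict[str, int] = {}
        for name, step in self.steps:
            text, count = step(text)
            measures[name] = count
        # Collapse the whitespace left behind by the removals.
        return " ".join(text.split()), measures


# Hypothetical marker patterns shared by both corpora (assumed notation).
COMMON = [
    ("retracings", r"\S+\s*\[//\]"),   # word followed by a retracing marker
    ("repetitions", r"\S+\s*\[/\]"),   # word followed by a repetition marker
    ("incomplete_words", r"&\S+"),     # word fragment prefixed with '&'
]

# Language-dependent steps live in the configuration, not in the code.
CONFIG_EN = COMMON + [("fillers", r"\b(?:uh|um)\b")]
CONFIG_FR = COMMON + [("fillers", r"\b(?:euh|ben)\b")]


def build_pipeline(config: List[Tuple[str, str]]) -> Pipeline:
    return Pipeline([(name, make_regex_step(pattern)) for name, pattern in config])


en_text, en_measures = build_pipeline(CONFIG_EN).run(
    "the [/] the boy is &st stealing a cake [//] cookie"
)
fr_text, fr_measures = build_pipeline(CONFIG_FR).run(
    "le garçon euh prend un [/] un biscuit"
)
print(en_text, en_measures)
print(fr_text, fr_measures)
```

Under these assumptions, supporting a new corpus means writing a new configuration list rather than new code, and the per-step counts (for example `retracings`) can be kept alongside the cleaned transcript as candidate features.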
