Crossing the “Cookie Theft” Corpus Chasm: Applying What BERT Learns From Outside Data to the ADReSS Challenge Dementia Detection Task

Yue Guo, Serguei Pakhomov, Carol Roan, Trevor Cohen, Changye Li

doi:10.3389/fcomp.2021.642517

Abstract

Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. However, in the medical domain in particular, this is also a stumbling block as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task widely used to study Alzheimer's disease (AD) patients' language characteristics and for training classification models to distinguish patients with AD from healthy controls, is relatively small—a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by the addition of normative data from other sources, evaluating the utility of such extrinsic data to further improve the performance of state-of-the-art deep learning based methods on the ADReSS challenge dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1366 additional Cookie Theft Task transcripts, increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, cognitive status of WLS participants can be inferred from results of several cognitive tests including semantic verbal fluency available in WLS data. In this work, we evaluate the utility of using the WLS ‘controls’ (participants without indications of abnormal cognitive status), and these data in conjunction with inferred ‘cases’ (participants with such indications) for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. 
We find that incorporating WLS data when training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that WLS data add value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address issues arising from class imbalance and from confounding effects due to data provenance.
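To make the idea of a weighted cost function concrete, the following is a minimal sketch, not the authors' implementation. It assumes an inverse-frequency weighting scheme (each class weighted by n_samples / (n_classes * class_count)) and a weighted cross-entropy loss; the function names, the class labels and the 9-to-1 class ratio are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood of the true labels.

    probs[i] maps each class to the predicted probability for example i.
    Each example's loss is scaled by the weight of its true class, and the
    sum is normalized by the total weight.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical imbalanced training set: 9 controls for every case.
labels = ["control"] * 9 + ["case"] * 1
weights = inverse_frequency_weights(labels)

# A model that always predicts "control" with probability 0.9.
probs = [{"control": 0.9, "case": 0.1} for _ in labels]

unweighted = weighted_cross_entropy(probs, labels, {"control": 1.0, "case": 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights, unweighted, weighted)
```

With these weights the single minority-class example contributes as much to the loss as the nine majority-class examples combined, so a model that ignores the minority class is penalized more heavily than under the unweighted loss.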

Highlights

Alzheimer’s Dementia (AD) is a debilitating condition with few symptomatic treatments and no known cure
On a random 80/20 train/test split of the DementiaBank data, the authors report a considerable advantage in performance with the addition of the Wisconsin Longitudinal Study (WLS) controls in particular, with improvements of over 10% in macro-averaged F-measure across a range of machine learning methods trained on a set of 567 manually engineered features, with oversampling offering an advantage over training without balancing the set in some but not all methods
We evaluated the utility of the incorporation of additional “Cookie Theft” transcripts drawn from the Wisconsin Longitudinal Study as a means to improve the performance of a Bidirectional Encoder Representations from Transformers (BERT)-based classifier on the Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSS) challenge diagnosis task

Summary

Introduction

Alzheimer’s Dementia (AD) is a debilitating condition with few symptomatic treatments and no known cure. Earlier diagnosis of AD has the potential to ease the burden of disease on patients and caregivers by reducing family conflict and providing more time for financial and care planning (Boise et al., 1999; Bond et al., 2005; Stokes et al., 2015). Delayed diagnosis contributes substantially to the cost of care for this disease, on account of high utilization of emergency rather than routine care, amongst other factors; it is estimated that early and accurate diagnosis can help save $7.9 trillion in medical and care costs (Alzheimer’s Association, 2018). On a random 80/20 train/test split of the DementiaBank data, the authors of the previous work on which this study builds report a considerable advantage in performance with the addition of the WLS controls in particular, with improvements of over 10% (absolute) in macro-averaged F-measure across a range of machine learning methods trained on a set of 567 manually engineered features. Oversampling offered an advantage over training on the unbalanced set for some, but not all, of these methods.
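For readers unfamiliar with the metric quoted above, the sketch below shows how a macro-averaged F-measure can be computed: the F1 score is calculated separately for each class and the per-class scores are averaged with equal weight, so the minority class counts as much as the majority class. The function names and the toy labels are hypothetical and do not come from the paper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: 1 = dementia case, 0 = healthy control (hypothetical labels).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(macro_f1(y_true, y_pred))  # 0.625 (F1 of 0.5 for class 1, 0.75 for class 0)
```

An absolute improvement of 10% on this scale means, for example, a move from 0.70 to 0.80, rather than a 10% relative increase over the baseline score.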

Journal: Frontiers in Computer Science
Publication Date: Apr 16, 2021
Citations: 15
License type: CC BY 4.0
