Abstract

Background: Speech-language pathologists (SLPs) and other clinicians often use aphasia batteries, such as the Western Aphasia Battery-Revised (WAB-R), to evaluate both the severity and classification of aphasia. However, the fluency scale on the WAB-R is not entirely objective and has been found to have less than ideal inter-rater reliability, owing to variability in how raters weigh one dimension (e.g., articulatory effort or grammaticality) against another. This limitation has implications for aphasia classification. The subjectivity might be mitigated through the implementation of machine learning to identify fluent and non-fluent speech.

Aims: We hypothesized that two models consisting of convolutional and recurrent neural networks could identify fluent and non-fluent aphasia, as judged by SLPs, with greater reliability than the WAB-R fluency scale.

Methods & Procedures: The training and testing datasets for the networks were collected from the public domain, and the validation dataset was collected from participants in post-stroke aphasia studies. We used Kappa scores to evaluate inter-rater reliability among SLPs, and between the networks and the SLPs.

Outcomes & Results: The model for detecting non-fluent aphasia achieved high accuracy on the training dataset after 10 epochs (i.e., complete passes through the entire dataset) and 81% accuracy on the public domain testing samples. The model for detecting fluent speech had high training accuracy and 83% testing accuracy. Across samples, agreement among SLPs on the precise WAB-R fluency score ranged from poor to perfect, but agreement on non-fluent (score 0-4) versus fluent (score 5-9) was substantial. Agreement between the models and the SLPs was moderate for identifying non-fluent speech and substantial for identifying fluent speech. When SLPs were asked to identify each sample as fluent, non-fluent, or mixed (without using the fluency scale), agreement between SLPs was almost perfect (Kappa 0.94). Agreement between the SLPs' trichotomous judgement and the models was fair for detecting non-fluent speech and substantial for detecting fluent speech.

Conclusions: Results indicate that neither the WAB-R fluency scale nor the machine learning algorithms were as useful (reliable and valid) as a simple trichotomous judgement of fluent, non-fluent, or mixed by SLPs. These results, together with data from the literature, indicate that it is time to reconsider use of the WAB-R fluency scale for classification of aphasia. It is also premature, at present, to rely on machine learning to rate spoken language fluency.
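The abstract does not describe the architecture or input features of the two networks. As a minimal illustrative sketch only, not the authors' model, the following shows what a small convolutional-plus-recurrent binary classifier over spectrogram-like speech features could look like, assuming TensorFlow/Keras and a hypothetical input shape:

```python
from tensorflow.keras import layers, models

# Illustrative only: the paper does not specify its architecture or features.
# Input: a hypothetical 300-frame x 128-bin log-mel spectrogram per speech sample.
model = models.Sequential([
    layers.Input(shape=(300, 128)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # local acoustic patterns
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                      # temporal structure across frames
    layers.Dense(1, activation="sigmoid"),                # P(non-fluent speech)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(train_features, train_labels, epochs=10)  # 10 epochs, as in the abstract
```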
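The reliability analysis contrasts agreement on the exact 0-9 fluency score with agreement after dichotomizing into non-fluent (0-4) versus fluent (5-9). A minimal sketch of that comparison using Cohen's Kappa, assuming scikit-learn and entirely invented rater scores (slp_a and slp_b are hypothetical, not study data):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical WAB-R fluency scores (0-9) from two SLPs for ten samples;
# the values below are invented for illustration, not taken from the study.
slp_a = [2, 7, 4, 8, 1, 9, 3, 6, 5, 0]
slp_b = [3, 7, 5, 8, 2, 9, 4, 6, 5, 1]

# Agreement on the precise 0-9 score.
kappa_exact = cohen_kappa_score(slp_a, slp_b)

# Agreement after dichotomizing: non-fluent (0-4) vs fluent (5-9).
def dichotomize(scores):
    return ["fluent" if s >= 5 else "non-fluent" for s in scores]

kappa_dichotomous = cohen_kappa_score(dichotomize(slp_a), dichotomize(slp_b))

print(f"exact-score kappa: {kappa_exact:.2f}")
print(f"dichotomized kappa: {kappa_dichotomous:.2f}")
```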
