Evaluation of German Automatic Speech Recognition solutions in the context of speech and language therapy support of people with aphasia

Abstract

People with aphasia benefit from digital speech and language therapy solutions, and automatic speech recognition (ASR) has already been used to give feedback on the correctness of answers in naming exercises. The aphaDIGITAL application is intended to provide German-speaking users with detailed feedback on phonemic/phonetic and semantic errors, based on automatic speech and language processing. For this purpose, open-source ASR solutions for German were evaluated on different corpora of atypical speech, including two small datasets with aphasic speech samples. Character error rate, the number of precisely recognized items, and the number of empty outputs served as evaluation metrics. The four selected models are generally robust to deteriorated speech and audio quality and consistently outperform commercial models in atypical speech recognition. The application of an error-acceptance threshold, the additional use of a phonemic error rate, and other insights relevant to implementing ASR in aphaDIGITAL are discussed.
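
As a minimal illustration of how such an error-acceptance threshold could be applied on top of CER, consider the sketch below; the 0.2 cutoff, the example word, and all function names are illustrative assumptions rather than details taken from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def accept(reference: str, asr_output: str, threshold: float = 0.2) -> bool:
    """Treat a naming-exercise answer as correct if CER stays under a
    (hypothetical) acceptance threshold; empty outputs are rejected and
    counted separately, as in the evaluation above."""
    if not asr_output:
        return False
    return cer(reference, asr_output) <= threshold

print(accept("Banane", "Banane"))  # True: exact match
print(accept("Banane", "Banone"))  # True: 1 substitution in 6 characters
print(accept("Banane", ""))        # False: empty ASR output
```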

Similar Papers
  • Research Article
  • 10.1080/02687038.2026.2621235
Evaluating ASR for aphasia: a framework for clinically relevant transcription performance
  • Feb 16, 2026
  • Aphasiology
  • Julien Dupuis Desroches + 2 more

Background: The clinical adoption of automatic speech recognition (ASR) systems in speech pathology for people with aphasia (PwA) is primarily limited by performance and stability issues. The possibility of detailed evaluation of such systems is critical, as the clinical assessment of patients and their treatments is highly dependent on ASR accuracy and precision.

Aims: This study aims to address the limitations of ASR evaluation metrics, such as Word Error Rate (WER), by introducing a granular evaluation approach. The goal is to develop and apply a framework for analyzing ASR system performance, specifically on speech phenomena relevant to clinical aphasia assessment, including disfluencies, grammatical errors, and filler words. English and French data are used to test ASR transcription performance in both languages.

Methods & Procedures: We present a novel evaluation framework and accompanying software tool to assess ASR models on aphasic speech. Using annotated transcripts from the AphasiaBank database, the framework measures performance across multiple linguistic phenomena by mapping ASR-generated transcriptions against reference CHAT-encoded transcripts. Key metrics such as Character Error Rate (CER) are computed per phenomenon using Levenshtein distance, enabling a fine-grained analysis of transcription accuracy. To demonstrate the variability in performance, we applied the framework to three different ASR models, highlighting fluctuations across speech phenomena.

Outcomes & Results: Evaluation of the Whisper Large-V3 model on the English AphasiaBank dataset revealed significant variability in transcription accuracy across speech phenomena. CER ranged from 25.28% on unannotated words to 87.82% on filled pauses, a delta of over 62 percentage points. Prompting reduced the CER for filled pauses from 87.82% to 44.04%, demonstrating that task-specific tuning can yield substantial gains. Compared to control participants, PwA transcripts obtained CERs that were up to 20 percentage points higher, depending on the speech phenomenon. The largest gaps appeared in phonological and grammatical errors. In a multi-language evaluation, transcription performance on French aphasic speech was notably worse, with CERs averaging 32% higher than their English counterparts. Morphological errors in French reached up to 43.33%, compared to 29.60% in English, and semantic errors showed deltas exceeding 20 percentage points.

Conclusions: These indicators can help better understand the underlying behaviours of ASR models, thus offering better insight into their reliability. With this framework, ASR systems can be evaluated to detect weaknesses and highlight areas for improvement, ensuring their suitability for clinical use. By offering a granular analysis of these factors, the tool empowers clinicians and researchers to make informed decisions about integrating ASR systems into speech pathology workflows.
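
A minimal sketch of the per-phenomenon CER idea described above, assuming spans have already been aligned with their phenomenon labels; the example triples, the labels, and the use of the open-source jiwer library are assumptions (the paper provides its own software tool):

```python
from collections import defaultdict

import jiwer  # pip install jiwer

# (reference span, ASR hypothesis span, phenomenon label) triples, as they
# might be derived from CHAT-encoded AphasiaBank transcripts (hypothetical)
spans = [
    ("the boy climbs", "the boy climbs", "unannotated"),
    ("um",             "and",            "filled_pause"),
    ("tup",            "cup",            "phonological_error"),
]

# Pool reference/hypothesis pairs per phenomenon label
by_label = defaultdict(lambda: {"ref": [], "hyp": []})
for ref, hyp, label in spans:
    by_label[label]["ref"].append(ref)
    by_label[label]["hyp"].append(hyp)

# CER per phenomenon: Levenshtein edits over the pooled spans
for label, pair in by_label.items():
    score = jiwer.cer(pair["ref"], pair["hyp"])
    print(f"{label}: CER = {score:.2%}")
```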

  • Conference Article
  • Cite Count: 24
  • 10.18653/v1/2021.eacl-main.58
A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
  • Jan 1, 2021
  • Yerbolat Khassanov + 5 more

We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.

  • Research Article
  • Cite Count: 1
  • 10.33411/ijist/202461115131
Towards End-to-End Speech Recognition System for Pashto Language Using Transformer Model
  • Feb 25, 2024
  • International Journal of Innovations in Science and Technology
  • Munazza Sher + 2 more

The conventional use of Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) for speech recognition posed setup challenges and inefficiencies. This paper adopts the Transformer model for Pashto continuous speech recognition, offering an end-to-end (E2E) system that directly maps acoustic signals to the label sequence, simplifying implementation. This study introduces a Transformer model leveraging its state-of-the-art capabilities, including parallelization and self-attention mechanisms. With limited data for Pashto, the Transformer is chosen for its proficiency in handling such constraints. The objective is to develop an accurate Pashto speech recognition system. Using 200 hours of conversational data, the study achieves a Word Error Rate (WER) of up to 51% and a Character Error Rate (CER) of up to 29%. The model's parameters were fine-tuned and the dataset size increased, leading to significant improvements. Results demonstrate the Transformer's effectiveness, showcasing its prowess in limited-data scenarios. The study attains notable WER and CER metrics, affirming the model's ability to recognize Pashto speech accurately. In conclusion, the study establishes the Transformer as a robust choice for Pashto speech recognition, emphasizing its adaptability to limited-data conditions. It fills a gap in ASR research for the Pashto language, contributing to the advancement of speech recognition technology in under-resourced languages. The study highlights the potential for further improvement with increased training data. The findings underscore the importance of fine-tuning and dataset augmentation in enhancing model performance and reducing error rates.

  • Research Article
  • Cite Count: 2
  • 10.1186/s13636-025-00395-5
Comparative performance analysis of end-to-end ASR models on Indo-Aryan and Dravidian languages within India’s linguistic landscape
  • Feb 24, 2025
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Palash Jain + 1 more

India’s linguistic diversity encompasses multiple language families, including the Indo-Aryan and Dravidian, which represent distinct phonological and morphological characteristics. This study aims to evaluate and compare the performance of end-to-end automatic speech recognition (ASR) systems for three Indo-Aryan languages—Marathi, Odia, and Gujarati—and three Dravidian languages—Tamil, Telugu, and Malayalam. Using four transformer-based pre-trained models—Wav2Vec2.0-base, XLSR-53, W2V2-BERT, and Whisper small—the analysis explores their adaptability to these languages’ linguistic features, with word error rate (WER) and character error rate (CER) serving as evaluation metrics. Results indicate that W2V2-BERT and XLSR-53 outperform other models, achieving lower WER and CER, especially for Indo-Aryan languages. However, higher error rates for Dravidian languages highlight challenges such as complex phonology and agglutinative morphology. This work provides a comparative insight into the strengths and limitations of pre-trained ASR models across India’s diverse linguistic landscape and underscores the need for language-specific adaptations to improve ASR accuracy for underrepresented languages.
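
For context, a hedged sketch of how a pre-trained CTC checkpoint of this kind is typically run and scored with the Hugging Face transformers library and jiwer; the English wav2vec2-large-960h checkpoint is a stand-in, since the paper's own fine-tuned models for the listed Indian languages are not reproduced here:

```python
import torch
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Placeholder checkpoint with a CTC head and tokenizer attached
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

def transcribe(waveform, sample_rate: int = 16_000) -> str:
    """waveform: 1-D float array (e.g. numpy) sampled at 16 kHz."""
    inputs = processor(waveform, sampling_rate=sample_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)      # greedy CTC decoding
    return processor.batch_decode(ids)[0]

# hypothesis = transcribe(audio_array)
# print(jiwer.wer(reference, hypothesis), jiwer.cer(reference, hypothesis))
```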

  • Book Chapter
  • Cite Count: 12
  • 10.1007/978-3-642-41491-6_29
A New Word Language Model Evaluation Metric for Character Based Languages
  • Jan 1, 2013
  • Peilu Wang + 3 more

Perplexity is a widely used measure for evaluating the word-prediction power of a word-based language model. It can be computed independently and has shown good correlation with word error rate (WER) in speech recognition. However, for character-based languages, character error rate (CER) is commonly used instead of WER as the measure for speech recognition, although the language model is still word-based. Because different word segmentation strategies may produce different word vocabularies for the same text corpus, word-based perplexity is in many cases inadequate for evaluating the combined effect of word segmentation and language model training on the final CER. In this paper, a new word-based language model evaluation measure is proposed that accounts for the effect of word segmentation and the goal of predicting CER. Experiments were conducted on Chinese speech recognition. Compared to traditional word-based perplexity, the new measure is more robust to word segmentation and shows a much more consistent correlation with CER in a large-vocabulary continuous Chinese speech recognition task.
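
For reference, the conventional word-based perplexity that the abstract contrasts with CER is defined over a test sequence of N words as follows; the paper's proposed segmentation-aware measure is not reproduced here:

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\!\left( w_i \mid w_1, \dots, w_{i-1} \right) \right)
```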

  • Preprint Article
  • 10.1101/2025.03.26.25324592
Transcribing multilingual radiologist-patient dialogue into mammography reports using AI: A step towards patient-centric radiology
  • Mar 26, 2025
  • Amit Gupta + 4 more

Background: Radiology reports are primarily designed for healthcare professionals and often contain complex medical terminology that hinders patients from understanding their diagnostic results. This communication gap is especially pronounced in non-English-speaking regions. AI-driven transcription and report generation, leveraging automatic speech recognition (ASR) and large language models (LLMs), could enable patient-centered, accessible reporting from radiologist-patient conversations in vernacular language.

Purpose: To evaluate the feasibility of AI-driven transcription and automated mammography report generation from simulated radiologist-patient conversations in vernacular language, assessing transcription accuracy, report concordance, error patterns, and time efficiency.

Materials and Methods: A curated dataset of 50 mammograms was retrospectively selected from the Picture Archiving and Communication System (PACS) of our department. Simulated radiologist-patient conversations, conducted in vernacular Hindi, were recorded and transcribed using the OpenAI Whisper large-v2 ASR model. Four transcriptions per conversation were generated at different temperatures (0, 0.3, 0.5, 0.7) to maximize information capture. Structured mammography reports were generated from the transcriptions using GPT-4o, guided by detailed prompt instructions. Reports were reviewed and corrected by a radiologist, and AI performance was assessed through word error rate (WER), character error rate (CER), report concordance rates, error analysis, and time-efficiency metrics.

Results: The lowest WER (0.577) and CER (0.379) were observed at temperature 0. The overall mean concordance rate between AI-generated and radiologist-edited reports was 0.94, with structured fields achieving higher concordance than descriptive fields. Errors were present in 50% of AI-generated reports, predominantly missed and incorrect information, with a higher error rate in malignant cases. The mean time for AI-driven report generation was 207.4 seconds, with radiologist editing contributing 43.1 seconds on average.

Conclusion: An AI-driven workflow integrating ASR and LLMs to generate structured mammography reports from radiologist-patient conversations in vernacular language is feasible. While challenges such as privacy, validation, and scalability remain, this approach represents a significant step toward patient-centric and AI-integrated radiology practice.
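
A minimal sketch of the multi-temperature transcription pass described above, using the open-source openai-whisper package; the audio file name is a placeholder, and the exact decoding options used in the study are assumptions:

```python
import whisper

model = whisper.load_model("large-v2")
temperatures = (0.0, 0.3, 0.5, 0.7)

# One transcription per temperature, as in the study's four-pass setup
transcripts = {
    t: model.transcribe("consultation_hi.wav", language="hi",
                        temperature=t)["text"]
    for t in temperatures
}
for t, text in transcripts.items():
    print(f"temperature={t}: {text[:80]}")
```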

  • Book Chapter
  • Cite Count: 6
  • 10.1007/978-3-030-36204-1_15
Improved CTC-Attention Based End-to-End Speech Recognition on Air Traffic Control
  • Jan 1, 2019
  • Kai Zhou + 4 more

Recently, many end-to-end speech recognition systems have been proposed that aim to directly transcribe speech to text without any predefined alignments. In this paper, we improve the architecture of the joint CTC-attention based encoder-decoder model for Mandarin speech recognition on an Air Traffic Control (ATC) speech recognition task. Our improved system includes a VGG-BLSTM based encoder, an attention-based LSTM decoder with joint CTC decoding, and an LSTM-based ATC language model. In addition, several tricks are used for effective model training, including L2 regularization, attention smoothing, and frame skipping. We compare our improved model with three other popular end-to-end systems on the ATC corpus. Results show that our improved CTC-attention model outperforms the CTC, attention, and original CTC-attention models without any tricks or language model. Taking these tricks together, we finally achieve a character error rate (CER) of 13.15% and a sentence error rate (SER) of 33.43% on the ATC dataset, while together with an LSTM language model, CER and SER reach 11.01% and 22.75%, respectively.
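
The joint CTC-attention training objective that this system builds on is conventionally written as a weighted multi-task loss, with an interpolation weight λ between the CTC and attention branches; the paper's exact weighting is not stated here:

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1
```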

  • Conference Article
  • Cite Count: 44
  • 10.1109/icassp.2019.8682890
Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders
  • May 1, 2019
  • Shigeki Karita + 5 more

We introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech-only and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech-only and text-only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, they aim to encode features compatible with the ASR and TTS models through a multi-task loss. Additionally, we anticipate that joint TTS training can also improve ASR performance, because both ASR and TTS models learn transformations between speech and text. The experimental results we obtained with semi-supervised end-to-end ASR/TTS training show that, relative to a model initially trained with a small paired subset of the LibriSpeech corpus, retraining with a large unpaired subset of the corpus reduced the character error rate from 10.4% to 8.4% and the word error rate from 20.6% to 18.0%.

  • Research Article
  • Cite Count: 3
  • 10.56345/ijrdv9n301
A Model for Albanian Speech Recognition Using End-to-End Deep Learning Techniques
  • Jul 1, 2022
  • Interdisciplinary Journal of Research and Development
  • Amarildo Rista + 1 more

An end-to-end Automatic Speech Recognition (ASR) system folds the acoustic model (AM), language model (LM), and pronunciation model (PM) into a single neural network, and the joint optimization of all these components improves the performance of the model. In this paper, we introduce a model for Albanian speech recognition (SR) using end-to-end deep learning techniques. The two main modules of this model are Residual Convolutional Neural Networks (ResCNN), which aim to learn the relevant features, and Bidirectional Recurrent Neural Networks (BiRNN), which leverage the learned ResCNN audio features. To train and evaluate the model, we built a corpus for Albanian Speech Recognition (CASR), which contains 100 hours of audio data along with their transcripts. During the design of the corpus we took into account speaker attributes such as age, gender, accent, speed of utterance, and dialect, so that the corpus is as heterogeneous as possible. The model is evaluated with word error rate (WER) and character error rate (CER) metrics, achieving 5% WER and 1% CER.

  • Conference Article
  • Cite Count: 1
  • 10.1117/12.2594212
Expanding smart assistant accessibility through dysarthria speech-trained transformer networks
  • Aug 1, 2021
  • Daniel Adams + 1 more

Smart assistant usage has increased significantly with the AI boom and the growth of IoT. Speech as an input modality brings a level of personalization to the various smart voice assistant products and applications; however, many smart assistants underperform when tasked with interpreting atypical speech input. Dysarthria, heavy accents, and deaf and hard-of-hearing speech characteristics prove difficult for smart assistants to interpret despite the large amounts of diverse data used to train automatic speech recognition models. In this study, we explore the Transformer architecture for use as an automatic speech recognition model for speech with medium to low intelligibility scores. We utilize the Transformer model pre-trained on the Librispeech dataset and fine-tuned on the Torgo dataset of atypical speech, as well as a subset of the University of Memphis Speech Perception Assessment Laboratory's (UMemphis SPAL) Deaf speech dataset. We also develop a methodology for performing automatic speech recognition using a Node.js application running on a Raspberry Pi 4, which functions as a pipeline between the user and a Google Home smart assistant device. The highest-performing Transformer model shows a 20.2% character error rate with a corresponding 29.0% word error rate on a subset of medium-intelligibility audio samples from the UMemphis SPAL dataset. This study highlights the importance of a large transcribed dataset, fueling a large atypical-speech data-gathering effort through a newly developed web application, My-Voice.

  • Research Article
  • Cite Count: 4
  • 10.7717/peerj-cs.1981
Customized deep learning based Turkish automatic speech recognition system supported by language model.
  • Apr 3, 2024
  • PeerJ Computer Science
  • Yasin Görmez

In today's world, numerous applications integral to various facets of daily life include automatic speech recognition methods. Thus, the development of a successful automatic speech recognition system can significantly augment the convenience of people's daily routines. While many automatic speech recognition systems have been established for widely spoken languages like English, there has been insufficient progress in developing such systems for less common languages such as Turkish. Moreover, due to Turkish's agglutinative structure, designing a speech recognition system for it presents greater challenges than for other language groups. Therefore, our study focused on proposing deep learning models for automatic speech recognition in Turkish, complemented by the integration of a language model. In our study, deep learning models were formulated by incorporating convolutional neural networks, gated recurrent units, long short-term memory units, and transformer layers. The Zemberek library was employed to craft the language model and improve system performance. Furthermore, the Bayesian optimization method was applied to fine-tune the hyper-parameters of the deep learning models. To evaluate the model's performance, standard metrics widely used in automatic speech recognition systems, specifically word error rate and character error rate, were employed. The experimental results show that, with optimal hyper-parameters applied to models developed with various layers, the scores are as follows: without a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset yields a word error rate of 11.5 and a character error rate of 4.15. Upon incorporating the language model, notable improvements were observed: for the Turkish Microphone Speech Corpus dataset, the word error rate decreased to 9.85 and the character error rate to 5.35; similarly, for the Turkish Speech Corpus dataset, the word error rate improved to 8.4 and the character error rate decreased to 2.7. These results demonstrate that our model outperforms those reported in the existing literature.
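
A hedged sketch of Bayesian hyper-parameter tuning of the kind described above, here using Optuna's default TPE sampler as one concrete implementation; the search space and the synthetic scoring function are illustrative assumptions, not the paper's actual configuration:

```python
import optuna

def train_and_score(lr: float, hidden: int, layers: int) -> float:
    # Placeholder for training the ASR model and returning validation
    # WER; a synthetic score keeps the sketch runnable.
    return abs(lr - 1e-3) * 1_000 + layers * 0.1 + hidden / 10_000

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    hidden = trial.suggest_categorical("hidden", [256, 512, 1024])
    layers = trial.suggest_int("layers", 2, 6)
    return train_and_score(lr, hidden, layers)  # lower WER is better

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print("best hyper-parameters:", study.best_params)
```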

  • Research Article
  • Cite Count: 2
  • 10.1016/j.ijmedinf.2025.106029
Deep learning-based in-ambulance speech recognition and generation of prehospital emergency diagnostic summaries using LLMs.
  • Nov 1, 2025
  • International journal of medical informatics
  • Chen Chen + 8 more

  • Conference Article
  • Cite Count: 9
  • 10.1109/indicon56171.2022.10039926
Automatic Correction of Speech Recognized Mathematical Equations using Encoder-Decoder Attention Model
  • Nov 24, 2022
  • Y Mounika + 4 more

Automated correction of common and systematic errors produced by an ASR system is part of the post-processing of automatic speech recognition (ASR). The output of an ASR system is prone to grammatical, spelling, and phonetic problems. Sentences output by ASR models have been refined using error correction approaches to obtain a lower word error rate (WER) and character error rate (CER) than the initial ASR outputs. In this paper we propose a model that reduces the word error rate and character error rate of speech-recognized mathematical equations. The proposed model is an encoder-decoder model with attention.

  • Conference Article
  • Cite Count: 66
  • 10.1109/icassp.2018.8462492
Attention-Based End-to-End Speech Recognition on Voice Search
  • Apr 1, 2018
  • Changhao Shan + 3 more

Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder models to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary, and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise, and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. Together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.

  • Research Article
  • Cite Count: 1
  • 10.1016/j.apacoust.2024.110408
Development of a code-switched Hindi-Marathi dataset and transformer-based architecture for enhanced speech recognition using dynamic switching algorithms
  • Nov 26, 2024
  • Applied Acoustics
  • P Hemant + 1 more
