Retrospective Analysis of Clinical Performance of an Estonian Speech Recognition System for Radiology: Effects of Different Acoustic and Language Models

Abstract

The aim of this study was to retrospectively analyze the influence of different acoustic and language models in order to determine which have the most important effects on the clinical performance of a non-commercial, radiology-oriented automatic speech recognition (ASR) system for the Estonian language. An ASR system was developed for Estonian in the radiology domain using open-source software components (Kaldi toolkit, Thrax). The ASR system was trained with real radiology text reports and dictations collected during the development phases. The final version of the ASR system was tested by 11 radiologists, who dictated 219 reports in total, in a spontaneous manner, in a real clinical environment. The audio files collected in the final phase were used to measure the performance of different versions of the ASR system retrospectively. ASR system versions were evaluated by word error rate (WER) for each speaker and modality, and by the WER difference between the first and the last version of the ASR system. The total average WER across all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), which corresponds to a relative improvement of 68.5%. WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results for all modalities and being independent of the user, the complexity of the radiology reports, user experience, and speech characteristics.
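
The two metrics reported above can be illustrated with a minimal Python sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and whitespace tokenization; the function names are hypothetical and this is not the evaluation script used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(wer_before: float, wer_after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (wer_before - wer_after) / wer_before


# Figures from the abstract: v1 = 18.4% WER, v8 = 5.8% WER.
print(round(100 * relative_improvement(18.4, 5.8), 1))  # 68.5
```

The relative figure is computed against the v1 error rate, which is why an absolute drop of 12.6 percentage points corresponds to 68.5%.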

Similar Papers
  • Research Article
  • Cite Count Icon 3
  • 10.2174/2210327911666210118143758
An Investigation of Multilingual TDNN-BLSTM Acoustic Modeling for Hindi Speech Recognition
  • Jan 1, 2022
  • International Journal of Sensors, Wireless Communications and Control
  • Ankit Kumar + 1 more

Background: Thousands of languages and dialects are in use in India, and most of them are low-resource. A well-performing Automatic Speech Recognition (ASR) system for Indian languages is unavailable due to a lack of resources. Hindi is one such language, as large-vocabulary Hindi speech datasets are not freely available. We have only a few hours of transcribed Hindi speech. Creating a well-transcribed speech dataset takes considerable time and money. Thus, developing a real-time ASR system from a few hours of training data is a challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to address the shortage of speech data. In this paper, we examine the effect of multilingual acoustic modeling in ASR systems for the Hindi language. Objective: This article's objective is to develop a high-accuracy Hindi ASR system with a reasonable computational load using a few hours of training data. Method: To achieve this goal, we used multilingual training with Time Delay Neural Network-Bidirectional Long Short-Term Memory (TDNN-BLSTM) acoustic modeling. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages. The common practice is to train the acoustic model by merging data from similar languages. In this work, we use three Indian languages, namely Hindi, Marathi, and Bengali. The proposed model was trained on 2.5 hours of Hindi, 5.5 hours of Marathi, and 28.5 hours of Bengali transcribed data. Results: The Kaldi toolkit was used to perform all the experiments. The investigation covers three main points. First, we present the monolingual ASR system using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling helps to improve the ASR performance further. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER) (an absolute WER reduction of 2% for Hindi and 3% for Marathi). For all three languages, the proposed TDNN-BLSTM-A multilingual acoustic models give the lowest WER. Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER of 8.65% was recorded for Hindi ASR. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model reports the best WERs of 30.40% and 10.85%, respectively.

  • Research Article
  • Cite Count Icon 17
  • 10.1109/taslp.2014.2303295
Theoretical Analysis of Diversity in an Ensemble of Automatic Speech Recognition Systems
  • Mar 1, 2014
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Kartik Audhkhasi + 3 more

Diversity or complementarity of automatic speech recognition (ASR) systems is crucial for achieving a reduction in word error rate (WER) upon fusion using the ROVER algorithm. We present a theoretical proof explaining this often-observed link between ASR system diversity and ROVER performance. This is in contrast to many previous works that have only presented empirical evidence for this link or have focused on designing diverse ASR systems using intuitive algorithmic modifications. We prove that the WER of the ROVER output approximately decomposes into a difference of the average WER of the individual ASR systems and the average WER of the ASR systems with respect to the ROVER output. We refer to the latter quantity as the diversity of the ASR system ensemble because it measures the spread of the ASR hypotheses about the ROVER hypothesis. This result explains the trade-off between the WER of the individual systems and the diversity of the ensemble. We support this result through ROVER experiments using multiple ASR systems trained on standard data sets with the Kaldi toolkit. We use the proposed theorem to explain the lower WERs obtained by ASR confidence-weighted ROVER as compared to word frequency-based ROVER. We also quantify the reduction in ROVER WER with increasing diversity of the N-best list. We finally present a simple discriminative framework for jointly training multiple diverse acoustic models (AMs) based on the proposed theorem. Our framework generalizes and provides a theoretical basis for some recent intuitive modifications to well-known discriminative training criterion for training diverse AMs.
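
The decomposition described in this abstract can be written out as follows. The notation is an assumption introduced here for illustration: $N$ is the number of systems, $h_i$ the hypothesis of system $i$, $h_R$ the ROVER output, $r$ the reference transcript, and $\mathrm{WER}(a, b)$ the word error rate of $a$ measured against $b$. The paper's own symbols and normalization may differ.

```latex
\mathrm{WER}(h_R, r) \;\approx\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N}\mathrm{WER}(h_i, r)}_{\text{average individual WER}}
\;-\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N}\mathrm{WER}(h_i, h_R)}_{\text{diversity of the ensemble}}
```

Read this way, the trade-off mentioned in the abstract is visible directly: the fused output improves when the individual systems are accurate (small first term) and when their hypotheses are spread around the ROVER hypothesis (large second term).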

  • Research Article
  • Cite Count Icon 21
  • 10.1016/j.apacoust.2020.107386
Automatic speech recognition system with pitch dependent features for Punjabi language on KALDI toolkit
  • May 5, 2020
  • Applied Acoustics
  • Jyoti Guglani + 1 more

  • Conference Article
  • Cite Count Icon 21
  • 10.1109/asru.2007.4430114
Non-native pronunciation variation modeling using an indirect data driven method
  • Jan 1, 2007
  • Mina Kim + 2 more

In this paper, we propose a pronunciation variation modeling method for improving the performance of a non-native automatic speech recognition (ASR) system that does not degrade the performance of a native ASR system. The proposed method is based on an indirect data-driven approach, where pronunciation variability is investigated from the training speech data, and variant rules are subsequently derived and applied to compensate for variability in the ASR pronunciation dictionary. To this end, native utterances are first recognized by using a phoneme recognizer, and then the variant phoneme patterns of native speech are obtained by aligning the recognized and reference phonetic sequences. The reference sequences are transcribed by using each of the canonical, knowledge-based, and hand-labeled methods. As with native speech, the variant phoneme patterns of non-native speech can also be obtained by recognizing non-native utterances and comparing the recognized phoneme sequences with the reference phonetic transcriptions. Finally, variant rules are derived from native and non-native variant phoneme patterns using decision trees and applied to the adaptation of a dictionary for non-native and native ASR systems. In this paper, Korean spoken by Chinese native speakers is considered as the non-native speech. Non-native ASR experiments show that an ASR system using the dictionary constructed by the proposed pronunciation variation modeling method reduces the average word error rate (WER) by a relative 18.5% compared to the baseline ASR system using a canonically transcribed dictionary. In addition, the WER of a native ASR system using the proposed dictionary is reduced by a relative 1.1% compared to the baseline native ASR system with a canonically constructed dictionary.
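
The alignment step described above (comparing recognized and reference phoneme sequences to collect variant patterns) can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard library's `difflib.SequenceMatcher`, which finds matching blocks rather than a guaranteed minimum-edit-distance alignment, the phoneme sequences are invented, and the decision-tree rule derivation from the paper is not reproduced.

```python
from collections import Counter
from difflib import SequenceMatcher


def variant_patterns(reference, recognized):
    """Return (reference_phones, recognized_phones) pairs where the sequences differ."""
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    patterns = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, delete, or insert
            patterns.append((tuple(reference[i1:i2]), tuple(recognized[j1:j2])))
    return patterns


# Hypothetical (reference, recognized) phoneme sequences, invented for illustration.
pairs = [
    (["s", "ah", "r", "aa", "m"], ["s", "ah", "l", "aa", "m"]),
    (["r", "aa", "m", "y", "eo", "n"], ["l", "aa", "m", "y", "eo", "n"]),
]

counts = Counter()
for ref, hyp in pairs:
    counts.update(variant_patterns(ref, hyp))

# Frequent patterns are the candidates from which variant rules would be derived.
print(counts.most_common(1))
```

Counting patterns across many utterances is the simplest possible stand-in for rule derivation; a real system would also condition each pattern on its phonetic context.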

  • Conference Article
  • Cite Count Icon 8
  • 10.1109/icassp.2008.4518601
Acoustic and pronunciation model adaptation for context-independent and context-dependent pronunciation variability of non-native speech
  • Mar 1, 2008
  • Yoo Rhee Oh + 2 more

In this paper, we propose an acoustic and pronunciation model adaptation method for context-independent (CI) and context-dependent (CD) pronunciation variability to improve the performance of a non-native automatic speech recognition (ASR) system. The proposed adaptation method is performed in three steps. First, we perform phone recognition to obtain an n-best list of phoneme sequences and derive pronunciation variant rules by using a decision tree. Second, the pronunciation variant rules are decomposed into CI and CD pronunciation variation on the basis of context dependency. That is, the pronunciation variant rules that are dedicated to specific phoneme sequences are classified as CI pronunciation variation, while the others are classified as CD pronunciation variation. It is assumed here that CI and CD pronunciation variabilities are caused, respectively, by a pronunciation space that differs from that of the non-native speaker's mother tongue and by coarticulation effects in context. Third, the acoustic model adaptation is performed in a state-tying step for the CI pronunciation variability from an indirect data-driven method. In addition, the pronunciation model adaptation is completed by constructing a multiple pronunciation dictionary using the CD pronunciation variability. Continuous Korean-English ASR experiments show that the proposed method reduces the average word error rate (WER) by 16.02% compared with the baseline ASR system trained on native speech. Moreover, an ASR system using the proposed method provides average WER reductions of 8.95% and 3.67% compared to acoustic model adaptation alone and pronunciation model adaptation alone, respectively.

  • Conference Article
  • Cite Count Icon 27
  • 10.1109/icassp.2019.8683086
Analyzing Uncertainties in Speech Recognition Using Dropout
  • May 1, 2019
  • Apoorv Vyas + 3 more

The performance of Automatic Speech Recognition (ASR) systems is often measured using Word Error Rate (WER), which requires time-consuming and expensive manually transcribed data. In this paper, we use state-of-the-art ASR systems based on Deep Neural Networks (DNN) and propose a novel framework which uses "Dropout" at test time to model uncertainty in prediction hypotheses. We systematically exploit this uncertainty to estimate WER without the need for explicit transcriptions. In addition, we show that the predictive uncertainty can also be used to accurately localize the errors made by the ASR system. We study the performance of our approach on the Switchboard database, where it predicts WER accurately within a range of 2.6% and 5.0% for HMM-DNN and Connectionist Temporal Classification (CTC) ASR systems, respectively.
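
The idea of using test-time dropout as an uncertainty signal can be illustrated with a rough Python sketch. It assumes the hypotheses from several stochastic decoding passes are already available as strings, and it measures how far they drift from the deterministic hypothesis. The function names and example utterances are hypothetical, and this disagreement score is only a proxy; it is not the WER estimator proposed in the paper.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]


def disagreement(base_hypothesis, dropout_hypotheses):
    """Mean WER-like distance of the dropout hypotheses from the base hypothesis."""
    base = base_hypothesis.split()
    rates = [edit_distance(base, h.split()) / max(len(base), 1)
             for h in dropout_hypotheses]
    return sum(rates) / len(rates)


# Hypothetical outputs: one deterministic pass and three passes with dropout enabled.
base = "call me at nine"
samples = ["call me at nine", "call me at night", "all me at nine"]
print(disagreement(base, samples))
```

A stable utterance yields identical hypotheses and a score of zero, while an uncertain one produces scattered hypotheses and a higher score, which is the intuition behind using the spread to predict errors without a reference transcript.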

  • Conference Article
  • Cite Count Icon 12
  • 10.23919/apsipa.2018.8659622
Neural Speech-to-Text Language Models for Rescoring Hypotheses of DNN-HMM Hybrid Automatic Speech Recognition Systems
  • Nov 1, 2018
  • Tomohiro Tanaka + 3 more

In this paper, we propose to leverage end-to-end automatic speech recognition (ASR) systems for assisting deep neural network-hidden Markov model (DNN-HMM) hybrid ASR systems. The DNN-HMM hybrid ASR system, which is composed of an acoustic model, a language model and a pronunciation model, is known to be the most practical architecture in the ASR field. On the other hand, much attention has been paid in recent studies to end-to-end ASR systems that are fully composed of neural networks. It is known that they can yield comparable performance without introducing heuristic operations. However, one problem is that end-to-end ASR systems sometimes suffer from redundant generation and omission of important words in the text generation phase. This is because these systems cannot explicitly consider the connection between the input speech and the output text. Therefore, our idea is to regard the end-to-end ASR systems as neural speech-to-text language models (NS2TLMs) and to use them for rescoring hypotheses generated by the DNN-HMM hybrid ASR systems. This enables us to leverage the end-to-end ASR systems while avoiding the generation issues, because the DNN-HMM hybrid ASR systems can generate speech-aligned hypotheses. It is expected that the NS2TLMs improve the DNN-HMM hybrid ASR systems because the end-to-end ASR systems correctly handle short-duration utterances. In our experiments, we use state-of-the-art DNN-HMM hybrid ASR systems with convolutional and long short-term memory recurrent neural network acoustic models, and end-to-end ASR systems based on an attentional encoder-decoder. We demonstrate that our proposed method can yield better ASR performance than both the DNN-HMM hybrid ASR system and the end-to-end ASR system.
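
Hypothesis rescoring of the kind described above can be sketched in a few lines of Python. The sketch assumes each N-best entry carries a log-domain score from the hybrid system and another from the end-to-end model, and that the two are combined by a weighted sum. That combination rule, the weight, the function name, and the scores are all assumptions made for illustration; the abstract does not state how the scores are actually combined.

```python
def rescore(nbest, weight=0.5):
    """Pick the hypothesis with the best combined score.

    nbest: list of (text, hybrid_score, ns2tlm_score) tuples with
    log-domain scores, where higher means more likely.
    """
    return max(nbest, key=lambda h: h[1] + weight * h[2])[0]


# Hypothetical N-best list with invented scores.
nbest = [
    ("the patient has no fever", -12.0, -9.5),
    ("the patient has know fever", -11.5, -14.0),
]

print(rescore(nbest, weight=0.0))  # hybrid score only
print(rescore(nbest, weight=0.5))  # hybrid score plus rescoring model
```

Because the candidates come from the hybrid system, every hypothesis is already aligned to the speech; the second model only reorders them, so it cannot introduce the redundant or missing words that free-running end-to-end decoding can produce.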

  • Conference Article
  • Cite Count Icon 8
  • 10.1109/ncc.2019.8732237
Data-pooling and multi-task learning for enhanced performance of speech recognition systems in multiple low resourced languages
  • Feb 1, 2019
  • A Madhavaraj + 1 more

We present two approaches to improve the performance of automatic speech recognition (ASR) systems for Gujarati, Tamil and Telugu. In the first approach using data-pooling with phone mapping (DP-PM), a deep neural network (DNN) is trained to predict the senones for the target language; then we use the feature vectors and their alignments from other source languages to map the phones from the source to the target language. The lexicons of the source languages are then modified using this phone mapping and an ASR system for the target language is trained using both the target and the modified source data. This DP-PM approach gives relative improvements in word error rates (WER) of 5.1% for Gujarati, 3.1% for Tamil and 3.4% for Telugu, over the corresponding baseline figures. In the second approach using multi-task DNN (MT-DNN) modeling, we use feature vectors from all the languages and train a DNN with three output layers, each predicting the senones of one of the languages. Objective functions of the output layers are modified such that during training, only those DNN layers responsible for predicting the senones of a language are updated, if the feature vector belongs to that language. This MT-DNN approach achieves relative improvements in WER of 5.7%, 3.3% and 5.2% for Gujarati, Tamil and Telugu, respectively.

  • Research Article
  • Cite Count Icon 13
  • 10.1002/lary.31713
Quantification of Automatic Speech Recognition System Performance on d/Deaf and Hard of Hearing Speech.
  • Aug 19, 2024
  • The Laryngoscope
  • Robin Zhao + 3 more

To evaluate the performance of commercial automatic speech recognition (ASR) systems on d/Deaf and hard-of-hearing (d/Dhh) speech. A corpus containing 850 audio files of d/Dhh and normal hearing (NH) speech from the University of Memphis Speech Perception Assessment Laboratory was tested on four speech-to-text application program interfaces (APIs): Amazon Web Services, Microsoft Azure, Google Chirp, and OpenAI Whisper. We quantified the Word Error Rate (WER) of API transcriptions for 24 d/Dhh and nine NH participants and performed subgroup analysis by speech intelligibility classification (SIC), hearing loss (HL) onset, and primary communication mode. Mean WER averaged across APIs was 10 times higher for the d/Dhh group (52.6%) than the NH group (5.0%). APIs performed significantly worse for "low" and "medium" SIC (85.9% and 46.6% WER, respectively) as compared to "high" SIC group (9.5% WER, comparable to NH group). APIs performed significantly worse for speakers with prelingual HL relative to postlingual HL (80.5% and 37.1% WER, respectively). APIs performed significantly worse for speakers primarily communicating with sign language (70.2% WER) relative to speakers with both oral and sign language communication (51.5%) or oral communication only (19.7%). Commercial ASR systems underperform for d/Dhh individuals, especially those with "low" and "medium" SIC, prelingual onset of HL, and sign language as primary communication mode. This contrasts with Big Tech companies' promises of accessibility, indicating the need for ASR systems ethically trained on heterogeneous d/Dhh speech data. 3 Laryngoscope, 135:191-197, 2025.

  • Research Article
  • Cite Count Icon 25
  • 10.1016/j.specom.2008.05.004
Combined speech enhancement and auditory modelling for robust distributed speech recognition
  • May 20, 2008
  • Speech Communication
  • Ronan Flynn + 1 more

  • Conference Article
  • Cite Count Icon 129
  • 10.1109/icassp.1998.675366
Incorporating information from syllable-length time scales into automatic speech recognition
  • Jan 1, 1998
  • Su-Lin Wu + 3 more

Including information distributed over intervals of syllabic duration (100-250 ms) may greatly improve the performance of automatic speech recognition (ASR) systems. ASR systems primarily use representations and recognition units covering phonetic durations (40-100 ms). Humans certainly use information at phonetic time scales, but results from psychoacoustics and psycholinguistics highlight the crucial role of the syllable, and syllable-length intervals, in speech perception. We compare the performance of three ASR systems: a baseline system that uses phone-scale representations and units, an experimental system that uses a syllable-oriented front-end representation and syllabic units for recognition, and a third system that combines the phone-scale and syllable-scale recognizers by merging and rescoring N-best lists. Using the combined recognition system, we observed an improvement in word error rate for telephone-bandwidth, continuous numbers from 6.8% to 5.5% on a clean test set, and from 27.8% to 19.6% on a reverberant test set, over the baseline phone-based system.

  • Book Chapter
  • Cite Count Icon 2
  • 10.5772/4753
Autocorrelation-based Methods for Noise-Robust Speech Recognition
  • Jun 1, 2007
  • Gholamreza Farahani + 2 more

In this chapter, the importance of autocorrelation domain in robust feature extraction for speech recognition was discussed. To prove the effectiveness of this domain, some recently proposed methods for robust feature extraction against additive noise were discussed. These methods resulted in cepstral feature sets derived from the autocorrelation spectral domain. The DAS algorithm used the differentiated filtered autocorrelation spectrum of the noisy signal to extract cepstral parameters. We noted that similar to RAS and DPS, DAS can better

  • Research Article
  • 10.1080/03772063.2015.1119660
Dynamic Pronunciation Modelling for Unsupervised Learning of ASR Systems
  • Apr 26, 2016
  • IETE Journal of Research
  • Akella Amarendra Babu + 2 more

There is a large gap between the capabilities of human beings and automatic speech recognition (ASR) systems in recognizing pronunciation variations. ASR systems learn from a labelled speech corpus, whereas humans use "Everyday Speech" for adapting to pronunciation variability. Labelling a huge speech corpus in real time is impracticable, expensive, and time-consuming. In this paper, we present an algorithm using unsupervised learning techniques for adapting to the easily available "Everyday Speech". The algorithm is implemented in Java. The data sets are extracted from the CMUDICT pronunciation dictionary, the TIMIT database, and "The Hindu" daily newspaper. The results show a significant improvement in word error rate (WER) measurements over the existing ASR system. The addition of a dynamic pronunciation model enables the ASR system to learn from unlabelled "Everyday Speech" and makes it inexpensive and fast.

  • Research Article
  • Cite Count Icon 44
  • 10.1007/s10772-018-9497-6
Continuous Punjabi speech recognition model based on Kaldi ASR toolkit
  • Feb 16, 2018
  • International Journal of Speech Technology
  • Jyoti Guglani + 1 more

In this paper, a continuous Punjabi speech recognition model is presented using the Kaldi toolkit. For speech recognition, Mel frequency cepstral coefficient (MFCC) features and perceptual linear prediction (PLP) features were extracted from Punjabi continuous speech samples. The performance of the automatic speech recognition (ASR) system for both monophone and triphone models (i.e., the tri1, tri2 and tri3 models) using an N-gram language model is reported. The performance of the ASR system was computed in terms of word error rate (WER). A significant reduction in WER was observed using the triphone models over the monophone model. Also, the performance of ASR using the tri3 model is improved over the tri2 model, and the performance of the tri2 model is improved over the tri1 model. Further, it was found that MFCC features provide higher speech recognition accuracy than PLP features for continuous Punjabi speech.

  • Research Article
  • 10.1121/1.409548
Speech production models in automatic speech recognition—Forming a lasting marriage between speech science and speech technology
  • May 1, 1994
  • The Journal of the Acoustical Society of America
  • R. C. Rose + 3 more

At present, the performance of automatic speech recognition (ASR) systems is still limited by variabilities within and between speakers, by acoustic differences between training and application environments, and by the sensitivity of ASR systems to changing communication channels. This talk considers the conjecture that the use of speech-production models in ASR systems can contribute to making ASR systems more robust with respect to these sources of variability. Although it is well known that production-oriented representations of speech may be used to exploit the continuity of articulatory movements, several obstacles stand in the way of incorporating speech production models in recognizers. These include the difficult problem of acoustic-to-articulatory mapping, the huge complexity of searching an articulatory space, and the lack of sensitive diagnostic performance metrics for evaluating strengths and weaknesses of a particular production model. Several research laboratories are actively involved in efforts towards incorporating articulatory models in various forms in working ASR systems. In addition to summarizing this work, mechanisms will be suggested for stimulating closer interaction between researchers in production, perception, and processing.
