Development of a Deep Learning-Based Text-to-Speech System for the Malang Walikan Language Using the Pre-Trained SpeechT5 and HiFi-GAN Models
The Walikan language of Malang is a piece of local cultural heritage that needs to be preserved in the digital era. This study develops and evaluates a deep learning-based Text-to-Speech (TTS) system that generates Walikan speech using pre-trained SpeechT5 and HiFi-GAN models without fine-tuning: SpeechT5 converts text into mel-spectrograms, and HiFi-GAN acts as a vocoder that generates the audio signal from those mel-spectrograms. The dataset consists of 1,000 sentences in the Walikan language of Malang. Evaluation used the objective metrics Word Error Rate (WER) and Character Error Rate (CER), obtained by passing the synthetic audio and two types of reference audio through an Automatic Speech Recognition (ASR) system: original recordings of female speakers, made with controlled articulation, and of male speakers, using natural intonation from everyday conversation. The synthetic audio showed the highest error rates, with a WER of 0.9786 and a CER of 0.9024; the female reference audio scored a WER of 0.5471 and a CER of 0.1822, and the male reference audio a WER of 0.6311 and a CER of 0.2541. These findings indicate that, without fine-tuning, the TTS model cannot yet produce synthetic speech that an ASR system recognizes accurately, especially for regional languages absent from the models' original training data. Fine-tuning and the preparation of a more representative dataset are therefore essential if the TTS system is to support the preservation of Malang Walikan effectively in the digital era.
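The abstract does not include implementation details, but the described pipeline maps directly onto the public Hugging Face SpeechT5 and HiFi-GAN checkpoints. A minimal zero-shot sketch under that assumption follows; the checkpoint names, example sentence, random speaker embedding, and the jiwer scoring step are all illustrative, not taken from the paper:

```python
# Minimal zero-shot TTS sketch: SpeechT5 (text -> mel-spectrogram) + HiFi-GAN vocoder.
# Assumes the public Microsoft checkpoints; the paper does not name its exact weights.
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

text = "Ayas kera nang Ngalam"  # illustrative Walikan-style sentence, not from the dataset
inputs = processor(text=text, return_tensors="pt")

# SpeechT5 conditions on a 512-dim speaker x-vector; a random vector stands in
# here for an embedding extracted from a real reference recording.
speaker_embedding = torch.randn(1, 512)

# generate_speech runs the acoustic model and, when a vocoder is passed,
# returns the HiFi-GAN waveform directly (16 kHz).
waveform = tts.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("walikan_tts.wav", waveform.numpy(), samplerate=16000)

# Evaluation as in the paper: transcribe the audio with an ASR system and
# score the transcript against the reference text (jiwer is one common
# choice; the paper does not name its tooling).
import jiwer
hypothesis = "..."  # stand-in for the ASR transcript of walikan_tts.wav
print("WER:", jiwer.wer(text, hypothesis), "CER:", jiwer.cer(text, hypothesis))
```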
- Research Article
- 10.1109/taslp.2014.2303295
- Mar 1, 2014
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Diversity or complementarity of automatic speech recognition (ASR) systems is crucial for achieving a reduction in word error rate (WER) upon fusion using the ROVER algorithm. We present a theoretical proof explaining this often-observed link between ASR system diversity and ROVER performance. This is in contrast to many previous works that have only presented empirical evidence for this link or have focused on designing diverse ASR systems using intuitive algorithmic modifications. We prove that the WER of the ROVER output approximately decomposes into a difference of the average WER of the individual ASR systems and the average WER of the ASR systems with respect to the ROVER output. We refer to the latter quantity as the diversity of the ASR system ensemble because it measures the spread of the ASR hypotheses about the ROVER hypothesis. This result explains the trade-off between the WER of the individual systems and the diversity of the ensemble. We support this result through ROVER experiments using multiple ASR systems trained on standard data sets with the Kaldi toolkit. We use the proposed theorem to explain the lower WERs obtained by ASR confidence-weighted ROVER as compared to word frequency-based ROVER. We also quantify the reduction in ROVER WER with increasing diversity of the N-best list. We finally present a simple discriminative framework for jointly training multiple diverse acoustic models (AMs) based on the proposed theorem. Our framework generalizes and provides a theoretical basis for some recent intuitive modifications to well-known discriminative training criterion for training diverse AMs.
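In symbols (notation assumed here, not taken from the paper), with individual systems $S_1, \dots, S_N$ and ROVER output $R$, the stated decomposition reads:

```latex
\mathrm{WER}(R) \;\approx\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N} \mathrm{WER}(S_i)}_{\text{average individual WER}}
\;-\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N} \mathrm{WER}(S_i \mid R)}_{\text{ensemble diversity about } R}
```

where $\mathrm{WER}(S_i \mid R)$ scores hypothesis $S_i$ against $R$ rather than against the reference, so for a fixed average individual WER, a larger spread of hypotheses about $R$ predicts a lower fused WER.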
- Research Article
- 10.7717/peerj-cs.1981
- Apr 3, 2024
- PeerJ Computer Science
In today's world, numerous applications integral to various facets of daily life include automatic speech recognition methods. Thus, the development of a successful automatic speech recognition system can significantly augment the convenience of people's daily routines. While many automatic speech recognition systems have been established for widely spoken languages like English, there has been insufficient progress in developing such systems for less common languages such as Turkish. Moreover, due to its agglutinative structure, designing a speech recognition system for Turkish presents greater challenges compared to other language groups. Therefore, our study focused on proposing deep learning models for automatic speech recognition in Turkish, complemented by the integration of a language model. In our study, deep learning models were formulated by incorporating convolutional neural networks, gated recurrent units, long short-term memories, and transformer layers. The Zemberek library was employed to craft the language model to improve system performance. Furthermore, the Bayesian optimization method was applied to fine-tune the hyper-parameters of the deep learning models. To evaluate the model's performance, standard metrics widely used in automatic speech recognition systems, specifically word error rate and character error rate scores, were employed. Upon reviewing the experimental results, it becomes evident that when optimal hyper-parameters are applied to models developed with various layers, the scores are as follows: without a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset yields a word error rate of 11.5 and a character error rate of 4.15. Upon incorporating the language model, notable improvements were observed. Specifically, for the Turkish Microphone Speech Corpus dataset, the word error rate decreased to 9.85 and the character error rate lowered to 5.35. Similarly, the word error rate improved to 8.4 and the character error rate decreased to 2.7 for the Turkish Speech Corpus dataset. These results demonstrate that our model outperforms the studies found in the existing literature.
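The abstract names Bayesian optimization for hyper-parameter tuning without giving details. A hedged sketch of how such a search could be wired up with scikit-optimize's `gp_minimize` follows; the library choice, the search space, and the stand-in objective are all assumptions:

```python
# Hedged sketch: Bayesian optimization of ASR hyper-parameters with
# scikit-optimize. In practice the objective trains the CNN/GRU model
# and returns its dev-set WER; a cheap stand-in keeps this runnable.
import math
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(128, 512, name="gru_units"),
    Integer(1, 4, name="num_layers"),
]

def objective(params):
    lr, units, layers = params
    # Stand-in for a real train-and-score run; replace with actual model
    # training that returns the dev-set word error rate to minimize.
    return abs(math.log10(lr) + 3) * 0.1 + abs(units - 320) / 3200 + (layers - 2) ** 2 * 0.01

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best hyper-parameters:", result.x, "lowest (pseudo) dev WER:", result.fun)
```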
- Research Article
- 10.2174/2210327911666210118143758
- Jan 1, 2022
- International Journal of Sensors, Wireless Communications and Control
Background: In India, thousands of languages or dialects are in use. Most Indian languages are low-resource, and a well-performing Automatic Speech Recognition (ASR) system for Indian languages is unavailable due to this lack of resources. Hindi is one of them, as large-vocabulary Hindi speech datasets are not freely available; only a few hours of transcribed Hindi speech exist. Creating a well-transcribed speech dataset takes a great deal of time and money, so developing a real-time ASR system from a few hours of training data is a most challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to tackle this scarcity of speech data. In this paper, we examine the effect of multilingual acoustic modeling on ASR systems for the Hindi language. Objective: The objective of this article is to develop a Hindi ASR system with high accuracy and a reasonable computational load using a few hours of training data. Method: To achieve this goal we used multilingual training with Time Delay Neural Network-Bidirectional Long Short Term Memory (TDNN-BLSTM) acoustic modeling. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages. The common practice is to train the acoustic model by merging data from similar languages. In this work, we use three Indian languages: Hindi (2.5 hours of training data), Marathi (5.5 hours), and Bengali (28.5 hours of transcribed data). Results: The Kaldi toolkit was used for all experiments. The paper investigates three main points. First, we present monolingual ASR systems using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling further improves ASR performance. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER): an absolute 2% WER reduction for Hindi and 3% for Marathi. For all three languages, the proposed TDNN-BLSTM-A multilingual acoustic models yield the lowest WER. Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER recorded for Hindi ASR was 8.65%. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model achieves best WERs of 30.40% and 10.85%, respectively.
- Research Article
- 10.1080/02687038.2026.2621235
- Feb 16, 2026
- Aphasiology
Background The clinical adoption of automated speech recognition (ASR) systems in speech pathology for people with aphasia (PwA) is primarily limited by performance and stability issues. The possibility of detailed evaluation of such systems is critical, as clinical assessment of patients and their treatments depends heavily on accuracy and precision. Aims This study aims to address the limitations of ASR evaluation metrics, such as Word Error Rate (WER), by introducing a granular evaluation approach. The goal is to develop and apply a framework for analyzing ASR system performance specifically on speech phenomena relevant to clinical aphasia assessment, including disfluencies, grammatical errors, and filler words. English and French data are used to test ASR transcription performance for both languages. Methods & Procedures We present a novel evaluation framework and accompanying software tool to assess ASR models on aphasic speech. Using annotated transcripts from the AphasiaBank database, the framework measures performance across multiple linguistic phenomena by mapping ASR-generated transcriptions against reference CHAT-encoded transcripts. Key metrics such as Character Error Rate (CER) are computed per phenomenon using Levenshtein distance, enabling a fine-grained analysis of transcription accuracy. To demonstrate the variability in performance, we applied the framework to three different ASR models, highlighting fluctuations across speech phenomena. Outcomes & Results Evaluation of the Whisper Large-V3 model on the English AphasiaBank dataset revealed significant variability in transcription accuracy across speech phenomena. CER ranged from 25.28% on unannotated words to 87.82% on filled pauses, a delta of more than 62 percentage points. Prompting reduced the CER for filled pauses from 87.82% to 44.04%, demonstrating that task-specific tuning can yield substantial gains. Compared to control participants, PwA transcripts obtained CERs that were up to 20 percentage points higher, depending on the speech phenomenon. The largest gaps appeared in phonological and grammatical errors. In a multi-language evaluation, transcription performance on French aphasic speech was notably worse, with CERs averaging 32% higher than their English counterparts. Morphological errors in French reached up to 43.33%, compared to 29.60% in English, and semantic errors showed deltas exceeding 20 percentage points. Conclusions These indicators can help better understand the underlying behaviours of ASR models, thus offering better insight into their reliability. With this framework, ASR systems can be evaluated to detect weaknesses and highlight areas for improvement, ensuring their suitability for clinical use. By offering a granular analysis of these factors, the tool empowers clinicians and researchers to make informed decisions about integrating ASR systems into speech pathology workflows.
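To make the metric concrete, here is a minimal sketch of per-phenomenon CER via Levenshtein distance, as the framework computes it; the phenomenon tags and spans below are invented for illustration, whereas the actual tool derives them from CHAT annotations:

```python
# Sketch of the framework's core metric: character error rate computed from
# Levenshtein distance between an ASR hypothesis and a reference span.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Per-phenomenon scoring: each reference span is tagged with the phenomenon
# it exemplifies (tags and text here are invented for illustration).
spans = [("filled pause", "um well", "well"),
         ("unannotated", "the cat sat", "the cat sat")]
for phenomenon, ref, hyp in spans:
    print(f"{phenomenon}: CER = {cer(ref, hyp):.2%}")
```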
- Research Article
- 10.3233/jifs-213332
- Aug 10, 2022
- Journal of Intelligent & Fuzzy Systems
Speech recognition has become ubiquitous and plays an inevitable role in almost all sectors. Numerous works have been proposed on speech recognition; however, fully accurate transcriptions remain out of reach. A survey of studies on spell correction shows that considerable research has been carried out in this field, yet it remains a very challenging problem. This led to the need for a new spell-corrector framework capable of improving the performance of an automatic speech recognition (ASR) system. The proposed work unveils a state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) based spell-correction module developed on top of a deep recurrent neural network (RNN) based ASR system. The impact of BERT-based spell correction on the ASR system is evaluated on three datasets of different accents in terms of word error rate (WER), character error rate (CER), and Bilingual Evaluation Understudy (BLEU) score. The experimental results indicate that the enhanced spell-correction module is efficacious in detecting and correcting spelling errors, achieving a WER of 5.025% on the LibriSpeech corpus, 6.35% on VoxForge, and 7.05% on the NPTEL corpus.
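As an illustration of the general idea (not the authors' exact module), a masked-language-model pass over a suspect ASR token might look like this; the `bert-base-uncased` checkpoint, the example sentence, and the edit-distance filter are assumptions:

```python
# Hedged sketch of BERT-based spell correction over ASR output: mask a
# suspect token, let a masked language model propose replacements, and keep
# the candidate closest to the original token in string similarity.
import difflib
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

asr_output = "the whether is nice today"
suspect = "whether"
masked = asr_output.replace(suspect, fill_mask.tokenizer.mask_token, 1)

# Top masked-LM candidates for the blanked position.
candidates = [c["token_str"].strip() for c in fill_mask(masked, top_k=10)]

# Prefer a high-probability word that is also orthographically close to
# what the ASR system actually produced.
close = difflib.get_close_matches(suspect, candidates, n=1, cutoff=0.6)
corrected = asr_output.replace(suspect, close[0], 1) if close else asr_output
print(corrected)  # likely: "the weather is nice today"
```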
- Research Article
- 10.1007/s10772-020-09671-5
- Jan 22, 2020
- International Journal of Speech Technology
In this paper, the improvements in the recently implemented Kannada speech recognition system are demonstrated in detail. The Kannada automatic speech recognition (ASR) system consists of ASR models created with Kaldi, an IVRS call flow, and databases of weather and agricultural commodity price information. The task-specific speech data used in the recently developed spoken dialogue system contained high levels of various background noises, which adversely affected both online and offline speech recognition performance. Therefore, to improve recognition accuracy in the Kannada ASR system, a noise reduction algorithm is developed that fuses spectral subtraction with voice activity detection (SS-VAD) and a minimum mean square error spectrum power estimator based on zero crossing (MMSE-SPZC). The noise elimination algorithm is inserted into the system before the feature extraction stage. Alternative ASR models are created using subspace Gaussian mixture model (SGMM) and deep neural network (DNN) modeling techniques. The experimental results show that the fusion of the noise elimination technique with SGMM/DNN based modeling gives a relative accuracy improvement of 7.68% over the recently developed GMM-HMM based ASR system. The acoustic models with the lowest word error rate (WER) could be used in the spoken dialogue system. The developed spoken query system was tested by Karnataka farmers in an uncontrolled environment.
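For intuition about the spectral-subtraction half of the fusion, a bare-bones sketch follows; the window length, over-subtraction factor, and spectral floor are assumptions, and the paper's actual algorithm additionally couples this with VAD and the MMSE-SPZC estimator:

```python
# Illustrative spectral-subtraction front end of the kind the paper fuses
# with VAD and MMSE estimation (parameters are not the authors' settings).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_seconds=0.25, oversubtract=2.0, floor=0.02):
    f, t, spec = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude spectrum from the leading frames,
    # assuming the recording starts with noise only.
    n_frames = max(1, int(noise_seconds * fs / 256))  # 256 = hop for nperseg=512
    noise_mag = mag[:, :n_frames].mean(axis=1, keepdims=True)
    # Over-subtract the noise estimate and apply a spectral floor to
    # suppress musical-noise artifacts.
    clean_mag = np.maximum(mag - oversubtract * noise_mag, floor * mag)
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return clean

fs = 16000
noisy = np.random.randn(fs)  # stand-in for a real noisy recording
enhanced = spectral_subtraction(noisy, fs)
```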
- Research Article
- 10.1007/s10278-018-0085-8
- Apr 30, 2018
- Journal of Digital Imaging
The aim of this study was to retrospectively analyze the influence of different acoustic and language models in order to determine the most important effects on the clinical performance of a non-commercial, radiology-oriented automatic speech recognition (ASR) system for the Estonian language. The ASR system was developed for the radiology domain using open-source software components (the Kaldi toolkit, Thrax) and trained on real radiology text reports and dictations collected during the development phases. The final version was tested by 11 radiologists who dictated 219 reports in total, spontaneously and in a real clinical environment. The audio files collected in this final phase were used to measure the performance of earlier versions of the ASR system retrospectively. Versions were evaluated by word error rate (WER) for each speaker and modality, and by the WER difference between the first and last versions. The total average WER across all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), a relative improvement of 68.5%. WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results across modalities and remaining independent of user, the complexity of the radiology reports, user experience, and speech characteristics.
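The headline relative improvement follows directly from the two reported figures:

```latex
\frac{\mathrm{WER}_{v1} - \mathrm{WER}_{v8}}{\mathrm{WER}_{v1}}
= \frac{18.4\% - 5.8\%}{18.4\%} \approx 0.685 = 68.5\%
```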
- Conference Article
- 10.1109/asru.2007.4430114
- Jan 1, 2007
In this paper, we propose a pronunciation variation modeling method for improving the performance of a non-native automatic speech recognition (ASR) system without degrading the performance of a native ASR system. The proposed method is based on an indirect data-driven approach, where pronunciation variability is investigated from the training speech data and variant rules are subsequently derived and applied to compensate for variability in the ASR pronunciation dictionary. To this end, native utterances are first recognized by a phoneme recognizer, and the variant phoneme patterns of native speech are then obtained by aligning the recognized and reference phonetic sequences, with the reference sequences transcribed using each of the canonical, knowledge-based, and hand-labeled methods. In the same way, the variant phoneme patterns of non-native speech are obtained by recognizing non-native utterances and comparing the recognized phoneme sequences with the reference phonetic transcriptions. Finally, variant rules are derived from the native and non-native variant phoneme patterns using decision trees and applied to adapt the dictionary for non-native and native ASR systems. In this paper, Korean spoken by Chinese native speakers is considered as the non-native speech. Non-native ASR experiments show that an ASR system using the dictionary constructed by the proposed pronunciation variation modeling method reduces the average word error rate (WER) by a relative 18.5% compared to the baseline ASR system using a canonically transcribed dictionary. In addition, the WER of a native ASR system using the proposed dictionary is also reduced by a relative 1.1% compared to the baseline native ASR system with a canonically constructed dictionary.
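A minimal sketch of the pattern-extraction step the paper describes, aligning a recognized phoneme sequence against its reference transcription and collecting substitution patterns; the phoneme sequences here are invented, and the later decision-tree rule derivation is not shown:

```python
# Sketch of the indirect data-driven step: align recognized and reference
# phoneme sequences, then count variant patterns (phoneme substitutions)
# that a decision tree could later generalize into rules.
import difflib
from collections import Counter

reference  = ["k", "a", "n", "o", "n"]   # canonical transcription (illustrative)
recognized = ["k", "a", "m", "o", "n"]   # phoneme recognizer output (illustrative)

variants = Counter()
matcher = difflib.SequenceMatcher(a=reference, b=recognized)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "replace":  # a substitution: one candidate variant pattern
        variants[(tuple(reference[i1:i2]), tuple(recognized[j1:j2]))] += 1

print(variants)  # Counter({(('n',), ('m',)): 1}) -> candidate rule n -> m
```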
- Research Article
- 10.1080/10400435.2022.2061085
- Apr 16, 2022
- Assistive Technology
In recent years, rapid advancements have taken place in automatic speech recognition (ASR) systems and devices. Though ASR technologies have improved, the accessibility of these novel interaction systems is underreported, and they may present difficulties for people with speech impediments. In this article, we attempt to identify gaps in current research on the interaction between people with dysarthria and ASR systems and devices. We cover the period from 2011, when Siri (the first and leading commercial voice assistant) was launched, to 2020. The review employs an interaction framework in which each element (user, input, system, and output) contributes to the interaction process. To select the articles for review, we conducted a search of scientific databases and academic journals. A total of 36 studies met the inclusion criteria, which included use of the word error rate (WER) as a measurement for evaluating ASR systems. This review determines that challenges in interacting with ASR systems persist even in light of the most recent commercial technologies. Further, understanding of the entire interaction process remains limited; thus, to improve this interaction, the recent progress of ASR systems must be elucidated.
- Conference Article
- 10.1109/icdipc.2019.8723681
- May 1, 2019
Most on-device and cloud-based automatic speech recognition (ASR) systems have poor recognition performance on noisy speech corrupted by the various kinds of background noise, such as vehicle, train, aircraft, fan, wind, rain, air-conditioner, and machinery noise, that are unavoidable in realistic scenarios. In this paper, we propose a novel speech signal quality assessment (SSQA) method for automatically assessing the quality of a recorded speech signal before processing it on-device or sending it to the cloud server. The proposed method is based on spectrogram features and two-dimensional convolutional neural networks (2D-CNNs). The SSQA method is evaluated on a large collection of noise-free and noisy speech signals corrupted with various kinds of noise at different levels. The 2D-CNN based method achieved an average sensitivity (Se) of 90.92%, specificity (Sp) of 98.44%, and overall accuracy (OA) of 96.44%, and performed best at detecting noisy speech segments. The results also showed confusion in the manual labelling of noise-free versus noisy segments; therefore, the noise-free and noisy speech signals were passed to a publicly available ASR system, and the word error rate (WER) and character error rate (CER) metrics were used to determine the noise level at which the ASR system fails to recognize the text correctly. In this way, a noise level was determined for each noise type to label recorded speech into acceptable and unacceptable segments. The proposed quality-aware ASR system has great potential to extend the battery life of portable ASR devices and to reduce bandwidth and speech recognition software utilization costs for cloud-based ASR systems.
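A toy 2D-CNN classifier of the kind the abstract describes might look as follows; the layer sizes and the 64x64 spectrogram patch shape are assumptions, not the paper's architecture:

```python
# Minimal 2D-CNN for spectrogram-based quality assessment: a spectrogram
# patch goes in, a clean-vs-noisy decision comes out.
import torch
import torch.nn as nn

class SSQACnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, 2),  # logits: [noise-free, noisy]
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A batch of 64x64 log-spectrogram patches (batch, channel, freq, time).
patch = torch.randn(8, 1, 64, 64)
logits = SSQACnn()(patch)
print(logits.shape)  # torch.Size([8, 2])
```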
- Research Article
- 10.1080/03772063.2015.1119660
- Apr 26, 2016
- IETE Journal of Research
There is a large gap between the capabilities of human beings and automatic speech recognition (ASR) systems in recognizing pronunciation variations. ASR systems learn from labelled speech corpora, whereas humans adapt to pronunciation variability through "Everyday Speech". Labelling a huge speech corpus in real time is impracticable, expensive, and time-consuming. In this paper, we present an algorithm that uses unsupervised learning techniques to adapt to the easily available "Everyday Speech". The algorithm is implemented in Java. The data sets are extracted from the CMUDICT pronunciation dictionary, the TIMIT database, and "The Hindu" daily newspaper. The results show a significant improvement in word error rate (WER) over the existing ASR system. The addition of a dynamic pronunciation model enables the ASR system to learn from unlabelled "Everyday Speech", making it inexpensive and fast.
- Research Article
- 10.1016/j.eswa.2024.124119
- May 1, 2024
- Expert Systems With Applications
This study explores the feasibility of constructing a small-scale speech recognition system capable of competing with larger, modern automated speech recognition (ASR) systems in both performance and word error rate (WER). Our central hypothesis posits that a compact transformer-based ASR model can yield comparable results, specifically in terms of WER, to traditional ASR models while challenging contemporary ASR systems of significantly larger computational size. The aim is to extend ASR capabilities to under-resourced languages with limited corpora, catering to scenarios where practitioners face constraints in both data availability and computational resources. The model, comprising a compact convolutional neural network (CNN) and transformer architecture with 2.214 million parameters, challenges the conventional wisdom that large-scale transformer-based ASR systems are essential for achieving high accuracy. In comparison, contemporary ASR systems often deploy over 300 million parameters. Trained on a modest dataset of approximately 3000 h – significantly less than the 50,000 h used in larger systems – the proposed model leverages the Common Voice and LibriSpeech datasets. Evaluation on the LibriSpeech test-clean and test-other datasets produced character error rates (CERs) of 6.40% and 16.73% and WERs of 16.03% and 35.51%, respectively. Comparisons with existing architectures showcase the efficiency of our model. A gated recurrent unit (GRU) architecture, albeit achieving lower error rates, incurred a computational cost 24 times larger than our proposed model. Large-scale transformer architectures, while achieving marginally lower WERs (2%–4% on LibriSpeech test-clean), require 200 times more parameters and 53,000 additional hours of training data. Modern large language models can also improve WERs but require large computational resources. To further enhance performance, a small 4-gram language model was integrated into our end-to-end ASR model, resulting in improved WERs. The overarching goal of this work is to provide a practical solution for practitioners dealing with limited datasets and computational resources, particularly in the context of under-resourced languages.
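For a sense of scale, a compact CNN-plus-transformer CTC model in the paper's spirit can be sketched as below; the exact dimensions are assumptions and will not reproduce the reported 2.214M parameter count:

```python
# Sketch of a compact CNN + transformer encoder with a CTC head.
import torch
import torch.nn as nn

class CompactCtcAsr(nn.Module):
    def __init__(self, n_mels=80, d_model=144, vocab=32):
        super().__init__()
        # Convolutional front end downsamples time by 4x.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=576, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.ctc_head = nn.Linear(d_model, vocab)  # vocab includes the CTC blank

    def forward(self, mels):                      # (batch, n_mels, time)
        x = self.subsample(mels).transpose(1, 2)  # (batch, time/4, d_model)
        return self.ctc_head(self.encoder(x)).log_softmax(-1)

model = CompactCtcAsr()
params = sum(p.numel() for p in model.parameters())
print(f"{params / 1e6:.3f}M parameters")  # ~1.1M here; scale d_model/layers toward 2.214M
log_probs = model(torch.randn(2, 80, 400))  # two 4-second log-mel utterances
```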
- Conference Article
- 10.21437/interspeech.2021-198
- Aug 30, 2021
Bootstrapping speech recognition on limited data resources has long been an area of active research. The recent transition to all-neural models and end-to-end (E2E) training brought particular challenges, as these models are known to be data hungry, but also opportunities around language-agnostic representations derived from multilingual data and shared word-piece output representations across languages that share script and roots. We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) system in the low-resource regime, while exploiting the abundant resources available in other languages as well as synthetic audio from a text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements, allowing us to bootstrap a model for a new language with a fraction of the data that would otherwise be needed. The best system achieved a 46% relative word error rate (WER) reduction compared to the monolingual baseline, of which 25% relative WER improvement is attributed to the post-ASR text-to-text mappings and the TTS synthetic data.
- Conference Article
- 10.1109/indicon56171.2022.10039926
- Nov 24, 2022
Automated correction of the common and systematic errors made by automatic speech recognition (ASR) systems is part of ASR post-processing. ASR output is prone to grammatical, spelling, and phonetic problems, and error-correction approaches refine the output sentences to achieve a lower word error rate (WER) and character error rate (CER) than the initial ASR output. In this paper we propose a model that reduces the word error rate and character error rate of speech-recognized mathematical equations. The proposed model is an encoder-decoder model with attention.
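A toy version of such a corrector is sketched below; the dimensions, vocabulary, and data are invented, since the paper specifies only that the model is an encoder-decoder with attention:

```python
# Toy encoder-decoder-with-attention corrector: noisy ASR token ids in,
# corrected token ids out (trained with teacher forcing).
import torch
import torch.nn as nn

class Corrector(nn.Module):
    def __init__(self, vocab=100, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.seq2seq = nn.Transformer(d_model=d_model, nhead=4,
                                      num_encoder_layers=2, num_decoder_layers=2,
                                      batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask keeps the decoder from peeking at future tokens.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.seq2seq(self.embed(src_ids), self.embed(tgt_ids), tgt_mask=tgt_mask)
        return self.out(h)

src = torch.randint(0, 100, (4, 20))  # ASR output tokens, e.g. "two x plus too"
tgt = torch.randint(0, 100, (4, 20))  # reference equation tokens (teacher forcing)
logits = Corrector()(src, tgt)        # train with cross-entropy against shifted tgt
```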
- Book Chapter
- 10.1007/978-3-031-11644-5_22
- Jan 1, 2022
Analyzing teachers' discourse plays a fundamental role in educational research and is a key component of Teaching Analytics. This usually involves transcribing lessons from audio recordings. As the number of recordings grows, Automatic Speech Recognition (ASR) systems gain popularity as a means of transcribing them. However, most ASR systems are trained on very specific domains, usually involving read text and low environmental noise. This suggests that common ASR systems available on the market may underperform on classroom recordings, which present a unique type of environmental sound and spontaneous discourse, unlike the usual training domains. To address this challenge we present a system that transcribes classroom discourse automatically and robustly with regard to classroom noise, trained on few annotated data. In particular, we used a state-of-the-art ASR model based on wav2vec 2.0 and fine-tuned it on a 6-hour dataset of 4th to 8th grade Chilean lessons. We found that by leveraging its transformer-based architecture and shifting the fine-tuning domain to classroom recordings, we can obtain a more accurate and robust transcriber for this kind of audio, outperforming other popular cloud-based systems by up to 35% and 59% in terms of Word and Character Error Rates, respectively. This work contributes a tool built with state-of-the-art ASR techniques that is particularly adapted to classroom environments, making it robust and more reliable with regard to their environmental sound and the way teaching discourse is carried out. Keywords: Automatic speech recognition, Classroom discourse, Teaching analytics, Transfer learning
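For reference, inference with a wav2vec 2.0 CTC model through the Hugging Face API looks like the sketch below; the checkpoint name is illustrative, since the fine-tuned Chilean-classroom model is not identified in the abstract, and the fine-tuning itself follows the standard CTC recipe on top of this same API:

```python
# Sketch: transcribing a classroom recording with a wav2vec 2.0 CTC model
# (checkpoint name is a stand-in for the paper's fine-tuned model).
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# audio: a 16 kHz mono waveform as a float array (random stand-in shown here).
audio = torch.randn(16000 * 4)
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits          # (1, frames, vocab)
pred_ids = torch.argmax(logits, dim=-1)                 # greedy CTC decoding
transcript = processor.batch_decode(pred_ids)[0]        # collapses repeats/blanks
print(transcript)
```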