Developing Deep Learning Models for Turkish Automatic Punctuation Restoration Using a Novel Dataset
Today, automatic speech recognition systems are widely used by individuals, institutions, and organizations. However, the lack of punctuation marks in the texts produced by these systems reduces their comprehensibility and hinders advanced text analysis. Consequently, there is an increasing need for automatic punctuation restoration models. A review of existing studies reveals that most research focuses on the English language, while languages like Turkish, which belong to the agglutinative language group, have been relatively underexplored. In this study, a unique dataset has been created for Turkish automatic punctuation restoration. Models developed using convolutional neural networks, transformer encoder, and FnetEncoder layers were trained and analyzed with this dataset. The hyper-parameters of the developed models were optimized using Bayesian optimization. The analysis results showed that the best performance was achieved by the transformer encoder-based model with an overall F-score of 90.10%. Additionally, all models were observed to be more successful in predicting periods and spaces compared to commas. This study contributes to the literature by focusing on the Turkish language and offers a novel approach to automatic punctuation restoration with the creation of a new dataset and the developed models.
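The abstract frames punctuation restoration as predicting one of three marks (period, comma, or space, i.e. no mark) for each token. A minimal sketch of that task framing, independent of the paper's actual models; the label set and the Turkish example sentence are assumptions for illustration:

```python
# Punctuation restoration framed as per-token sequence labeling.
# Label set assumed from the abstract: SPACE (no mark), COMMA, PERIOD.
MARKS = {"SPACE": "", "COMMA": ",", "PERIOD": "."}

def restore(tokens, labels):
    """Rebuild punctuated text from ASR tokens and predicted labels."""
    return " ".join(tok + MARKS[lab] for tok, lab in zip(tokens, labels))

# A model (CNN, transformer encoder, or FNet encoder) would supply
# `labels`; here they are hard-coded for illustration.
tokens = ["bugün", "hava", "çok", "güzel"]
labels = ["SPACE", "SPACE", "SPACE", "PERIOD"]
print(restore(tokens, labels))  # bugün hava çok güzel.
```

Any of the compared architectures plugs into this framing unchanged; only the classifier producing `labels` differs.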
- Conference Article
10
- 10.23919/apsipa.2018.8659622
- Nov 1, 2018
In this paper, we propose to leverage end-to-end automatic speech recognition (ASR) systems for assisting deep neural network-hidden Markov model (DNN-HMM) hybrid ASR systems. The DNN-HMM hybrid ASR system, which is composed of an acoustic model, a language model and a pronunciation model, is known to be the most practical architecture in the ASR field. On the other hand, much attention has been paid in recent studies to end-to-end ASR systems that are fully composed of neural networks. It is known that they can yield comparable performance without introducing heuristic operations. However, one problem is that end-to-end ASR systems sometimes suffer from redundant generation and omission of important words in text generation phases. This is because these systems cannot explicitly consider the connection between the input speech and the output text. Therefore, our idea is to regard the end-to-end ASR systems as neural speech-to-text language models (NS2TLMs) and to use them for rescoring hypotheses generated in the DNN-HMM hybrid ASR systems. This enables us to leverage the end-to-end ASR systems while avoiding the generation issues because the DNN-HMM hybrid ASR systems can generate speech-aligned hypotheses. It is expected that the NS2TLMs improve the DNN-HMM hybrid ASR systems because the end-to-end ASR systems correctly handle short-duration utterances. In our experiments, we use state-of-the-art DNN-HMM hybrid ASR systems with convolutional and long short-term memory recurrent neural network acoustic models and end-to-end ASR systems based on an attentional encoder-decoder. We demonstrate that our proposed method can yield better ASR performance than both the DNN-HMM hybrid ASR system and the end-to-end ASR system.
- Research Article
1
- 10.24425/aoa.2020.134058
- Aug 25, 2020
- Archives of Acoustics
Research work on the design of robust multimodal speech recognition systems making use of acoustic and visual cues, extracted using the relatively noise-robust alternate speech sensors, is gaining interest in recent times among the speech processing research fraternity. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of the confusable syllabic consonant-vowel units of the Hindi language, as a step towards building robust multimodal ASR systems in adverse environments in the context of Indian languages, which are syllabic in nature. The dataset for this work comprises the confusable 145 consonant-vowel (CV) syllabic units of the Hindi language recorded simultaneously using three modalities that capture the acoustic and visual speech cues, namely a normal acoustic microphone (NM), a throat microphone (TM) and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker’s headphone while recording. Convolutional Neural Network (CNN) models are built to categorise the CV units based on their place of articulation (POA), manner of articulation (MOA), and vowels (under clean and Lombard conditions). For validation purposes, corresponding Hidden Markov Models (HMM) are also built and tested. Unimodal Automatic Speech Recognition (ASR) systems built using each of the three speech cues from Lombard speech show a loss in recognition of MOA and vowels, while POA gets a boost in all the systems due to the Lombard effect. Combining the three complementary speech cues to build bimodal and trimodal ASR systems shows that the recognition loss due to the Lombard effect for MOA and vowels reduces compared to the unimodal systems, while the POA recognition is still better due to the Lombard effect. A bimodal system is proposed using only alternate acoustic and visual cues, which gives a better discrimination of the place and manner of articulation than even a standard ASR system.
Among the multimodal ASR systems studied, the proposed trimodal system based on Lombard speech gives the best recognition accuracy of 98%, 95%, and 76% for the vowels, MOA and POA, respectively, with an average improvement of 36% over the unimodal ASR systems and 9% improvement over the bimodal ASR systems.
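Combining the NM, TM, and visual streams as described above can be sketched as score-level late fusion: average the per-class scores from each modality and take the argmax. The class names and scores below are hypothetical, not the paper's data:

```python
def late_fusion(modality_scores):
    """Average per-class scores across modalities (e.g. NM, TM, visual)
    and return the winning class. A simplified score-level fusion sketch."""
    classes = modality_scores[0].keys()
    fused = {c: sum(m[c] for m in modality_scores) / len(modality_scores)
             for c in classes}
    return max(fused, key=fused.get)

scores = [
    {"/ka/": 0.6, "/ga/": 0.4},  # normal microphone
    {"/ka/": 0.3, "/ga/": 0.7},  # throat microphone
    {"/ka/": 0.7, "/ga/": 0.3},  # lip movements
]
print(late_fusion(scores))  # /ka/
```

A trimodal system in this framing is simply the three-element list above; a bimodal one drops a modality from the list.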
- Book Chapter
- 10.1007/978-981-19-0095-2_56
- Jun 23, 2022
Information processing has become ubiquitous. Automatic speech recognition is the process of deriving a text transcription from speech. In recent years, most real-time applications, such as home computer systems, mobile telephones, and various public and private telephony services, have been deployed with automatic speech recognition (ASR) systems. Inspired by commercial speech recognition technologies, the study of automatic speech recognition (ASR) systems has developed immense interest among researchers. This paper presents an enhancement of convolutional neural networks (CNNs) via a robust feature extraction model and an intelligent recognition system. First, a news report dataset is collected from a public repository. The collected dataset is subject to different noises and is preprocessed by min–max normalization. The normalization technique linearly transforms the data into an understandable form. Then, the best sequence of words corresponding to the audio, based on the acoustic and language model, undergoes feature extraction using Mel-frequency Cepstral Coefficients (MFCCs). The transformed features are then fed into convolutional neural networks. Hidden layers perform limited iterations to obtain a robust recognition system. Experimental results have shown better accuracy, 96.17%, than the existing ANN. Keywords: Speech recognition, Text, Mel features, Recognition accuracy, Convolutional neural networks
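The min–max normalization step named in this abstract linearly rescales values into a fixed range. A minimal sketch of that transform (the sample values and target range are illustrative assumptions):

```python
def min_max_normalize(values, lo=0.0, hi=1.0):
    """Linearly map values into [lo, hi]; assumes max(values) > min(values)."""
    mn, mx = min(values), max(values)
    return [lo + (v - mn) * (hi - lo) / (mx - mn) for v in values]

print(min_max_normalize([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```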
- Book Chapter
8
- 10.1016/b978-0-12-818130-0.00002-7
- Jan 1, 2019
- Intelligent Speech Signal Processing
Chapter 2 - End-to-End Acoustic Modeling Using Convolutional Neural Networks
- Research Article
- 10.3390/electronics11121831
- Jun 9, 2022
- Electronics
This paper presents a low-latency streaming on-device automatic speech recognition system for inference. It consists of a hardware acoustic model implemented in a field-programmable gate array, coupled with a software language model running on a smartphone. The smartphone works as the master of the automatic speech recognition system and runs a three-gram language model on the acoustic model output to increase accuracy. The smartphone calculates and sends the Mel-spectrogram of an audio stream with 80 ms unit input from the built-in microphone of the smartphone to the field-programmable gate array every 80 ms. After ~35 ms, the field-programmable gate array sends the calculated word-piece probability to the smartphone, which runs the language model and generates the text output on the smartphone display. The worst-case latency from the audio-stream start time to the text output time was measured as 125.5 ms. The real-time factor is 0.57. The hardware acoustic model is derived from a time-depth-separable convolutional neural network model by reducing the number of weights from 115 M to 9.3 M to decrease the number of multiply-and-accumulate operations by two orders of magnitude. Additionally, the unit input length is reduced from 1000 ms to 80 ms, and to minimize the latency, no future data are used. The hardware acoustic model uses an instruction-based architecture that supports any sequence of convolutional neural network, residual network, layer normalization, and rectified linear unit operations. For the LibriSpeech test-clean dataset, the word error rate of the hardware acoustic model was 13.2% and for the language model, it was 9.1%. These numbers were degraded by 3.4% and 3.2% from the original convolutional neural network software model due to the reduced number of weights and the lowering of the floating-point precision from 32 to 16 bit. The automatic speech recognition system has been demonstrated successfully in real application scenarios.
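The real-time factor reported above is processing time divided by audio duration; an RTF below 1 means the system keeps up with the stream. For the stated RTF of 0.57 with 80 ms chunks, each chunk must finish in roughly 45.6 ms on average. This is illustrative arithmetic from the reported numbers, not the authors' measured breakdown:

```python
def real_time_factor(processing_ms, audio_ms):
    """RTF < 1 means the system processes audio faster than it arrives."""
    return processing_ms / audio_ms

chunk_ms = 80.0
budget_ms = 0.57 * chunk_ms                      # per-chunk processing budget
print(round(budget_ms, 1))                       # 45.6
print(round(real_time_factor(45.6, chunk_ms), 2))  # 0.57
```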
- Research Article
12
- 10.1016/j.snb.2024.135272
- Jan 6, 2024
- Sensors and Actuators B: Chemical
A novel electronic nose classification prediction method based on TETCN
- Research Article
77
- 10.3390/healthcare10030494
- Mar 8, 2022
- Healthcare
Brain tumor is one of the most aggressive diseases nowadays, resulting in a very short life span if it is diagnosed at an advanced stage. The treatment planning phase is thus essential for enhancing the quality of life for patients. The use of Magnetic Resonance Imaging (MRI) in the diagnosis of brain tumors is extremely widespread, but the manual interpretation of large amounts of images requires considerable effort and is prone to human errors. Hence, an automated method is necessary to identify the most common brain tumors. Convolutional Neural Network (CNN) architectures are successful in image classification due to their high layer count, which enables them to conceive the features effectively on their own. The tuning of CNN hyperparameters is critical in every dataset since it has a significant impact on the efficiency of the training model. Given the high dimensionality and complexity of the data, manual hyperparameter tuning would take an inordinate amount of time, with the possibility of failing to identify the optimal hyperparameters. In this paper, we proposed a Bayesian Optimization-based efficient hyperparameter optimization technique for CNN. This method was evaluated by classifying 3064 T-1-weighted CE-MRI images into three types of brain tumors (Glioma, Meningioma, and Pituitary). Based on Transfer Learning, the performance of five well-recognized deep pre-trained models is compared with that of the optimized CNN. After using Bayesian Optimization, our CNN was able to attain 98.70% validation accuracy at best without data augmentation or cropping lesion techniques, while VGG16, VGG19, ResNet50, InceptionV3, and DenseNet201 achieved 97.08%, 96.43%, 89.29%, 92.86%, and 94.81% validation accuracy, respectively. Moreover, the proposed model outperforms state-of-the-art methods on the CE-MRI dataset, demonstrating the feasibility of automating hyperparameter optimization.
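Bayesian optimization replaces exhaustive hyperparameter search with a surrogate-guided loop: fit a cheap model of the objective, pick the next trial by an acquisition function, evaluate, and repeat. The sketch below substitutes a nearest-neighbour surrogate with a distance-based uncertainty term for the Gaussian process usually used, and a toy "validation accuracy" objective; both substitutions are assumptions for illustration only, not the paper's setup:

```python
def toy_objective(log_lr):
    """Stand-in for validation accuracy as a function of log10(learning rate);
    peaks at log_lr = -3 (i.e. lr = 1e-3)."""
    return 0.98 - 0.02 * (log_lr + 3.0) ** 2

def suggest(observed, candidates, kappa=0.5):
    """UCB acquisition over a nearest-neighbour surrogate: predicted value
    plus an uncertainty bonus that grows with distance from observed trials."""
    def ucb(x):
        dist, y = min((abs(x - xi), yi) for xi, yi in observed)
        return y + kappa * dist
    return max(candidates, key=ucb)

candidates = [-5.0 + 0.5 * i for i in range(9)]   # log10(lr) grid in [-5, -1]
observed = [(-5.0, toy_objective(-5.0))]          # one initial trial
for _ in range(6):
    tried = {xi for xi, _ in observed}
    x = suggest(observed, [c for c in candidates if c not in tried])
    observed.append((x, toy_objective(x)))

best = max(observed, key=lambda p: p[1])
print(best[0])  # -3.0, found in far fewer trials than the full grid
```

The real method in the paper fits a Gaussian process over CNN hyperparameters and evaluates each suggestion by training the network; the loop structure is the same.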
- Book Chapter
3
- 10.5772/10112
- Aug 16, 2010
Communication using speech is inherently natural, with this ability unconsciously acquired in a step-by-step manner throughout life. In order to explore the benefits of speech communication in devices, many research works have been performed over the past several decades. As a result, automatic speech recognition (ASR) systems have been deployed in a range of applications, including automatic reservation systems, dictation systems, navigation systems, etc. Due to increasing globalization, the need for effective interlingual communication has also been growing. However, because most people tend to speak foreign languages with variant or non-fluent pronunciations, there has been an increasing demand for the development of non-native ASR systems (Goronzy et al., 2001). In other words, a conventional ASR system is optimized with native speech; however, non-native speech has different characteristics from native speech. That is, non-native speech tends to reflect the pronunciations or syntactic characteristics of the mother tongue of the non-native speakers, as well as the wide range of fluencies among non-native speakers. Therefore, the performance of an ASR system evaluated using non-native speech tends to severely degrade when compared to that of native speech due to the mismatch between the native training data and the non-native test data (Compernolle, 2001). A simple way to improve the performance of an ASR system for non-native speech would be to train the ASR system using a non-native speech database, though in reality the number of non-native speech samples available for this task is not currently sufficient to train an ASR system. Thus, techniques for improving non-native ASR performance using only a small amount of non-native speech are required. There have been three major approaches for handling non-native speech in ASR: acoustic modeling, language modeling, and pronunciation modeling approaches.
First, acoustic modeling approaches find pronunciation differences and transform and/or adapt acoustic models to include the effects of non-native speech (Gruhn et al., 2004; Morgan, 2004; Steidl et al., 2004). Second, language modeling approaches deal with the grammatical effects or speaking style of non-native speech (Bellegarda, 2001). Third, pronunciation modeling approaches derive pronunciation variant rules from non-native speech and apply the derived rules to pronunciation models for non-native speech (Amdal et al., 2000; Fosler-Lussier, 1999; Goronzy et al., 2004; Gruhn et al., 2004; Raux, 2004; Strik et al., 1999). Source: Advances in Speech Recognition, edited by Noam R. Shabtai, ISBN 978-953-307-097-1, pp. 164, September 2010, Sciyo, Croatia.
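The pronunciation modeling approach above expands the lexicon with rule-derived variants. A minimal sketch with single-phone substitution rules; the phone inventory and the /r/→/l/ rule are hypothetical examples, not rules from the cited works:

```python
def expand_pronunciations(pron, rules):
    """Generate pronunciation variants by applying single-phone substitution
    rules (a simplified form of rule-based variant modeling)."""
    variants = {tuple(pron)}
    for i, phone in enumerate(pron):
        for src, dst in rules:
            if phone == src:
                v = list(pron)
                v[i] = dst
                variants.add(tuple(v))
    return sorted(variants)

# Hypothetical rule: some non-native speakers realize /r/ as /l/.
print(expand_pronunciations(["r", "ai", "s"], [("r", "l")]))
# [('l', 'ai', 's'), ('r', 'ai', 's')]
```

Both variants would then be added to the recognition lexicon so the decoder can match either realization.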
- Research Article
17
- 10.1016/j.apacoust.2020.107386
- May 5, 2020
- Applied Acoustics
Automatic speech recognition system with pitch dependent features for Punjabi language on KALDI toolkit
- Research Article
14
- 10.1109/taslp.2014.2303295
- Mar 1, 2014
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Diversity or complementarity of automatic speech recognition (ASR) systems is crucial for achieving a reduction in word error rate (WER) upon fusion using the ROVER algorithm. We present a theoretical proof explaining this often-observed link between ASR system diversity and ROVER performance. This is in contrast to many previous works that have only presented empirical evidence for this link or have focused on designing diverse ASR systems using intuitive algorithmic modifications. We prove that the WER of the ROVER output approximately decomposes into a difference of the average WER of the individual ASR systems and the average WER of the ASR systems with respect to the ROVER output. We refer to the latter quantity as the diversity of the ASR system ensemble because it measures the spread of the ASR hypotheses about the ROVER hypothesis. This result explains the trade-off between the WER of the individual systems and the diversity of the ensemble. We support this result through ROVER experiments using multiple ASR systems trained on standard data sets with the Kaldi toolkit. We use the proposed theorem to explain the lower WERs obtained by ASR confidence-weighted ROVER as compared to word frequency-based ROVER. We also quantify the reduction in ROVER WER with increasing diversity of the N-best list. We finally present a simple discriminative framework for jointly training multiple diverse acoustic models (AMs) based on the proposed theorem. Our framework generalizes and provides a theoretical basis for some recent intuitive modifications to well-known discriminative training criterion for training diverse AMs.
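The decomposition stated in this abstract, WER(ROVER) ≈ average WER of the individual systems minus the ensemble diversity, can be checked on a toy example. The sketch below simplifies ROVER to per-position majority voting over pre-aligned, equal-length hypotheses (so WER reduces to a substitution rate); the sentences are invented for illustration:

```python
from collections import Counter

def vote(hyps):
    """Per-position majority vote over aligned, equal-length hypotheses
    (a simplification of ROVER's alignment-and-voting scheme)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*hyps)]

def wer(hyp, ref):
    """Substitution-only WER for equal-length aligned word sequences."""
    return sum(h != r for h, r in zip(hyp, ref)) / len(ref)

ref  = "the cat sat on the mat".split()
hyps = ["the cat sat on a mat".split(),
        "a cat sat on the mat".split(),
        "the cat sit on the mat".split()]

rover     = vote(hyps)
avg_wer   = sum(wer(h, ref)   for h in hyps) / len(hyps)  # average system WER
diversity = sum(wer(h, rover) for h in hyps) / len(hyps)  # spread about ROVER
print(wer(rover, ref), avg_wer - diversity)
```

Here each system makes one distinct error, so voting recovers the reference exactly and the decomposition holds with equality: ROVER WER is 0, and average WER minus diversity is also 0.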
- Conference Article
- 10.1145/3428757.3429971
- Nov 30, 2020
Nowadays automatic speech recognition (ASR) systems can achieve higher and higher accuracy rates depending on the methodology applied and datasets used. The rate decreases significantly when the ASR system is being used with a non-native speaker of the language to be recognized. The main reason for this is specific pronunciation and accent features related to the mother tongue of that speaker, which influence the pronunciation. At the same time, an extremely limited volume of labeled non-native speech datasets makes it difficult to train, from the ground up, sufficiently accurate ASR systems for non-native speakers. In this research we address the problem and its influence on the accuracy of ASR systems, using the style transfer methodology. We designed a pipeline for modifying the speech of a non-native speaker so that it more closely resembles the native speech. This paper covers experiments for accent modification using different setups and different approaches, including neural style transfer and autoencoder. The experiments were conducted on English language pronounced by Japanese speakers (UME-ERJ dataset). The results show that there is a significant relative improvement in terms of the speech recognition accuracy. Our methodology reduces the necessity of training new algorithms for non-native speech (thus overcoming the obstacle related to the data scarcity) and can be used as a wrapper for any existing ASR system. The modification can be performed in real time, before a sample is passed into the speech recognition system itself.
- Research Article
- 10.37591/.v9i2.3253
- Oct 30, 2019
- Trends in Electrical Engineering
Automatic speech recognition (ASR) systems that facilitate voice-based search and information retrieval have gained importance recently. While the performance of ASR systems for Indian languages has improved in the recent past, they have yet to gain as wide acceptability as ASR systems for English spoken by Indians. Almost all Indians learn English as a second or third language. So, the phoneme set and the prosody of the native language of Indians influence the acoustic characteristics of spoken English. Since Indians speak a wide variety of languages, the acoustic characteristics of English spoken by Indians vary a lot. Thus, the recognition accuracy of Indian English could be improved by employing native-language-dependent English ASR systems. This approach requires automatic identification of the native language of the speaker. Here, we report the performance of an automatic Native Language Identification (NLI) system that recognises the native language of the speaker as Assamese, Bengali, or Bodo after analysis of an English sentence spoken by the speaker. Training and performance evaluation of an NLI system needs appropriate linguistic resources. These include (a) speech data, in each of the 3 languages from several speakers, (b) corresponding word-level transcriptions and (c) a pronunciation dictionary. While pronunciation dictionaries for the English language are freely available, spoken English by speakers of the above-mentioned three languages and transcriptions are not publicly available. So, we created a relevant speech database. We recorded English spoken by native speakers, both male and female, of these three scheduled languages. Each speaker read 100 sentences out of a set of 700 English sentences; these were either proverbs or digit sequences. Each sentence contained 5 to 8 words. The digitised speech, recorded under ambient conditions using a laptop, had the following characteristics: 16000 Hz, 16 bit, mono.
The database contains spoken English from 35 native Assamese speakers, 33 Bengali and 30 Bodo speakers. In order to carry out a threefold evaluation of the performance of the system, the speakers from each language were grouped into 3 subsets such that each subset contains nearly equal number of speakers. In each fold, one subset was designated as test data, and the remaining two subsets were used to train the system. We used Kaldi, an open source ASR toolkit, for implementation of the NLI system. As a first step in the development of NLI system, we implemented three English ASR systems, each trained using training data from one of the three languages: Assamese, Bengali and Bodo. A three-state Hidden Markov Model (HMM) represented a phone. Each state of HMM was associated with a Gaussian mixture model. We used Mel frequency cepstral coefficients and their temporal derivatives as features, and bigram as the language model. In order to identify the native language of a speaker, the test speech file was fed to each of the three ASR systems. An ASR system not only generates the decoded word sequence, but also the corresponding log likelihood. The NLI system follows the maximum likelihood criterion. The language corresponding to the ASR system that yielded the highest likelihood for the test speech was declared as the native language of the speaker. The overall accuracy of the NLI system was computed as the unweighted average recall, computed from the confusion matrix. The NLI accuracy of the system, averaged over threefold cross evaluations, was 59% for test speech of just 3 seconds. The confusion was largest among Assamese and Bengali languages as both are close members of Indo-Aryan language family. In contrast, Bodo belongs to the Sino-Tibetan language family. 
We discuss the performance of the NLI system using different models such as context-dependent and context independent HMMs, employing Gaussian mixture model or deep neural network to estimate the likelihood of a feature vector emitted from a state of HMM. Keywords: Automatic identification, automatic speech recognition, native language identification, voice-based search, information retrieval
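The maximum-likelihood decision rule described above reduces to an argmax over the per-language ASR log-likelihoods for the test utterance. A minimal sketch; the log-likelihood values are hypothetical, not from the paper:

```python
def identify_native_language(loglikes):
    """Pick the language whose ASR system scored the test utterance highest
    (the maximum-likelihood decision rule of the NLI system)."""
    return max(loglikes, key=loglikes.get)

# Hypothetical log-likelihoods from the three language-specific ASR systems.
scores = {"Assamese": -312.4, "Bengali": -305.7, "Bodo": -341.9}
print(identify_native_language(scores))  # Bengali
```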
- Research Article
1
- 10.13052/jmm1550-4646.1869
- Jul 18, 2022
- Journal of Mobile Multimedia
In recent years, the usage of smartphones has increased rapidly. Such smartphones can be controlled by natural human speech signals with the help of automatic speech recognition (ASR). Since a smartphone is a small gadget, it has various limitations in computational power, battery, and storage. The performance of the ASR system can be increased only in online mode, since it needs to work through a remote server. The ASR system can also work in offline mode, but the performance and accuracy are lower when compared with online ASR. To overcome the issues that occur in the offline ASR system, we proposed a model that combines the bidirectional gated recurrent unit (Bi-GRU) with a convolutional neural network (CNN). This model contains one layer of CNN and two layers of gated Bi-GRU. CNN has the potential to learn local features. Similarly, Bi-GRU has expertise in handling long-term dependency. The capacity of the proposed model is higher when compared with traditional CNN. The proposed model achieved nearly 5.8% higher accuracy when compared with the previous state-of-the-art methods.
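The capacity claim above rests on stacking one convolutional layer (local features) before two bidirectional GRU layers (long-term dependency). A small shape-bookkeeping sketch of such a pipeline; the frame counts, kernel size, stride, and hidden width are assumptions, not the paper's configuration:

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution over a frame sequence."""
    return (length + 2 * padding - kernel) // stride + 1

def bigru_out_dim(hidden):
    """A bidirectional GRU concatenates forward and backward states."""
    return 2 * hidden

frames = 100                                               # input feature frames
t = conv1d_out_len(frames, kernel=5, stride=2, padding=2)  # after the CNN layer
d = bigru_out_dim(128)                                     # after each Bi-GRU layer
print(t, d)  # 50 256
```

The stride-2 convolution halves the time axis before the recurrent layers, which is one common way such hybrids reduce the sequence length the GRUs must span.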
- Research Article
270
- 10.1016/j.asoc.2020.106580
- Jul 28, 2020
- Applied Soft Computing
A Novel Medical Diagnosis model for COVID-19 infection detection based on Deep Features and Bayesian Optimization
- Research Article
- 10.1177/00368504251330003
- Apr 1, 2025
- Science Progress
The reciprocating piston pump is an important piece of power equipment in coal mine production, so research on its condition monitoring and fault diagnosis is of great significance. It is challenging to extract fault information from monitoring data due to the complex underground environment and serious noise. The existing methods have the problems of insensitive feature extraction and low diagnostic accuracy. Based on this, a new fault diagnosis method for reciprocating piston pumps based on feature fusion of a convolutional neural network (CNN) and transformer encoder is proposed. In this method, a multi-scale CNN encoder and a transformer encoder are used to extract local and global features of signals in parallel, and a multi-scale convolution module is used to improve the diversity of local features. At the same time, before using the transformer encoder to extract global features, patch segmentation of monitoring signals is carried out in combination with the phase of the reciprocating piston pump crankshaft to reduce the influence of data randomness on global features and improve the interpretability of global features. In addition, a feature fusion module is constructed to realize the interaction and fusion of local and global features and improve the comprehensive characterization ability of the device state. The proposed method is applied to the fault diagnosis task of the reciprocating piston pump. The experimental results show that the proposed method achieves a diagnostic accuracy of 99.145% ± 0.1576%, demonstrating its excellent performance. This accuracy rate is significantly higher than that of other existing methods, indicating that the proposed method can more accurately diagnose the faults of reciprocating piston pumps.
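The phase-based patch segmentation step described above can be sketched as slicing the monitoring signal into one patch per crankshaft revolution, so each patch covers the same mechanical phase; the sample counts below are illustrative assumptions:

```python
def segment_by_phase(signal, samples_per_rev):
    """Split a monitoring signal into non-overlapping patches aligned to the
    crankshaft phase, one patch per revolution; a trailing partial
    revolution is discarded."""
    return [signal[i:i + samples_per_rev]
            for i in range(0, len(signal) - samples_per_rev + 1, samples_per_rev)]

signal = list(range(10))
print(segment_by_phase(signal, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each patch then becomes one token for the transformer encoder, which is what ties the attention pattern to the pump's mechanical cycle.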