Analysis of French phonetic idiosyncrasies for accent recognition
Speech recognition systems have made tremendous progress over the last few decades and have become remarkably good at identifying what a speaker says. However, there is still room for improvement in recognizing the nuances and accents of a speaker. Any natural language may be spoken with more than one accent, and even when the phonemic composition of a word is identical, pronouncing it in different accents produces different sound waves. Differences in pronunciation, accent, and intonation are therefore among the most common problems in speech recognition: if a language has many accents, a separate acoustic model must be built for each. We carry out a systematic analysis of the problem of accurately classifying accents. We apply traditional machine learning techniques and convolutional neural networks, and show that the classical techniques are not sufficiently efficient to solve this problem. Using spectrograms of speech signals, we propose a multi-class classification framework for accent recognition. In this paper, we focus our attention on the French accent. We also identify the framework's limitations by studying the impact of French idiosyncrasies on its spectrograms.
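As a sketch of the spectrogram-based, multi-class framework this abstract describes, the snippet below converts an utterance to a log-mel spectrogram and defines a small CNN classifier over it. The mel settings, layer sizes, input shape, and five-accent output are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a spectrogram-based accent classifier (assumed settings).
import librosa
import numpy as np
import tensorflow as tf

def to_log_mel(wav_path, sr=16000, n_mels=64):
    """Load an utterance and convert it to a log-mel spectrogram."""
    y, _ = librosa.load(wav_path, sr=sr)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(spec, ref=np.max)

def build_accent_cnn(input_shape, n_accents):
    """Small CNN over spectrogram 'images', one softmax unit per accent class."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),           # (mels, frames, 1)
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_accents, activation="softmax"),
    ])

model = build_accent_cnn((64, 126, 1), n_accents=5)  # 5 classes is illustrative
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```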
- Conference Article
7
- 10.1109/icccnt51525.2021.9579986
- Jul 6, 2021
The Automatic Speech Recognition (ASR) system is becoming an unavoidable feature in the automotive industry nowadays. The main goal of ASR is to get the machine to understand spoken language. Automotive speech recognition fulfills the "hands on the wheel, eyes on the road" principle of automotive design by giving the driver access to various physical controls of the vehicle via speech commands. Automotive speech recognition differs from other speech recognition systems in the environment in which it operates: noise sources within the automotive cabin degrade recognition performance. For low Signal-to-Noise Ratio (SNR) signals, the speech recognition system alone cannot interpret the speech signal, and the automotive speech recognition system fails. This paper aims to improve the intelligibility and quality of speech in the automotive environment by processing the speech signal before feeding it to the speech recognition system. We have performed experiments on classical speech enhancement techniques and on Deep Neural Network (DNN) based models. The result is a Wavenet - Long Short-Term Memory (LSTM) network that enhances speech quality and suppresses noise, so that the speech recognition system can work accurately even on low-SNR signals.
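The paper's exact Wavenet-LSTM is not reproduced here; the sketch below shows the general shape of an LSTM enhancement front-end that predicts a spectral mask applied to noisy magnitude frames before they reach the recognizer. All layer sizes and the mask-based formulation are assumptions.

```python
# Minimal sketch of an LSTM mask-based speech enhancement front-end (assumed).
import tensorflow as tf

def build_lstm_enhancer(n_freq_bins=257):
    """Map noisy magnitude frames to a [0, 1] mask over frequency bins."""
    inp = tf.keras.layers.Input(shape=(None, n_freq_bins))    # (frames, bins)
    x = tf.keras.layers.LSTM(256, return_sequences=True)(inp)
    mask = tf.keras.layers.Dense(n_freq_bins, activation="sigmoid")(x)
    enhanced = tf.keras.layers.Multiply()([inp, mask])        # masked spectrum
    return tf.keras.Model(inp, enhanced)

model = build_lstm_enhancer()
model.compile(optimizer="adam", loss="mse")  # train against clean spectra
```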
- Research Article
25
- 10.1016/j.specom.2008.05.004
- May 20, 2008
- Speech Communication
Combined speech enhancement and auditory modelling for robust distributed speech recognition
- Research Article
111
- 10.1007/s11831-023-09899-9
- Apr 4, 2023
- Archives of Computational Methods in Engineering
Convolutional neural networks (CNNs) have shown impressive accomplishments in different areas, especially object detection, segmentation, reconstruction (2D and 3D), information retrieval, medical image registration, multilingual translation, natural language processing, anomaly detection in video, and speech recognition. The CNN is a special type of neural network with a compelling and effective ability to learn features at several stages as the data are transformed. Recently, different interesting and inspiring ideas from deep learning (DL), such as new activation functions, hyperparameter optimization, regularization, momentum, and loss functions, have improved the performance, operation, and execution of CNNs. Innovations in the internal architecture of CNNs and different representational styles have also significantly improved performance. This survey focuses on the internal taxonomy of deep learning and on different models of convolutional neural networks, especially the depth and width of models, as well as CNN components, applications, and current challenges of deep learning.
- Research Article
42
- 10.31083/j.jin.2020.01.24
- Jan 1, 2020
- Journal of Integrative Neuroscience
Electroencephalography is the recording of the brain's electrical activity and can be used to diagnose seizure disorders. By identifying patterns of brain activity and their correspondence with symptoms and diseases, it is possible to give an accurate diagnosis and appropriate drug therapy to patients. This work aims to categorize electroencephalography recordings from different channels in order to classify and predict epileptic seizures. The dataset contains 179 attributes and 11,500 instances. Instances fall into five categories, one of which corresponds to epileptic seizure symptoms. We use traditional, ensemble, and deep machine learning techniques, highlighting their performance on the epilepsy seizure detection task. A one-dimensional convolutional neural network and ensemble techniques such as bagging, boosting (AdaBoost, gradient boosting, and XGBoost), and stacking are implemented. Traditional machine learning techniques such as decision tree, random forest, extra trees, ridge classifier, logistic regression, K-Nearest Neighbors, Naive Bayes (Gaussian), and kernel Support Vector Machines (polynomial, Gaussian) are used for classifying and predicting epileptic seizures. Before applying the ensemble and traditional techniques, we preprocess the dataset using the Karl Pearson coefficient of correlation to eliminate irrelevant attributes. The classification and prediction accuracy of the classifiers is then evaluated using k-fold cross-validation, and the Receiver Operating Characteristic Area Under the Curve is reported for each classifier. After sorting and comparing the algorithms, we find that the convolutional neural network and the extra-trees bagging classifier perform better than all other ensemble and traditional classifiers.
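The preprocessing and evaluation pipeline outlined above can be sketched briefly: drop attributes weakly correlated with the label (Karl Pearson coefficient), then score a classifier with k-fold cross-validation. The 0.05 threshold, the classifier choice, and the column names are assumptions for illustration.

```python
# Minimal sketch: Pearson-correlation feature filtering + k-fold evaluation.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def select_by_pearson(df: pd.DataFrame, label: str, threshold: float = 0.05):
    """Keep features whose |Pearson r| with the label exceeds the threshold."""
    corr = df.corr(numeric_only=True)[label].drop(label)
    return corr[corr.abs() > threshold].index.tolist()

# Hypothetical usage on the 179-attribute, 11,500-instance dataset:
# df = pd.read_csv("epilepsy.csv")
# kept = select_by_pearson(df, label="y")
# scores = cross_val_score(ExtraTreesClassifier(), df[kept], df["y"], cv=10)
# print(scores.mean())
```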
- Research Article
8
- 10.1109/taslp.2021.3104193
- Jan 1, 2021
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
We investigate the potential of stochastic neural networks for learning effective waveform-based acoustic models. The waveform-based setting, inherent to fully end-to-end speech recognition systems, is motivated by several comparative studies of automatic and human speech recognition that associate standard non-adaptive feature extraction techniques with information loss which can adversely affect robustness. Stochastic neural networks, on the other hand, are a class of models capable of incorporating rich regularization mechanisms into the learning process. We consider a deep convolutional neural network that first decomposes speech into frequency sub-bands via an adaptive parametric convolutional block where filters are specified by cosine modulations of compactly supported windows. The network then employs standard non-parametric 1D convolutions to extract relevant spectro-temporal patterns while gradually compressing the structured high-dimensional representation generated by the parametric block. We rely on a probabilistic parametrization of the proposed neural architecture and learn the model using stochastic variational inference. This requires evaluation of an analytically intractable integral defining the Kullback-Leibler divergence term responsible for regularization, for which we propose an effective approximation based on the Gauss-Hermite quadrature. Our empirical results demonstrate superior performance of the proposed approach over comparable waveform-based baselines and indicate that it could lead to more robust acoustic models. Moreover, the approach outperforms a recently proposed deep convolutional neural network for learning robust acoustic models with standard FBANK features.
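The Gauss-Hermite quadrature mentioned above approximates a Gaussian expectation with no closed form as a weighted sum over quadrature nodes. The standalone sketch below shows the quadrature itself, not the paper's full variational objective.

```python
# Gauss-Hermite approximation of E[f(x)] for x ~ N(mu, sigma^2):
# E[f(x)] ~ (1/sqrt(pi)) * sum_i w_i * f(mu + sqrt(2)*sigma*t_i).
import numpy as np

def gauss_hermite_expectation(f, mu, sigma, n_points=20):
    """Approximate E[f(x)] under N(mu, sigma^2) with n-point Gauss-Hermite."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables x = mu + sqrt(2) * sigma * t absorbs the density.
    x = mu + np.sqrt(2.0) * sigma * nodes
    return np.sum(weights * f(x)) / np.sqrt(np.pi)

# Sanity check: E[x^2] for N(1, 2^2) is mu^2 + sigma^2 = 5.
print(gauss_hermite_expectation(lambda x: x**2, mu=1.0, sigma=2.0))  # ~5.0
```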
- Conference Article
5
- 10.1109/iraset52964.2022.9737976
- Mar 3, 2022
Artificial intelligence-based speech recognition systems are already available and capable of recognizing the French language. Still, comparing them to determine which will be most effective for an assistant robot is quite time-consuming. The study aims to select the best French-language speech recognition system, the one with the least error in a real environment. In this paper, we present related work on how an Automatic Speech Recognition (ASR) system works, the models used by each of its components, several open-source French datasets, and the most frequently used evaluation techniques. Next, we compare deep learning-based speech recognition APIs and pre-trained models for French on two different datasets using the Word Error Rate (WER) metric. The experimental results reveal that Google's Speech-to-Text API outperforms the other systems, namely the VOSK API, Wav2vec 2.0, QuartzNet, and SpeechBrain's Convolutional, Recurrent, and Fully-connected Networks (CRDNN) model.
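The comparison above is scored with Word Error Rate, whose standard definition is (substitutions + deletions + insertions) divided by the reference length. A minimal sketch of that edit-distance computation:

```python
# Word Error Rate via word-level Levenshtein distance (standard definition).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("je suis à la maison", "je suis la mais son"))  # 0.6
```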
- Research Article
2
- 10.14201/adcaij.29191
- Aug 27, 2024
- ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal
Recently, one of the most common approaches used in speech recognition is deep learning. The most advanced results have been obtained with speech recognition systems built on convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Since CNNs can capture local features effectively, they are applied to tasks with relatively short-term dependencies, such as keyword detection or phoneme-level sequence recognition. This paper presents the development of a deep learning speech command recognition system. The Google Speech Commands Dataset was used for training; it contains 65,000 one-second utterances of 30 short English words. 80% of the dataset was used for training and 20% for testing. Each one-second voice command is converted into a spectrogram and used to train different artificial neural network (ANN) models, drawing on various CNN variants common in deep learning applications. The proposed model reaches an accuracy of 94.60%.
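The preprocessing step described above, turning a one-second command waveform into a spectrogram "image" a CNN can train on, can be sketched as follows. The frame sizes are conventional assumptions; the paper does not specify them here.

```python
# Minimal sketch: one-second, 16 kHz waveform -> spectrogram for a Conv2D model.
import tensorflow as tf

def waveform_to_spectrogram(waveform):
    """waveform: float32 tensor of 16,000 samples (one second at 16 kHz)."""
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)             # magnitude spectrogram
    return spectrogram[..., tf.newaxis]    # add channel axis for Conv2D

wave = tf.random.uniform([16000], minval=-1.0, maxval=1.0)  # stand-in command
print(waveform_to_spectrogram(wave).shape)                  # (124, 129, 1)

# 80/20 train/test split of a tf.data dataset of (spectrogram, label) pairs:
# n_train = int(0.8 * n_total)
# ds_train, ds_test = ds.take(n_train), ds.skip(n_train)
```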
- Research Article
24
- 10.1109/tim.2012.2190344
- Sep 1, 2012
- IEEE Transactions on Instrumentation and Measurement
A field-programmable gate array (FPGA)-based robust speech measurement and recognition system is the focus of this paper, and the environmental noise problem is its main concern. To accelerate the recognition speed of the FPGA-based speech recognition system, the discrete hidden Markov model is used here to lessen the computation burden inherent in speech recognition. Furthermore, the empirical mode decomposition is used to decompose the measured speech signal contaminated by noise into several intrinsic mode functions (IMFs). The IMFs are then weighted and summed to reconstruct the original clean speech signal. Unlike previous research, in which IMFs were selected by trial and error for specific applications, the weights for each IMF are designed by the genetic algorithm to obtain an optimal solution. The experimental results in this paper reveal that this method achieves a better speech recognition rate for speech subject to various environmental noises. Moreover, this paper also explores the hardware realization of the designed speech measurement and recognition systems on an FPGA-based embedded system with the System-On-a-Chip (SOC) architecture. Since the central-processing-unit core adopted in the SOC has limited computation ability, this paper uses the integer fast Fourier transform (FFT) to replace the floating-point FFT to speed up the computation for capturing speech features through a mel-frequency cepstrum coefficient. The result is a significant reduction in the calculation time without influencing the speech recognition rate. It can be seen from the experiments in this paper that the performance of the implemented hardware is significantly better than that of existing research.
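The weighted-IMF reconstruction step described above can be sketched in a few lines. The fixed weights below stand in for the paper's genetic-algorithm search, and the PyEMD package (pip name EMD-signal) is an assumed software implementation of empirical mode decomposition, not the paper's FPGA realization.

```python
# Minimal sketch: EMD decomposition + weighted IMF recombination (assumed libs).
import numpy as np
from PyEMD import EMD

def weighted_imf_reconstruction(signal, weights):
    """Decompose into IMFs, return the weighted sum of the first len(weights)."""
    imfs = EMD()(signal)                   # shape: (n_imfs, n_samples)
    k = min(len(weights), len(imfs))
    return np.sum(np.asarray(weights[:k])[:, None] * imfs[:k], axis=0)

t = np.linspace(0, 1, 8000)
noisy = np.sin(2 * np.pi * 200 * t) + 0.3 * np.random.randn(t.size)
clean_estimate = weighted_imf_reconstruction(noisy, weights=[0.2, 1.0, 1.0, 0.8])
```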
- Research Article
- 10.5121/ijci.2023.120220
- Mar 11, 2023
- International Journal on Cybernetics & Informatics
Recently, many researchers have focused on building and improving speech recognition systems to facilitate and enhance human-computer interaction. Today, Automatic Speech Recognition (ASR) systems have become important and common tools, from games to translation systems, robots, and more. However, there is still a need for research on speech recognition for low-resource languages. This article deals with isolated-word recognition for the Dari language, using the Mel-frequency cepstral coefficients (MFCCs) feature extraction method and three different deep neural networks, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Multilayer Perceptron (MLP), plus two hybrid CNN-RNN models. We evaluate our models on our purpose-built isolated Dari words corpus, which consists of 1,000 utterances of 20 short Dari terms. This study obtained an impressive average accuracy of 98.365%.
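The MFCC front-end named above is a standard feature extraction step; a minimal sketch follows. Thirteen coefficients is a conventional choice, and the file name is hypothetical; neither is specified by the abstract.

```python
# Minimal sketch of MFCC feature extraction for one isolated word.
import librosa

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    """Return a (frames, n_mfcc) matrix of MFCCs for one utterance."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.T

# features = mfcc_features("dari_word_001.wav")  # hypothetical file name
# A CNN, RNN, or MLP can then be trained on these (frames, 13) matrices.
```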
- Research Article
- 10.5121/ijnlc.2023.12203
- Apr 29, 2023
- International Journal on Natural Language Computing
Recently, many researchers have focused on building and improving speech recognition systems to facilitate and enhance human-computer interaction. Today, Automatic Speech Recognition (ASR) systems have become important and common tools, from games to translation systems, robots, and more. However, there is still a need for research on speech recognition for low-resource languages. This article deals with isolated-word recognition for the Dari language, using the Mel-frequency cepstral coefficients (MFCCs) feature extraction method and three different deep neural networks: a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Multilayer Perceptron (MLP). We evaluate our models on our purpose-built isolated Dari words corpus, which consists of 1,000 utterances of 20 short Dari terms. This study obtained an impressive average accuracy of 98.365%.
- Research Article
- 10.15276/hait.06.2023.13
- Oct 12, 2023
- Herald of Advanced Information Technology
The relevance of solving the problem of facial emotion recognition in human images is shown for the creation of modern intelligent systems of computer vision and human-machine interaction, online learning and emotional marketing, health care and forensics, machine graphics and game intelligence. Successful examples of technological solutions to the facial emotion recognition problem using transfer learning of deep convolutional neural networks are presented. However, using popular datasets such as DISFA, CelebA, and AffectNet for deep learning of convolutional neural networks does not give good results in terms of emotion recognition accuracy, because almost all training sets have fundamental flaws related to errors in their creation, such as missing data for certain classes, class imbalance, subjective and ambiguous labeling, and insufficient data for deep learning. It is proposed to overcome these shortcomings by adding to the training sample additional pseudo-labeled images of human emotions that are recognized with high accuracy. The aim of the research is to increase the accuracy of facial emotion recognition in human images by developing a pseudo-labeling method for transfer learning of a deep neural network. To achieve this aim, the following tasks were solved: a convolutional neural network model, previously trained on the ImageNet set using the transfer learning method, was fine-tuned on the RAF-DB dataset for emotion recognition tasks; a pseudo-labeling method for the RAF-DB data was developed for semi-supervised learning of a convolutional neural network model for facial emotion recognition; and the accuracy of facial emotion recognition was analyzed based on the developed convolutional neural network model and the RAF-DB pseudo-labeling method. It is shown that using the developed pseudo-labeling method with transfer learning of the MobileNet V1 convolutional neural network model increased the accuracy of facial emotion recognition on RAF-DB images by 2 percentage points (from 76% to 78%) according to the F1 score. At the same time, given the significant class imbalance among the 7 main emotions in the training set, there is a significant increase in the accuracy of recognizing under-represented emotions such as surprise (from 71% to 77%), fear (from 64% to 69%), sadness (from 72% to 76%), anger (from 64% to 74%), and neutral (from 66% to 71%). The accuracy of recognizing happiness, the most common emotion, decreased (from 91% to 86%). Thus, it can be concluded that the developed pseudo-labeling method gives good results in overcoming such dataset shortcomings for deep learning of convolutional neural networks as missing data for certain classes, class imbalance, and insufficient data for deep learning.
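The pseudo-labeling loop the article develops follows a common pattern: predict on unlabeled images, keep only confident predictions as extra training data, and retrain. A minimal sketch, with a 0.95 confidence threshold that is an assumption rather than the article's value:

```python
# Minimal sketch of confidence-thresholded pseudo-labeling (assumed threshold).
import numpy as np

def pseudo_label(model, x_unlabeled, threshold=0.95):
    """Return (inputs, hard labels) for samples the model is confident about."""
    probs = model.predict(x_unlabeled)              # (n, n_classes) softmax
    confident = probs.max(axis=1) >= threshold
    return x_unlabeled[confident], probs[confident].argmax(axis=1)

# Hypothetical usage with a Keras classifier and arrays x_train, y_train:
# x_extra, y_extra = pseudo_label(model, x_unlabeled)
# model.fit(np.concatenate([x_train, x_extra]),
#           np.concatenate([y_train, y_extra]), epochs=5)
```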
- Research Article
5
- 10.1016/j.ijmedinf.2023.105213
- Sep 9, 2023
- International Journal of Medical Informatics
Machine learning-based speech recognition system for nursing documentation – A pilot study
- Research Article
31
- 10.3390/drones7060382
- Jun 6, 2023
- Drones
Unmanned aerial vehicles (UAVs) are increasingly being integrated into the domain of precision agriculture, revolutionizing the agricultural landscape. Specifically, UAVs are being used in conjunction with machine learning techniques to solve a variety of complex agricultural problems. This paper provides a careful survey of more than 70 studies that have applied machine learning techniques utilizing UAV imagery to solve agricultural problems. The survey examines the models employed, their applications, and their performance, spanning a wide range of agricultural tasks, including crop classification, crop and weed detection, cropland mapping, and field segmentation. Comparisons are made among supervised, semi-supervised, and unsupervised machine learning approaches, including traditional machine learning classifiers, convolutional neural networks (CNNs), single-stage detectors, two-stage detectors, and transformers. Lastly, future advancements and prospects for UAV utilization in precision agriculture are highlighted and discussed. The general findings of the paper demonstrate that, for simple classification problems, traditional machine learning techniques, CNNs, and transformers can be used, with CNNs being the optimal choice. For segmentation tasks, U-Nets are by far the preferred approach. For detection tasks, two-stage detectors delivered the best performance. On the other hand, for dataset augmentation and enhancement, generative adversarial networks (GANs) were the most popular choice.
- Conference Article
10
- 10.1109/iraniancee.2017.7985272
- May 1, 2017
Convolutional neural networks (CNNs) have recently been used for acoustic modeling and feature extraction in speech recognition systems, with speech spectrograms or even raw speech signals as inputs. In this paper, we propose using a CNN to learn a filter bank and extract robust features from the noisy speech spectrum. In the proposed scheme, the CNN's inputs are the noisy speech spectrum, its outputs are denoised logarithms of Mel filter bank energies (LMFBs), and the convolution filter size is fixed. Furthermore, we propose using multiple CNNs with different convolution filter sizes to provide different frequency resolutions for feature extraction from the speech spectrum; we name this method the Multiresolution CNN (MRCNN). We handle the outputs of the multiple CNNs in two ways. In the first, we concatenate all outputs to construct the feature vector. In the second, we choose some outputs from each CNN based on its convolution filter size and concatenate them to construct the feature vector. Recognition accuracy on the Aurora 2 database shows that an MRCNN with two CNNs and corresponding 1×6 and 1×20 convolution filter sizes outperforms single CNNs and other MRCNN settings in extracting robust features.
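The multiresolution idea above, parallel convolutions with different filter widths whose outputs are concatenated into one feature vector, can be sketched as follows. The 1×6 and 1×20 widths come from the abstract's best setting; the filter counts, input shape, and dense head are assumptions.

```python
# Minimal sketch of the MRCNN concatenation variant (assumed sizes).
import tensorflow as tf

def build_mrcnn(n_freq=257, n_out=24):
    inp = tf.keras.layers.Input(shape=(1, n_freq, 1))    # one spectral frame
    branches = []
    for width in (6, 20):                                # two frequency resolutions
        x = tf.keras.layers.Conv2D(8, (1, width), activation="relu")(inp)
        x = tf.keras.layers.Flatten()(x)
        branches.append(x)
    feats = tf.keras.layers.Concatenate()(branches)      # joint feature vector
    out = tf.keras.layers.Dense(n_out)(feats)            # denoised LMFB-like features
    return tf.keras.Model(inp, out)

build_mrcnn().summary()
```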
- Research Article
23
- 10.1155/2021/1874584
- Jul 26, 2021
- Mobile Information Systems
With the acceleration of global integration, the demand for English instruction is rising. However, Chinese English learners struggle to learn spoken English due to the limited English learning environment and teaching conditions in China. Advances in artificial intelligence technology and in language teaching and learning techniques have ushered in a new era of language learning, and deep learning technology makes it possible to solve this problem. Speech recognition and assessment technology are at the heart of language learning, with speech recognition as the foundation. Because of the complex variation in speech pronunciation, the large amount of speech signal data, the high dimensionality of speech characteristic parameters, and the heavy computation involved in speech recognition and evaluation, large-volume speech signal processing places high demands on hardware, software, and algorithms. Traditional speech recognition algorithms, such as dynamic time warping, hidden Markov models, and artificial neural networks, have their advantages and disadvantages, but they have hit unprecedented bottlenecks, making it difficult to improve their accuracy and speed further. To address these problems, this paper focuses on evaluating the multimedia teaching effect of college English. A multilevel residual convolutional neural network algorithm for oral English pronunciation recognition is proposed, based on a deep convolutional neural network. Experiments show that our algorithm can help learners identify inconsistencies between their pronunciation and standard pronunciation and correct pronunciation errors, resulting in improved oral English learning performance.
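The building block behind a "multilevel residual" CNN is a convolutional block whose input is added back to its output, which keeps deeper pronunciation models trainable. A minimal sketch, with channel counts and stacking depth as assumptions:

```python
# Minimal sketch of a residual block and two stacked "levels" (assumed sizes).
import tensorflow as tf

def residual_block(x, channels=64):
    """Two 3x3 convolutions with a skip connection added back around them."""
    shortcut = x
    y = tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(channels, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.Activation("relu")(y)

inp = tf.keras.layers.Input(shape=(64, 64, 64))  # input channels match `channels`
out = residual_block(residual_block(inp))        # two stacked levels
model = tf.keras.Model(inp, out)
```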