Automatic Speech Recognition Systems Research Articles

The accuracy of automatic recognition systems for spontaneous speech falls well short of that of systems for prepared speech, because spontaneous speech is not as smooth and failure-free as prepared speech. Spontaneous speech varies from speaker to speaker: the quality of phoneme pronunciation, the presence of pauses, speech disruptions, and extralinguistic items (laughing, coughing, sneezing, chuckling when expressing irritation, etc.) all interrupt the fluency of verbal speech. Extralinguistic items nevertheless often carry important paralinguistic information, so an automatic spontaneous speech recognition system must not only identify such phenomena and distinguish them from the verbal components of speech but also classify them. This review analyzes work on the automatic detection and analysis of extralinguistic items in spontaneous speech. It covers both individual methods and approaches for recognizing extralinguistic items in a speech stream and work on the multiclass classification of extralinguistic units recorded in isolation. The most popular methods for analyzing extralinguistic units are neural networks, such as deep neural networks and transformer-based networks. The review introduces the basic concepts behind the term "extralinguistic items", proposes an original systematization of extralinguistic items in the Russian language, describes corpora and databases of spoken speech in Russian and in other languages, and lists data sets of extralinguistic items recorded in isolation.
Recognition accuracy for extralinguistic items improves under the following conditions: pre-processing the audio signals increases classification accuracy for items recorded in isolation; taking context into account (analyzing several frames of the speech signal) and applying filters to smooth the time series after feature-vector extraction increase accuracy in frame-by-frame analysis of spontaneous speech.
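
The two frame-level ideas mentioned above (using several neighbouring frames as context, and smoothing the per-frame time series) can be illustrated with a minimal sketch. The abstract does not say which filter or context width the reviewed works use, so the median filter, the window sizes, and the function names `stack_context` and `median_smooth` are assumptions chosen only for illustration.

```python
from statistics import median


def stack_context(frames, left=2, right=2):
    """Concatenate each frame's features with those of its neighbours.

    Edge frames are repeated as padding (an assumption of this sketch).
    """
    n = len(frames)
    stacked = []
    for i in range(n):
        context = []
        for j in range(i - left, i + right + 1):
            # Clamp the index so positions outside the signal reuse the edge frame.
            context.extend(frames[min(max(j, 0), n - 1)])
        stacked.append(context)
    return stacked


def median_smooth(values, window=5):
    """Smooth a per-frame sequence with a sliding median of odd width."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        # Edge values are repeated so every window holds exactly `window` items.
        win = [values[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        smoothed.append(median(win))
    return smoothed


if __name__ == "__main__":
    # Hypothetical per-frame decisions: 1 = extralinguistic item, 0 = speech.
    raw = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
    print(median_smooth(raw, window=3))
    # Hypothetical one-dimensional feature vectors.
    print(stack_context([[1], [2], [3]], left=1, right=1))
```

In this toy input the isolated single-frame flips are removed by the filter, which is the kind of effect smoothing is meant to have on frame-by-frame decisions.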

The work presented in this paper aims to improve end-to-end (E2E) speech recognition for children's speech under low-resource conditions. For the majority of languages, there is hardly any speech data from child speakers, and even the available children's speech corpora contain only a limited number of hours. In contrast, large amounts of adults' speech data are freely available for research as well as commercial purposes. As a consequence, developing an effective E2E automatic speech recognition (ASR) system for children is a very challenging task. One may train an ASR system on adults' speech and then use it to transcribe children's data, but this leads to very poor recognition rates because of the stark differences between the acoustic attributes of adults' and children's speech. To overcome these hurdles and develop a robust children's ASR system with an E2E architecture, we have resorted to several out-of-domain and in-domain data augmentation techniques. For out-of-domain augmentation, we explicitly modified adults' speech to make it acoustically similar to children's speech before pooling it into training. For in-domain augmentation, we slightly modified the pitch and duration of children's speech to create more data capturing greater diversity. These data augmentation approaches help mitigate, to a certain extent, the ill effects of data scarcity in the child domain, which in turn reduces the error rates by a large margin. In addition to data augmentation, we studied the efficacy of Gammatone frequency cepstral coefficients (GFCC) and the frequency domain linear prediction (FDLP) technique alongside the most commonly used Mel-frequency cepstral coefficients (MFCC) for front-end speech parameterization. Both MFCC and GFCC capture and model the spectral envelope of speech.
Applying linear prediction to the frequency-domain representation of the speech signal, on the other hand, effectively captures the temporal envelope during front-end feature extraction. The temporal envelope modeled by FDLP features provides important cues for the perception and understanding of stop bursts and, at times, complete phonemes. This motivated us to perform a comparative experimental study of the effectiveness of the three front-end acoustic features. In our experiments, the proposed data augmentation in combination with FDLP features gave a relative improvement in character error rate of 67.6% over the baseline system. Combining data augmentation with MFCC or GFCC features is observed to yield lower recognition performance.
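
For readers unfamiliar with the metric quoted above, the following minimal sketch shows how a character error rate (CER) and a relative improvement are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, and the numbers in the usage example are made up for illustration and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper.
    print(cer("hello", "hallo"))
    # Hypothetical error rates: halving the error is a 50% relative improvement.
    print(relative_improvement(0.5, 0.25))
```

A relative figure such as the one reported in the abstract describes the fraction of the baseline error that was removed, not an absolute drop in percentage points.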

Articles published on Automatic Speech Recognition Systems

Automatic Speech Recognition System to Record Progress Notes in a Mobile EHR: A Pilot Study.

Development of an ASR System for Medical Conversations.

Useful blunders: Can automated speech recognition errors improve downstream dementia classification?

The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy

Automated Measures of Syntactic Complexity in Natural Speech Production: Older and Younger Adults as a Case Study.

Analytical review of methods for the automatic analysis of extralinguistic components of spontaneous speech

Developing children's ASR system under low-resource conditions using end-to-end architecture

Using Voice Technologies to Support Disabled People

An investigation and analysis on automatic speech recognition systems

Linguistic disparities in cross-language automatic speech recognition transfer from Arabic to Tashlhiyt

ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Real-Time Automatic Continuous Speech Recognition System for Kannada Language/Dialects

Training speech recognition models at the National Library of Sweden

Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters

Arabic subtitles generated by artificial intelligence: insights from Veed.io's Jordanian Arabic automatic speech recognition system

Design of an automatic speech recognition system based on laryngeal vibration

Enhancing Armenian Automatic Speech Recognition Performance: A Comprehensive Strategy for Speed, Accuracy, and Linguistic Refinement

Toward Effective Aircraft Call Sign Detection Using Fuzzy String-Matching between ASR and ADS-B Data

Accents in Speech Recognition through the Lens of a World Englishes Evaluation Set

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
