Models For Speech Recognition Research Articles

This work introduces updated transcripts, disfluency annotations, and word timings for FluencyBank, which we refer to as FluencyBank Timestamped. This data set will enable thorough analysis of how speech processing models (such as speech recognition and disfluency detection models) perform when evaluated with typical speech versus speech from people who stutter (PWS). We update the FluencyBank data set, which includes audio recordings from adults who stutter, to explore the robustness of speech processing models. Our update (semi-automated with manual review) includes new transcripts with timestamps and disfluency labels corresponding to each token in the transcript. Our disfluency labels capture typical disfluencies (filled pauses, repetitions, revisions, and partial words), and we explore how speech model performance compares for Switchboard (typical speech) and FluencyBank Timestamped. We present benchmarks for three speech tasks: intended speech recognition, text-based disfluency detection, and audio-based disfluency detection. For the first task, we evaluate how well Whisper performs for intended speech recognition (i.e., transcribing speech without disfluencies). For the other two tasks, we evaluate how well a Bidirectional Encoder Representations from Transformers (BERT) text-based model and a Whisper audio-based model perform for disfluency detection. We select these models, BERT and Whisper, as they have shown high accuracies on a broad range of tasks in their language and audio domains, respectively. For the transcription task, we calculate an intended speech word error rate (isWER) between the model's output and the speaker's intended speech (i.e., speech without disfluencies). We find that isWER is comparable between Switchboard and FluencyBank Timestamped, but that Whisper transcribes filled pauses and partial words at higher rates in the latter data set. Within FluencyBank Timestamped, isWER increases with stuttering severity. For the disfluency detection tasks, we find that the models detect filled pauses, revisions, and partial words relatively well in FluencyBank Timestamped, but performance drops substantially for repetitions because the models are unable to generalize to the different types of repetitions (e.g., multiple repetitions and sound repetitions) from PWS. We hope that FluencyBank Timestamped will allow researchers to explore closing performance gaps between typical speech and speech from PWS. Our analysis shows that there are gaps in speech recognition and disfluency detection performance between typical speech and speech from PWS. We hope that FluencyBank Timestamped will contribute to more advancements in training robust speech processing models.
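The isWER described in the abstract can be illustrated with a minimal sketch. It assumes each transcript token carries a boolean disfluency label and that the metric is a standard word-level edit distance against the disfluency-free reference; the function names (`edit_distance`, `intended_speech`, `is_wer`) and the example utterance are hypothetical and are not taken from the paper, which may normalize or score text differently.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def intended_speech(tokens, is_disfluent):
    """Keep only the tokens that are not labeled as disfluent."""
    return [t for t, d in zip(tokens, is_disfluent) if not d]


def is_wer(tokens, is_disfluent, hypothesis):
    """Assumed isWER: edit distance to the intended speech, divided by its length."""
    ref = intended_speech(tokens, is_disfluent)
    if not ref:
        raise ValueError("reference contains no fluent tokens")
    return edit_distance(ref, hypothesis.split()) / len(ref)


# Hypothetical utterance: a filled pause ("uh") and a repeated word ("want").
tokens = ["i", "uh", "want", "want", "to", "go"]
labels = [False, True, True, False, False, False]

print(is_wer(tokens, labels, "i want to go"))     # model dropped the disfluencies
print(is_wer(tokens, labels, "i uh want to go"))  # model transcribed the filled pause
```

Under this definition, a model that transcribes a filled pause is penalized with an insertion error even though the word was actually spoken, which is consistent with the abstract's observation about Whisper transcribing filled pauses and partial words.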

Objective. Brain-computer interfaces (BCIs) have the potential to preserve or restore speech in patients with neurological disorders that weaken the muscles involved in speech production. However, successful training of low-latency speech synthesis and recognition models requires alignment of neural activity with intended phonetic or acoustic output with high temporal precision. This is particularly challenging in patients who cannot produce audible speech, as ground truth with which to pinpoint neural activity synchronized with speech is not available. Approach. In this study, we present a new iterative algorithm for neural voice activity detection (nVAD) called iterative alignment discovery dynamic time warping (IAD-DTW) that integrates DTW into the loss function of a deep neural network (DNN). The algorithm is designed to discover the alignment between a patient's electrocorticographic (ECoG) neural responses and their attempts to speak during collection of data for training BCI decoders for speech synthesis and recognition. Main results. To demonstrate the effectiveness of the algorithm, we tested its accuracy in predicting the onset and duration of acoustic signals produced by able-bodied patients with intact speech undergoing short-term diagnostic ECoG recordings for epilepsy surgery. We simulated a lack of ground truth by randomly perturbing the temporal correspondence between neural activity and an initial single estimate for all speech onsets and durations. We examined the model's ability to overcome these perturbations to estimate ground truth. IAD-DTW showed no notable degradation (<1% absolute decrease in accuracy) in performance in these simulations, even in the case of maximal misalignments between speech and silence. Significance. IAD-DTW is computationally inexpensive and can be easily integrated into existing DNN-based nVAD approaches, as it pertains only to the final loss computation. This approach makes it possible to train speech BCI algorithms using ECoG data from patients who are unable to produce audible speech, including those with Locked-In Syndrome.
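To make the role of dynamic time warping concrete, here is a minimal sketch of classic DTW applied to a frame-wise speech/silence sequence. This is plain textbook DTW, not the authors' IAD-DTW: the abstract does not specify how DTW enters the loss, so the function name `dtw_cost`, the absolute-difference local cost, and the toy sequences are all assumptions made for illustration.

```python
def dtw_cost(a, b):
    """Classic DTW: minimum cumulative |a_i - b_j| over monotonic alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] holds the best cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j],      # a advances, b repeats
                                  D[i][j - 1],      # b advances, a repeats
                                  D[i - 1][j - 1])  # both advance
    return D[n][m]


# Hypothetical frame-wise voice activity: 1 = speech, 0 = silence.
predicted = [0, 0, 1, 1, 0]     # e.g. thresholded model output
estimate = [0, 1, 1, 1, 0, 0]   # temporally shifted label estimate

print(dtw_cost(predicted, estimate))  # shifted onsets can be absorbed by warping
print(dtw_cost([0, 1], [1, 1]))       # a genuine mismatch still costs something
```

Because the warping path can stretch or compress either sequence, a cost of this kind tolerates timing offsets between the labels and the neural activity, which is the property the abstract relies on when the true speech onsets are unknown.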

Articles published on Models For Speech Recognition

Reverb and Noise as Real-World Effects in Speech Recognition Models: A Study and a Proposal of a Feature Set

Development of a code-switched Hindi-Marathi dataset and transformer-based architecture for enhanced speech recognition using dynamic switching algorithms

Models and Methods for Speech Separation in Digital Systems

Automatic Screening for Children with Speech Disorder Using Automatic Speech Recognition: Opportunities and Challenges

Adversarial Attack and Defense for Commercial Black-box Chinese-English Speech Recognition Systems

How People Living With Amyotrophic Lateral Sclerosis Use Personalized Automatic Speech Recognition Technology to Support Communication.

Comparative study of CNN, LSTM and hybrid CNN-LSTM model in Amazigh speech recognition using spectrogram feature extraction and different gender and age dataset

Sub-layer feature fusion applied to transformer model for automatic speech recognition

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

Employing deep learning model to evaluate speech information in acoustic simulations of cochlear implants

Audiovisual Speech Recognition Method Based on Connectionism

FluencyBank Timestamped: An Updated Data Set for Disfluency Detection and Automatic Intended Speech Recognition.

Design of voice command recognition chip based on heterogeneous acceleration

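RANCANG BANGUN APLIKASI MOBILE PEMBELAJARAN HANACARAKA BALI MENGGUNAKAN METODE CNN BERBASIS CLOUD COMPUTING [Design and Development of a Cloud Computing-Based Mobile Application for Learning Balinese Hanacaraka Using the CNN Method]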

Application of the Conformer model for Kazakh speech recognition

Dynamical predictive coding with reservoir computing performs noise-robust multi-sensory speech recognition.

Speech recognition and intelligent translation under multimodal human–computer interaction system

Meta-Adaptable-Adapter: Efficient adaptation of self-supervised models for low-resource speech recognition

Iterative alignment discovery of speech-associated neural activity.

Automatic speech recognition (ASR) for the diagnosis of pronunciation of speech sound disorders in Korean children.
