Large Vocabulary Speech Recognition Research Articles

Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). Very high word error rates are reported (on the order of 70–80% for speaker-dependent and 80–95% for speaker-independent systems), highlighting the difficulty of this task. Nevertheless, in terms of cluster quality and word segmentation metrics, we show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system’s discovered clusters are still less pure than those of unsupervised term discovery systems, but provide far greater coverage.
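
Since word error rate is the headline measure in the abstract above, a minimal sketch of the standard definition may help in reading the reported figures: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the scoring tool the authors used, nor how the unsupervised cluster labels were mapped to words before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 100% when the hypothesis is much longer than the reference.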

Acoustic modeling based on deep architectures has recently achieved remarkable success, substantially improving recognition accuracy in several automatic speech recognition (ASR) tasks. For distant speech recognition, multi-channel deep neural network (DNN) approaches rely on the modeling capability of the DNN to learn a suitable representation of distant speech directly from its multi-channel source. In this model-based combination of multiple microphones, features from each channel are concatenated and used together as the input to the DNN. This allows the multi-channel audio to be integrated for acoustic modeling without any pre-processing steps. Despite the powerful modeling capabilities of DNNs, an environmental mismatch due to noise and reverberation may cause severe performance degradation when features are simply fed to a DNN without a feature enhancement step. In this paper, we introduce a nonlinear bottleneck feature mapping approach that uses a DNN to transform noisy and reverberant features into their clean versions. The bottleneck features derived from the DNN are used as a teacher signal because they contain information relevant to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. The individual and combined impacts of beamforming and speaker adaptation techniques, along with the feature mapping, are examined for distant large vocabulary speech recognition using single and multiple far-field microphones. As an alternative to beamforming, experiments with concatenating multiple channel features are conducted. The experimental results on the AMI meeting corpus show that feature mapping, used in combination with beamforming and speaker adaptation, yields a distant speech recognition word error rate (WER) below 50%, using a DNN for acoustic modeling.
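
The channel concatenation described in the abstract above can be illustrated with a small sketch: for each time frame, the feature vectors from all microphones are joined into one longer vector that would serve as the network input. The function name `concatenate_channels`, the list-of-lists layout, and the requirement that channels are already frame-aligned are assumptions made for illustration; the abstract does not specify the feature type, dimensionality, or any context splicing.

```python
from typing import List

Frame = List[float]


def concatenate_channels(channels: List[List[Frame]]) -> List[Frame]:
    """Join per-channel feature frames into one vector per time step.

    channels[c][t] is the feature vector of microphone c at frame t.
    Assumption: all channels are time-aligned with equal frame counts.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    num_frames = len(channels[0])
    if any(len(channel) != num_frames for channel in channels):
        raise ValueError("all channels must have the same number of frames")

    # For frame t, append the features of channel 0, then channel 1, etc.
    return [
        [value for channel in channels for value in channel[t]]
        for t in range(num_frames)
    ]


if __name__ == "__main__":
    mic_a = [[1.0, 2.0], [3.0, 4.0]]  # 2 frames, 2 features each
    mic_b = [[5.0, 6.0], [7.0, 8.0]]
    print(concatenate_channels([mic_a, mic_b]))
```

With C channels of D features each, every output frame has C × D values, so the input layer grows linearly with the number of microphones.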

Articles published on Large Vocabulary Speech Recognition

Deep convolutional neural networks-based features for Indonesian large vocabulary speech recognition

Audio to Sign Language Translator

Keyword Search Based on Unsupervised Pre-Trained Acoustic Models

RETRACTED: Design of english translation platform based on embedded system software simulation

Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian

Acoustic model topology optimization for large vocabulary speech recognition

A usage of the syllable unit based on morphological statistics in Korean large vocabulary continuous speech recognition system

Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition

Evolution-Strategy-Based Automation of System Development for High-Performance Speech Recognition

Ameliorated language modelling for lecture speech recognition of Indian English

Automatic Speech Recognition Errors Detection and Correction: A Review

Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Lattice Based Transcription Loss for End-to-End Speech Recognition

A segmental framework for fully-unsupervised large-vocabulary speech recognition

Croatian Large Vocabulary Automatic Speech Recognition

Differentiable Pooling for Unsupervised Acoustic Model Adaptation

Feature mapping using far-field microphones for distant speech recognition

Building DNN acoustic models for large vocabulary speech recognition

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

From Feedforward to Recurrent LSTM Neural Networks for Language Modeling
