Large Vocabulary Continuous Speech Recognition Research Articles

To learn auditory filterbanks, recently, we have proposed an unsupervised learning model based on convolutional restricted Boltzmann machine RBM with rectified linear units. In this paper, theory, training algorithm of our proposed model, and detailed analysis of learned filterbank are being presented. Learning of the model with different databases shows that the model is able to learn cochlear-like impulse responses that are localized in frequency-domain. An auditory-like scale obtained from filterbanks learned from clean and noisy datasets resembles the Mel scale, which is known to mimic perceptually relevant aspect of speech. We have experimented with both cepstral denoted as ConvRBM-CC as well as filterbank features denoted as ConvRBM-BANK. On large vocabulary continuous speech recognition task, we achieved relative improvement of 7.21-17.8% in word error rate WER compared to Mel frequency cepstral coefficient MFCC features and 1.35-6.82% compared to Mel filterbank FBANK features. On AURORA 4 multicondition training database, the relative improvement in WER by 4.8-13.65% was achieved using a Hybrid Deep Neural Network-Hidden Markov Model DNN-HMM system with ConvRBM-CC features. Using ConvRBM-BANK features, we achieve absolute reduction of 1.25-3.85% in WER on AURORA 4 test sets compared to FBANK features. A context-dependent DNN-HMM system further improves performance with a relative improvement of 3.6-4.6% on an average for bigram 5k and tri-gram 5k language models. Hence, our proposed learned filterbank performs better than traditional MFCC and Mel-filterbank features for both clean and multicondition automatic speech recognition ASR tasks. A system combination of ConvRBM-BANK and FBANK features further improve performance in all ASR tasks. Cross-domain experiments where subband filters trained on one database are used for the ASR task of another database show that model learns generalized representations of speech signals.

A large portion of the audio files distributed over the Internet or those stored in personal and corporate media archives are in a compressed form. There exist several compression techniques and algorithms but it is the MPEG Layer-3 (known as MP3) that has achieved a really wide popularity in general audio coding, and in speech, too. However, the algorithm is lossy in nature and introduces distortion into spectral and temporal characteristics of a signal. In this paper we study its impact on automatic speech recognition (ASR). We show that with decreasing MP3 bitrates the major source of ASR performance degradation is deep spectral valleys (i.e. bins with almost zero energy) caused by the masking effect of the MP3 algorithm. We demonstrate that these unnatural gaps in spectrum can be effectively compensated by adding a certain amount of noise to the distorted signal. We provide theoretical background for this approach where we show that the added noise affects mainly the spectral valleys. They are filled by the noise while the spectral bins with speech remain almost unchanged. This helps to restore a more natural shape of log spectrum and cepstrum, and consequently has a positive impact on ASR performance. In our previous work, we have proposed two types of the signal dithering (noise addition) technique, one applied globally, the other in a more selective way. In this paper, we offer a more detailed insight into their performance. We provide results from many experiments where we test them in various scenarios, using a large vocabulary continuous speech recognition (LVCSR) system, acoustic models based on gaussian-mixture model (GMM) as well as on deep-neural network (DNN), and multiple speech databases in three languages (Czech, English and German). Our results prove that both the proposed techniques, and the selective dithering method, in particular, yield consistent compensation of the negative impact of the MP3 compressed speech on ASR performance.

Large Vocabulary Continuous Speech Recognition Research Articles

Related Topics

Articles published on Large Vocabulary Continuous Speech Recognition

Experimenting with lipreading for large vocabulary continuous speech recognition

Resource2Vec: Linked Data distributed representations for term discovery in automatic speech recognition

Improvements in Serbian Speech Recognition Using Sequence-Trained Deep Neural Networks

Offline to online speaker adaptation for real-time deep neural network based LVCSR systems

Speech bottleneck feature extraction method based on overlapping group lasso sparse deep neural network

Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition

Joint Learning of Distance Metric and Query Model for Posteriorgram-Based Keyword Search

Classification-based spoken text selection for LVCSR language modeling

HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications

A Fast Speaker Adaptation Approach for Large Vocabulary Continuous Speech Recognition

Modelling Semantic Context of OOV Words in Large Vocabulary Continuous Speech Recognition

Phone Synchronous Speech Recognition With CTC Lattices

Reducing latency for language identification based on large-vocabulary continuous speech recognition

Novel Unsupervised Auditory Filterbank Learning Using Convolutional RBM for Speech Recognition

Dithering techniques in automatic recognition of speech corrupted by MP3 compression: Analysis, solutions and experiments

Maximum Likelihood Nonlinear Transformations Based on Deep Neural Networks

The use of pitch in Large-Vocabulary Continuous Speech Recognition System

Training Deep Bidirectional LSTM Acoustic Model for LVCSR by a Context-Sensitive-Chunk BPTT Approach

음성인식 기반 응급상황관제

Wise teachers train better DNN acoustic models

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Large Vocabulary Continuous Speech Recognition Research Articles

Related Topics

Articles published on Large Vocabulary Continuous Speech Recognition

Experimenting with lipreading for large vocabulary continuous speech recognition

Resource2Vec: Linked Data distributed representations for term discovery in automatic speech recognition

Improvements in Serbian Speech Recognition Using Sequence-Trained Deep Neural Networks

Offline to online speaker adaptation for real-time deep neural network based LVCSR systems

Speech bottleneck feature extraction method based on overlapping group lasso sparse deep neural network

Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition

Joint Learning of Distance Metric and Query Model for Posteriorgram-Based Keyword Search

Classification-based spoken text selection for LVCSR language modeling

HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications

A Fast Speaker Adaptation Approach for Large Vocabulary Continuous Speech Recognition

Modelling Semantic Context of OOV Words in Large Vocabulary Continuous Speech Recognition

Phone Synchronous Speech Recognition With CTC Lattices

Reducing latency for language identification based on large-vocabulary continuous speech recognition

Novel Unsupervised Auditory Filterbank Learning Using Convolutional RBM for Speech Recognition

Dithering techniques in automatic recognition of speech corrupted by MP3 compression: Analysis, solutions and experiments

Maximum Likelihood Nonlinear Transformations Based on Deep Neural Networks

The use of pitch in Large-Vocabulary Continuous Speech Recognition System

Training Deep Bidirectional LSTM Acoustic Model for LVCSR by a Context-Sensitive-Chunk BPTT Approach

음성인식 기반 응급상황관제

Wise teachers train better DNN acoustic models