Abstract

Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure such as local connectivity, weight sharing, and pooling in CNNs exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important to deal with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
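
To make the convolution and pooling structure described above concrete, the following is a minimal NumPy sketch of one convolution ply followed by a max-pooling ply, applied along the frequency axis of a single frame of filterbank features. All sizes here (40 bands, filter width 8, 4 feature maps, pooling width 2) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Minimal sketch: a 1-D convolution ply along the frequency axis of one
# input frame, followed by a max-pooling ply. Sizes are illustrative.

rng = np.random.default_rng(0)

n_bands = 40            # log-mel filterbank channels (assumed)
filter_size = 8         # frequency bands covered by each local filter
n_maps = 4              # number of feature maps (small, for illustration)
pool_size = 2           # max-pooling width along frequency

x = rng.standard_normal(n_bands)                 # one frame of features
W = rng.standard_normal((n_maps, filter_size))   # shared filters (full weight sharing)
b = np.zeros(n_maps)

# Convolution ply: each feature map slides one shared filter over frequency.
n_out = n_bands - filter_size + 1
conv = np.empty((n_maps, n_out))
for j in range(n_maps):
    for k in range(n_out):
        conv[j, k] = np.tanh(x[k:k + filter_size] @ W[j] + b[j])

# Pooling ply: the max over non-overlapping frequency windows gives a
# tolerance to small shifts of spectral patterns along the frequency axis.
n_pool = n_out // pool_size
pooled = conv[:, :n_pool * pool_size].reshape(n_maps, n_pool, pool_size).max(axis=2)

print(conv.shape, pooled.shape)   # (4, 33) -> (4, 16)
```

Note that a pattern shifted by one frequency band still falls into the same or an adjacent pooling window, which is the source of the shift invariance the abstract refers to.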

Highlights

  • The aim of automatic speech recognition (ASR) is the transcription of human speech into spoken words

  • Our hybrid convolutional neural network (CNN)-hidden Markov model (HMM) approach delegates temporal variability to the HMM, while convolution along the frequency axis creates a degree of invariance to the small frequency shifts that normally occur in actual speech signals due to speaker differences

  • We have proposed a new limited-weight-sharing scheme that can handle speech features better than the full weight sharing that is standard in previous CNN architectures, such as those used in image processing (a sketch contrasting the two follows this list)
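
The following hypothetical sketch contrasts limited weight sharing (LWS) with the full weight sharing shown earlier: the convolution units are divided into sections, each pooling section has its own filter weights, and weights are shared only among the units that feed the same pool. The section layout (band shift of 2, two convolution units pooled per section) is an assumption for illustration, not the paper's exact setting.

```python
import numpy as np

# Sketch of limited weight sharing: one independent weight tensor per
# pooling section, shared only among the convolution units pooled together.

rng = np.random.default_rng(1)

n_bands, filter_size, n_maps = 40, 8, 4
pool_size = 2                      # convolution units pooled per section
shift = 2                          # band shift between adjacent units

# Assumed layout: section k covers units at bands (k*pool_size + p) * shift.
n_sections = (n_bands - filter_size) // (pool_size * shift)

# The LWS part: a separate filter set per section.
W = rng.standard_normal((n_sections, n_maps, filter_size))
b = np.zeros((n_sections, n_maps))

x = rng.standard_normal(n_bands)
pooled = np.empty((n_sections, n_maps))
for k in range(n_sections):
    acts = []
    for p in range(pool_size):
        lo = (k * pool_size + p) * shift
        # Units within a section share W[k]; other sections use different weights.
        acts.append(np.tanh(x[lo:lo + filter_size] @ W[k].T + b[k]))
    pooled[k] = np.max(acts, axis=0)   # pooling happens within the section only

print(pooled.shape)   # (n_sections, n_maps)
```

Because spectral patterns at low and high frequencies look quite different, letting each frequency section learn its own filters is the motivation for LWS over a single filter set shared across all bands.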

Summary

INTRODUCTION

The aim of automatic speech recognition (ASR) is the transcription of human speech into spoken words. Several recent modeling advances have each had a significant impact on performance. This historical deconstruction is important because the premise of the present paper is that very wide input contexts and domain-appropriate representational invariance are so important to the recent success of neural-network-based acoustic models that an ANN-HMM architecture embodying these advantages can, in principle, outperform other ANN architectures of potentially unlimited depth for at least some tasks. We continue to use HMMs in our model to handle variation along the time axis, but apply convolution on the frequency axis of the spectrogram. This endows the learned acoustic features with a tolerance to small shifts in frequency, such as those that may arise from differing vocal tract lengths, and has led to a significant improvement over DNNs of similar complexity on TIMIT speaker-independent phone recognition, with a relative phone error rate reduction of about 8.5%.
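
As a rough illustration of the hybrid decoding pipeline described above, the sketch below converts per-frame state posteriors from a stand-in acoustic model into scaled log-likelihoods and runs a simple Viterbi pass over them. The placeholder `cnn_posteriors` function and the uniform priors and transition matrix are assumptions made for brevity; they stand in for a trained CNN and estimated HMM parameters.

```python
import numpy as np

# Hybrid NN-HMM decoding sketch: per-frame state posteriors p(s | x_t) are
# converted to scaled likelihoods p(x_t | s) ~ p(s | x_t) / p(s), which
# then serve as emission scores for standard HMM (Viterbi) decoding.

rng = np.random.default_rng(2)
n_states, n_frames = 5, 10

def cnn_posteriors(n_frames, n_states):
    """Placeholder for the CNN forward pass: softmax outputs per frame."""
    logits = rng.standard_normal((n_frames, n_states))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

priors = np.full(n_states, 1.0 / n_states)       # state priors (assumed uniform)
post = cnn_posteriors(n_frames, n_states)
scaled_loglik = np.log(post) - np.log(priors)    # emission scores for the HMM

# Simple Viterbi recursion over a uniform transition matrix (assumption).
logA = np.log(np.full((n_states, n_states), 1.0 / n_states))
delta = scaled_loglik[0].copy()
for t in range(1, n_frames):
    delta = (delta[:, None] + logA).max(axis=0) + scaled_loglik[t]

print(delta.max())   # score of the best state sequence
```

The HMM layer thus absorbs temporal variability, leaving the convolutional front end free to focus on frequency-axis invariance.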

DEEP NEURAL NETWORKS: A REVIEW
CONVOLUTIONAL NEURAL NETWORKS AND THEIR USE IN ASR
Organization of the Input Data to the CNN
Convolution Ply
Pooling Ply
Learning Weights in the CNN
Pretraining CNN Layers
Treatment of Energy Features
The Overall CNN Architecture
Benefits of CNNs for ASR
CNN WITH LIMITED WEIGHT SHARING FOR ASR
Pretraining of LWS-CNN
EXPERIMENTS
Speech Data and Analysis
TIMIT Phone Recognition Results
Table: 80 Feature Maps per Frequency Band for LWS
Large Vocabulary Speech Recognition Results
Findings
CONCLUSIONS