Abstract

Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure such as local connectivity, weight sharing, and pooling in CNNs exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important to deal with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
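
To make the convolution and pooling structure described above concrete, the following is a minimal NumPy sketch of one convolution ply followed by a max-pooling ply, applied along the frequency axis of a single frame of filterbank features. All sizes here (40 bands, filter width 8, 4 feature maps, pooling width 2) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Minimal sketch: a 1-D convolution ply along the frequency axis of one
# input frame, followed by a max-pooling ply. Sizes are illustrative.

rng = np.random.default_rng(0)

n_bands = 40            # log-mel filterbank channels (assumed)
filter_size = 8         # frequency bands covered by each local filter
n_maps = 4              # number of feature maps (small, for illustration)
pool_size = 2           # max-pooling width along frequency

x = rng.standard_normal(n_bands)                 # one frame of features
W = rng.standard_normal((n_maps, filter_size))   # shared filters (full weight sharing)
b = np.zeros(n_maps)

# Convolution ply: each feature map slides one shared filter over frequency.
n_out = n_bands - filter_size + 1
conv = np.empty((n_maps, n_out))
for j in range(n_maps):
    for k in range(n_out):
        conv[j, k] = np.tanh(x[k:k + filter_size] @ W[j] + b[j])

# Pooling ply: the max over non-overlapping frequency windows gives a
# tolerance to small shifts of spectral patterns along the frequency axis.
n_pool = n_out // pool_size
pooled = conv[:, :n_pool * pool_size].reshape(n_maps, n_pool, pool_size).max(axis=2)

print(conv.shape, pooled.shape)   # (4, 33) -> (4, 16)
```

Note that a pattern shifted by one frequency band still falls into the same or an adjacent pooling window, which is the source of the shift invariance the abstract refers to.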

Highlights

  • The aim of automatic speech recognition (ASR) is the transcription of human speech into spoken words

  • Our hybrid convolutional neural network (CNN)-hidden Markov model (HMM) approach delegates temporal variability to the HMM, while convolution along the frequency axis creates a degree of invariance to the small frequency shifts that normally occur in actual speech signals due to speaker differences

  • We have proposed a new limited-weight-sharing scheme that can handle speech features better than the full weight sharing that is standard in previous CNN architectures, such as those used in image processing (a sketch contrasting the two follows this list)
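
The following hypothetical sketch contrasts limited weight sharing (LWS) with the full weight sharing shown earlier: the convolution units are divided into sections, each pooling section has its own filter weights, and weights are shared only among the units that feed the same pool. The section layout (band shift of 2, two convolution units pooled per section) is an assumption for illustration, not the paper's exact setting.

```python
import numpy as np

# Sketch of limited weight sharing: one independent weight tensor per
# pooling section, shared only among the convolution units pooled together.

rng = np.random.default_rng(1)

n_bands, filter_size, n_maps = 40, 8, 4
pool_size = 2                      # convolution units pooled per section
shift = 2                          # band shift between adjacent units

# Assumed layout: section k covers units at bands (k*pool_size + p) * shift.
n_sections = (n_bands - filter_size) // (pool_size * shift)

# The LWS part: a separate filter set per section.
W = rng.standard_normal((n_sections, n_maps, filter_size))
b = np.zeros((n_sections, n_maps))

x = rng.standard_normal(n_bands)
pooled = np.empty((n_sections, n_maps))
for k in range(n_sections):
    acts = []
    for p in range(pool_size):
        lo = (k * pool_size + p) * shift
        # Units within a section share W[k]; other sections use different weights.
        acts.append(np.tanh(x[lo:lo + filter_size] @ W[k].T + b[k]))
    pooled[k] = np.max(acts, axis=0)   # pooling happens within the section only

print(pooled.shape)   # (n_sections, n_maps)
```

Because spectral patterns at low and high frequencies look quite different, letting each frequency section learn its own filters is the motivation for LWS over a single filter set shared across all bands.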

Summary

INTRODUCTION

The aim of automatic speech recognition (ASR) is the transcription of human speech into spoken words. Several recent modeling advances have each had a significant impact on performance. This historical deconstruction is important because the premise of the present paper is that very wide input contexts and domain-appropriate representational invariance are so important to the recent success of neural-network-based acoustic models that an ANN-HMM architecture embodying these advantages can, in principle, outperform other ANN architectures of potentially unlimited depth for at least some tasks. We continue to use HMMs in our model to handle variation along the time axis, but apply convolution on the frequency axis of the spectrogram. This endows the learned acoustic features with a tolerance to small shifts in frequency, such as those that may arise from differing vocal tract lengths, and has led to a significant improvement over DNNs of similar complexity on TIMIT speaker-independent phone recognition, with a relative phone error rate reduction of about 8.5%.
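
As a rough illustration of the hybrid decoding pipeline described above, the sketch below converts per-frame state posteriors from a stand-in acoustic model into scaled log-likelihoods and runs a simple Viterbi pass over them. The placeholder `cnn_posteriors` function and the uniform priors and transition matrix are assumptions made for brevity; they stand in for a trained CNN and estimated HMM parameters.

```python
import numpy as np

# Hybrid NN-HMM decoding sketch: per-frame state posteriors p(s | x_t) are
# converted to scaled likelihoods p(x_t | s) ~ p(s | x_t) / p(s), which
# then serve as emission scores for standard HMM (Viterbi) decoding.

rng = np.random.default_rng(2)
n_states, n_frames = 5, 10

def cnn_posteriors(n_frames, n_states):
    """Placeholder for the CNN forward pass: softmax outputs per frame."""
    logits = rng.standard_normal((n_frames, n_states))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

priors = np.full(n_states, 1.0 / n_states)       # state priors (assumed uniform)
post = cnn_posteriors(n_frames, n_states)
scaled_loglik = np.log(post) - np.log(priors)    # emission scores for the HMM

# Simple Viterbi recursion over a uniform transition matrix (assumption).
logA = np.log(np.full((n_states, n_states), 1.0 / n_states))
delta = scaled_loglik[0].copy()
for t in range(1, n_frames):
    delta = (delta[:, None] + logA).max(axis=0) + scaled_loglik[t]

print(delta.max())   # score of the best state sequence
```

The HMM layer thus absorbs temporal variability, leaving the convolutional front end free to focus on frequency-axis invariance.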

DEEP NEURAL NETWORKS: A REVIEW
CONVOLUTIONAL NEURAL NETWORKS AND THEIR USE IN ASR
Organization of the Input Data to the CNN
Convolution Ply
Pooling Ply
Learning Weights in the CNN
Pretraining CNN Layers
Treatment of Energy Features
The Overall CNN Architecture
Benefits of CNNs for ASR
CNN WITH LIMITED WEIGHT SHARING FOR ASR
Pretraining of LWS-CNN
EXPERIMENTS
Speech Data and Analysis
TIMIT Phone Recognition Results
Table: 80 Feature Maps per Frequency Band for LWS
Large Vocabulary Speech Recognition Results
Findings
CONCLUSIONS