Abstract

Convolutional neural networks (CNNs) have recently been used for acoustic modeling and feature extraction in speech recognition systems, taking the speech spectrogram or even the raw speech signal as input. In this paper, we propose using a CNN to learn a filter bank and extract robust features from the noisy speech spectrum. In the proposed method, the CNN input is the noisy speech spectrum, its output is the denoised logarithm of Mel filter bank energies (LMFBs), and the convolution filter size is fixed. Furthermore, we propose using multiple CNNs with different convolution filter sizes to provide different frequency resolutions for feature extraction from the speech spectrum. We call this method Multiresolution CNN (MRCNN). We combine the outputs of the multiple CNNs in two ways. In the first, we concatenate all outputs to construct the feature vector. In the second, we select some outputs from each CNN based on its convolution filter size and concatenate them to construct the feature vector. Recognition accuracy on the Aurora 2 database shows that an MRCNN with two CNNs, using 1×6 and 1×20 convolution filter sizes respectively, outperforms single CNNs and other MRCNN settings in extracting robust features.
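
The two ways of combining the CNN outputs can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: each branch is reduced to a single untrained 1×k convolution along the frequency axis, a ReLU, and a linear projection, and the number of spectrum bins (`N_BINS`), the number of outputs per branch (`N_OUT`), and the index ranges passed as `keep` are all assumptions chosen for illustration. The abstract does not state how outputs are selected per filter size, so the split used here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BINS = 129  # assumed number of spectrum bins per frame
N_OUT = 23    # assumed number of LMFB-like outputs per branch

def make_branch(filter_size):
    """Random (untrained) parameters for one branch with a 1 x filter_size kernel."""
    return {
        "kernel": rng.standard_normal(filter_size) / filter_size,
        "proj": rng.standard_normal((N_BINS, N_OUT)) / np.sqrt(N_BINS),
    }

def branch_forward(spectrum, branch):
    """Convolve each frame along frequency, apply ReLU, project to N_OUT outputs."""
    conv = np.stack(
        [np.convolve(frame, branch["kernel"], mode="same") for frame in spectrum]
    )
    return np.maximum(conv, 0.0) @ branch["proj"]

def mrcnn_features(spectrum, branches, keep=None):
    """Concatenate branch outputs; `keep` optionally lists output indices per branch."""
    outputs = [branch_forward(spectrum, b) for b in branches]
    if keep is not None:
        outputs = [out[:, idx] for out, idx in zip(outputs, keep)]
    return np.concatenate(outputs, axis=1)

# Stand-in for a noisy magnitude spectrum: 10 frames x N_BINS bins.
spectrum = np.abs(rng.standard_normal((10, N_BINS)))

# Two branches with the 1x6 and 1x20 filter sizes named in the abstract.
branches = [make_branch(6), make_branch(20)]

# First way: concatenate every output of every branch.
all_feats = mrcnn_features(spectrum, branches)

# Second way: keep a subset of outputs per branch (hypothetical index split).
keep = [np.arange(12, 23), np.arange(0, 12)]
some_feats = mrcnn_features(spectrum, branches, keep=keep)

print(all_feats.shape, some_feats.shape)
```

In a trained system, each branch would be fitted to map the noisy spectrum to denoised LMFB targets before its outputs are concatenated; the sketch only shows how the feature vector is assembled in each case.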
