Abstract

State-of-the-art speech recognition is in a golden era as the convolutional neural network (CNN) has become the leading model in this domain. CNN-based acoustic models have shown significant improvements in speech recognition tasks. These improvements stem from the special components of the CNN, i.e., local filters, weight sharing, and pooling. However, a lack of core understanding renders this powerful model a black box. Although CNNs perform well in speech recognition, further investigation can help achieve better recognition rates. Pooling is a very important component of the CNN: it reduces the dimensionality of the feature map and offers a compact feature representation. Various pooling methods, such as max pooling, average pooling, stochastic pooling, mixed pooling, $L_p$ pooling, multi-scale orderless pooling, and spectral pooling, have their own advantages and disadvantages. In this paper, we deeply explore state-of-the-art pooling methods for speech recognition tasks and investigate which pooling method performs well under which conditions. This work evaluates different pooling methods with different architectures on a Hindi speech dataset. The experimental results show that max pooling performs well on clean speech, while stochastic pooling works better in noisy environments.
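To make the pooling methods compared in the abstract concrete, the sketch below implements three of them (max, average, and stochastic pooling) over a 2-D feature map with NumPy. This is an illustrative implementation, not the paper's code; the function name `pool2d` and its parameters are our own, and stochastic pooling is shown in its usual training-time form, where one activation per window is sampled with probability proportional to its (non-negative) value.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max", rng=None):
    """Pool a 2-D feature map over strided windows.

    mode: "max"   - keep the largest activation in each window
          "avg"   - average the activations in each window
          "stoch" - sample one activation per window with probability
                    proportional to its value (stochastic pooling)
    """
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=float)
    rng = rng or np.random.default_rng(0)
    for i in range(out_h):
        for j in range(out_w):
            win = fmap[i * stride:i * stride + size,
                       j * stride:j * stride + size].ravel()
            if mode == "max":
                out[i, j] = win.max()
            elif mode == "avg":
                out[i, j] = win.mean()
            elif mode == "stoch":
                total = win.sum()
                # fall back to uniform sampling if all activations are zero
                p = win / total if total > 0 else np.full(win.size, 1.0 / win.size)
                out[i, j] = rng.choice(win, p=p)
            else:
                raise ValueError(f"unknown mode: {mode}")
    return out

# a 4x4 feature map pooled down to 2x2 with each method
fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 3., 2.],
                 [2., 2., 0., 4.]])
print(pool2d(fmap, mode="max"))    # deterministic: largest value per window
print(pool2d(fmap, mode="avg"))    # deterministic: mean value per window
print(pool2d(fmap, mode="stoch"))  # random: varies with the rng seed
```

Each method halves both dimensions of the feature map here (2x2 windows, stride 2), which is the dimensionality reduction the abstract refers to; they differ only in how the single output value is chosen from each window.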
