Abstract

Deep neural networks are widely used to automatically recognize multi-modal conversational speech, with significant improvements in accuracy. In particular, Convolutional Neural Networks have recently achieved state-of-the-art performance in Automatic Speech Recognition (ASR), most notably for English; the Hindi language, however, remains under-explored in ASR systems. This article presents a three-layer two-dimensional Sequential Convolutional neural architecture. The Sequential Conv2D model is an end-to-end system that can simultaneously exploit the spectral and temporal structure of the speech signal. The network was trained and tested on different cepstral features: Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Bark-Frequency Cepstral Coefficients (BFCC), and spectrogram representations of the speech signal. Experiments were performed on two low-resourced speech-command datasets: a Hindi set of 27,145 spoken keywords developed by TIFR, and 23,664 one-second English speech-command utterances from the Google TensorFlow and AIY English Speech Commands dataset. The results show that the convolutional layers trained on spectrograms achieve the best performance for English speech, with 91.60% accuracy, compared with the other cepstral feature sets. For Hindi spoken words, the model achieved an accuracy of 69.65%, with BFCC features outperforming spectrogram features.
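As an illustration of the kind of architecture described above, the following is a minimal sketch of a three-layer Sequential Conv2D keyword classifier in Keras. The input shape, layer widths, and number of output classes are illustrative assumptions (the abstract does not give the exact values), and the 2-D input is assumed to be a fixed-size cepstral or spectrogram feature map (e.g., MFCC, GFCC, BFCC, or log-spectrogram frames).

```python
# Hypothetical sketch of a three-layer Sequential Conv2D speech-command classifier.
# Shapes and hyperparameters below are assumptions, not the paper's exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 30           # assumed number of speech-command keywords
INPUT_SHAPE = (98, 40, 1)  # assumed (time frames, feature bins, channels)

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # Conv layer 1: local spectro-temporal patterns
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Conv layer 2: mid-level feature combinations
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Conv layer 3: higher-level keyword-discriminative features
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In such a setup, each cepstral feature set (MFCC, GFCC, BFCC, or spectrogram) would be precomputed from the raw one-second utterances and fed to the same network, so that accuracy differences can be attributed to the input representation rather than the architecture.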
