Abstract

Deep neural networks are widely used to automatically recognize multi-modal conversational speech, with significant improvements in accuracy. In particular, Convolutional Neural Networks have recently achieved state-of-the-art performance in Automatic Speech Recognition (ASR), most notably for English; the Hindi language, however, remains under-explored in ASR systems. This article presents a three-layer two-dimensional Sequential Convolutional neural architecture. The Sequential Conv2D model is an end-to-end system that can simultaneously exploit the spectral and temporal structure of the speech signal. The network was trained and tested on different cepstral features: Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Bark-Frequency Cepstral Coefficients (BFCC), and spectrogram representations of the speech signal. Experiments were performed on two low-resourced speech-command datasets: a Hindi set of 27,145 spoken keywords developed by TIFR, and 23,664 one-second English speech-command utterances from the Google TensorFlow and AIY English Speech Commands dataset. The results show that the convolutional layers trained on spectrograms achieve the best performance for English speech, with 91.60% accuracy, compared with the other cepstral feature sets. For Hindi spoken words, the model achieved an accuracy of 69.65%, with BFCC features outperforming spectrogram features.
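As an illustration of the kind of architecture described above, the following is a minimal sketch of a three-layer Sequential Conv2D keyword classifier in Keras. The input shape, layer widths, and number of output classes are illustrative assumptions (the abstract does not give the exact values), and the 2-D input is assumed to be a fixed-size cepstral or spectrogram feature map (e.g., MFCC, GFCC, BFCC, or log-spectrogram frames).

```python
# Hypothetical sketch of a three-layer Sequential Conv2D speech-command classifier.
# Shapes and hyperparameters below are assumptions, not the paper's exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 30           # assumed number of speech-command keywords
INPUT_SHAPE = (98, 40, 1)  # assumed (time frames, feature bins, channels)

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # Conv layer 1: local spectro-temporal patterns
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Conv layer 2: mid-level feature combinations
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Conv layer 3: higher-level keyword-discriminative features
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In such a setup, each cepstral feature set (MFCC, GFCC, BFCC, or spectrogram) would be precomputed from the raw one-second utterances and fed to the same network, so that accuracy differences can be attributed to the input representation rather than the architecture.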
