Abstract

Convolutional neural networks (CNNs) have been shown to achieve state-of-the-art performance in computer vision. They have also become increasingly popular in speech recognition and other natural language processing tasks. In this study, we aim to design a lightweight CNN architecture for under-resourced end-to-end speech recognition. We present a carefully designed 1-dimensional deep convolutional neural network architecture that achieves accuracy sufficient for it to be cascaded with spoken content retrieval systems. We explore the use of CNNs with Connectionist Temporal Classification (CTC) under resource-constrained conditions. We also show that an end-to-end system can reach its best decoding result while keeping the number of network parameters and the computation time to a minimum. The paper presents results for an Amharic syllable-based end-to-end speech recognition system that implements the designed model. The architecture is trained and evaluated on ≈70 hours of Amharic read speech, audiobooks, and multi-genre radio programs. On the development set, we report a character error rate of 12.60% and a syllable error rate of 27.28% without an integrated language model. On the test set, a character error rate of 18.38% and a syllable error rate of 27.71% are reached.
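For readers unfamiliar with the two metrics, the following is a minimal sketch of how a character or syllable error rate is commonly computed: the Levenshtein (edit) distance between the reference and the hypothesis, divided by the reference length. The function names `edit_distance` and `error_rate` and the example tokens are illustrative assumptions, not the scoring code used in the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the first i-1 reference units
    # and the first j hypothesis units.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference units into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by reference length (assumed definition)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Character error rate: pass strings, so each character is one unit.
cer = error_rate("kitten", "sitting")

# Syllable error rate: pass lists of syllable tokens (hypothetical tokens).
ser = error_rate(["se", "lam", "ta"], ["se", "lam"])
```

The same routine serves both metrics because only the unit of comparison changes: characters in one case, syllable tokens in the other.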
