We present automatic speech recognition (ASR) systems for Tamil and Kannada based on subword modeling, designed to handle the unlimited vocabulary that arises from the highly agglutinative nature of these languages. We propose a variant of the byte pair encoding (BPE) algorithm, which we call extended-BPE, and also use the Morfessor tool to segment each word into subwords. We incorporate maximum likelihood and Viterbi estimation techniques within the weighted finite-state transducer (WFST) framework in these algorithms to learn a subword dictionary from a large text corpus. Using the learned subword dictionary, the words in the training transcriptions are segmented into subwords. We then train deep neural network ASR systems that recognize the subword sequence for any given test speech utterance. The output subword sequence is post-processed using deterministic rules to obtain the final word sequence. Because of this subword design, the number of words that can be recognized is much larger than the number of words in the training corpus. For Tamil ASR, we use 152 hours of data for training and 65 hours for testing, whereas for Kannada ASR, we use 275 hours for training and 72 hours for testing. Experimenting with different combinations of segmentation and estimation techniques, we find that the word error rate (WER) is substantially lower than that of the baseline word-level ASR, with a maximum absolute WER reduction of 6.24% for Tamil and 6.63% for Kannada. Further, comparing our results with those of an end-to-end ASR model available on GitHub, we find that our subword language models perform comparably to, or better than, recent end-to-end ASR models for Tamil and Kannada.
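To make the subword pipeline concrete, the following is a minimal illustrative sketch of the two word-level operations the abstract describes: segmenting a word into dictionary subwords, and the deterministic post-processing that rejoins a recognized subword sequence into words. The greedy longest-match strategy, the "+" continuation marker, and the toy dictionary are assumptions for illustration only; they are not the paper's extended-BPE or Morfessor algorithms.

```python
# Illustrative sketch only. Assumes a learned subword dictionary and uses a
# '+' marker on word-internal subwords (an assumption, not the paper's scheme)
# so that the word sequence can be recovered deterministically.

def segment(word, subwords, max_len=10):
    """Split a word into subwords by greedy longest match against the
    dictionary, falling back to single characters when nothing matches.
    Every piece except the last carries a '+' continuation marker."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return [p + "+" for p in pieces[:-1]] + pieces[-1:]

def join(subword_seq):
    """Deterministic post-processing: merge '+'-marked subwords back
    into whole words."""
    words, current = [], ""
    for piece in subword_seq:
        if piece.endswith("+"):
            current += piece[:-1]     # word-internal piece: keep accumulating
        else:
            words.append(current + piece)  # word-final piece: emit the word
            current = ""
    return words

# Toy example with a hypothetical dictionary (Latin transliteration).
subwords = {"vana", "kkam"}
seq = segment("vanakkam", subwords)
print(seq, "->", join(seq))  # ['vana+', 'kkam'] -> ['vanakkam']
```

In an actual ASR system, the recognizer emits the subword sequence directly, and only the `join` step runs at the output; the segmentation step is applied offline to the training transcriptions.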