CTC-Based End-To-End ASR for the Low Resource Sanskrit Language with Spectrogram Augmentation

Anoop C S, A G Ramakrishnan

doi:10.1109/ncc52529.2021.9530162

Abstract

Sanskrit is one of the Indian languages that fares poorly with regard to the development of language-based tools. In this work, we build a connectionist temporal classification (CTC) based end-to-end large vocabulary continuous speech recognition system for Sanskrit. To our knowledge, this is the first time an end-to-end framework has been used for automatic speech recognition in Sanskrit. A Sanskrit speech corpus with around 5.5 hours of speech data is used for training a neural network with a CTC objective. 80-dimensional mel-spectrograms, together with their delta and delta-delta features, are used as the input features. Spectrogram augmentation techniques are used to effectively increase the amount of training data. The trained CTC acoustic model is assessed in terms of character error rate (CER) with greedy decoding. Weighted finite-state transducer (WFST) decoding is used to obtain the word-level transcriptions from the character-level probability distributions at the output of the CTC network. The decoder WFST, which maps the CTC output characters to the words in the lexicon, is constructed by composing three individual finite-state transducers (FSTs), namely token, lexicon and grammar. Trigram models trained on a text corpus of 262338 sentences are used for language modeling in the grammar FST. The system achieves a word error rate (WER) of 7.64% and a sentence error rate (SER) of 32.44% on the Sanskrit test set of 558 utterances with spectrogram augmentation and WFST decoding. Spectrogram augmentation provides an absolute improvement of 13.86% in WER.
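As an illustration of the greedy decoding and CER evaluation mentioned in the abstract, the following minimal Python sketch picks the most likely symbol in each frame, merges consecutive repeats, removes CTC blanks, and scores the result with a character-level edit distance. It is not the authors' implementation: the function names, the toy three-symbol vocabulary, and the choice of index 0 for the blank symbol are assumptions made only for this example.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def greedy_ctc_decode(frame_probs: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Take the argmax label per frame, merge repeats, then drop blanks."""
    decoded: List[int] = []
    previous = None
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        # Emit a label only when it differs from the previous frame's label
        # and is not the blank symbol.
        if best != previous and best != blank:
            decoded.append(best)
        previous = best
    return decoded


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution or match
            )
    return row[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Toy vocabulary (hypothetical): 0 = blank, 1 = 'a', 2 = 'b'.
VOCAB = {1: "a", 2: "b"}
frames = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.7, 0.2],  # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.2, 0.6, 0.2],  # a
    [0.1, 0.1, 0.8],  # b
    [0.1, 0.2, 0.7],  # b (repeat, merged)
    [0.8, 0.1, 0.1],  # blank
]
labels = greedy_ctc_decode(frames)
hypothesis = "".join(VOCAB[i] for i in labels)
```

In this toy input the blank frame between the two runs of `a` is what allows a doubled character to survive the merge step. WER and SER follow the same pattern as CER, with the edit distance computed over words and the error counted per utterance, respectively.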

Full Text