Most Indian languages are spoken in units of syllables. However, speech recognition systems developed so far for Indian languages have generally used characters or phonemes as modeling units. This work evaluates the performance of syllable-based modeling units in end-to-end speech recognition for several Indian languages. The text is represented in three forms: native script, Sanskrit Library Phonetic (SLP1) encoding, and syllables. Each form is tokenized into subword units using character-level tokenization, byte-pair encoding (BPE), or unigram language modeling (ULM). The performance of these tokens is compared in monolingual training and in cross-lingual transfer learning. Syllable-based BPE/ULM subword units give promising results in the monolingual setup, provided the dataset is diverse enough to represent the syllable distribution of the language. On the Vāksañcayaḥ dataset in Sanskrit, syllable-BPE tokens achieve state-of-the-art results. The capability of syllable-BPE units to complement SLP1-character models through a pretraining–finetuning setup is also evaluated. For Sanskrit, syllable-BPE trained directly achieves better word error rates (WER) than the pretraining–finetuning approaches. For Tamil and Telugu, the two approaches yield comparable WERs. For cross-lingual transfer learning, SLP1-character units are largely better than syllable-based units.
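To make the difference between character units and syllable units concrete, the following is a minimal sketch that groups a Devanagari word into orthographic syllables (aksharas). It is an illustrative assumption, not the syllabification procedure used in this work: the function name `split_aksharas` is hypothetical, the rule relies only on Unicode general categories and the virama, and it does not handle zero-width joiners or phonetic resyllabification.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama, which joins consonants into a cluster


def split_aksharas(word: str) -> list[str]:
    """Group the code points of one Devanagari word into orthographic syllables.

    Assumed rule (illustrative only): a combining mark (vowel sign, virama,
    anusvara, visarga) attaches to the current unit, and a letter that follows
    a virama continues the same consonant cluster. Any other letter starts a
    new unit.
    """
    units: list[str] = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if units and (is_mark or units[-1].endswith(VIRAMA)):
            units[-1] += ch
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    word = "भारत"
    print(list(word))            # character-level units (code points)
    print(split_aksharas(word))  # syllable-level units
```

For the word भारत, the character view has four code points while the syllable view has three units (भा, र, त). A BPE or ULM tokenizer trained on syllable sequences would then merge these units rather than individual characters.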