Abstract

The two key components of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems are language modeling and acoustic modeling. Language modeling involves generating a lexicon, a pronunciation dictionary that maps words to phoneme sequences. A lexicon can be built using a variety of approaches. For low-resource languages, rule-based methods are typically employed; however, because the corpus is often small, this approach does not capture all possible pronunciation variations. Low-resource languages such as Malayalam therefore need a method for building a comprehensive lexicon as the corpus grows. In this work, we explored deep-learning-based encoder-decoder models for grapheme-to-phoneme (G2P) conversion in Malayalam. The encoder was built using Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models with varying embedding dimensions. The performance of the deep learning models used for G2P conversion was measured using the Word Error Rate (WER) and Phoneme Error Rate (PER). With an embedding dimension of 1024, the BiLSTM encoder achieved the highest phoneme-level accuracy of 98.04% with the lowest PER of 2.57%, and the highest word-level accuracy of 90.58% with the lowest WER of 9.42%.

Keywords: Lexicon, G2P, Language modeling, BiLSTM, LSTM, Encoder-decoder architecture
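The abstract evaluates G2P output with PER and WER. As a minimal sketch (not the authors' evaluation code), PER is commonly computed as the Levenshtein edit distance between predicted and reference phoneme sequences, normalized by the number of reference phonemes, while word-level accuracy counts a word as correct only when its entire phoneme sequence matches; the function names and example phoneme lists below are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    # Dynamic-programming Levenshtein distance over token sequences.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def phoneme_error_rate(refs, hyps):
    # PER = total phoneme-level edits / total reference phonemes.
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return edits / total

def word_error_rate(refs, hyps):
    # At the word level, a prediction is wrong if any phoneme differs.
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)

# Toy example with two words of three phonemes each:
refs = [["k", "a", "t"], ["d", "o", "g"]]
hyps = [["k", "a", "t"], ["d", "a", "g"]]  # one substituted phoneme
```

On this toy pair, one substitution out of six reference phonemes gives a PER of about 16.7%, and one wrong word out of two gives a WER of 50%, which illustrates why word-level error rates are always at least as high as phoneme-level ones.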
