Abstract
Out-of-vocabulary (OOV) words are among the most challenging problems in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification (CTC) with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition (AASR) system using phoneme-based subword units. The syllabification algorithm inserts the epenthetic vowel እ[ɨ], which is not handled by our grapheme-to-phoneme (G2P) conversion algorithm, developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to yield more accurate speech recognition than their character-based, phoneme-based, and character-based subword counterparts. Further improvement was obtained by combining the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to character-based acoustic modeling with the word-based recurrent neural network language modeling (RNNLM) baseline. These phoneme-based subword models are also useful for improving machine and speech translation tasks.
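To make the idea of BPE-generated phoneme-based subwords concrete, here is a minimal sketch of learning and applying BPE merges over phoneme sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the toy phoneme symbols, and the tie-breaking rule are all hypothetical, and a real system would use the output of the G2P and syllabification steps as input.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])  # concatenate the two units
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from words given as lists of phoneme symbols."""
    corpus = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single unit
        # Most frequent pair; ties broken by the pair itself (an assumption,
        # chosen only to make the sketch deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Segment one word (a list of phoneme symbols) by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy phoneme sequences, invented for illustration (not taken from the paper's data).
toy_words = [
    ["s", "a", "l", "a", "m"],
    ["s", "a", "l", "a", "m"],
    ["s", "a", "b", "a"],
    ["l", "a", "m"],
]
merges = learn_bpe(toy_words, num_merges=3)
print(merges)
print(segment(["s", "a", "l", "a", "m"], merges))
```

The number of merges controls the subword vocabulary size: fewer merges leave units close to single phonemes, while more merges produce longer, word-like units. Because any unseen word can still be decomposed into known phonemes and merged units, this kind of segmentation is one way to reduce the OOV problem described above.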
Highlights
Conventional automatic speech recognition (ASR) systems based on hidden Markov models (HMMs) and deep neural networks (DNNs) require a separately prepared lexicon, acoustic models, and language models, which complicates system building [1]
We trained our end‐to‐end connectionist temporal classification (CTC)‐attention ASR system using various vocabulary sizes. These vocabulary sizes were selected based on frequently occurring Amharic words
…vocabularies, which were based on the most frequently occurring words. These results showed that the performance of the phoneme‐based CTC‐attention method was significantly better than that of the character‐based method because the latter is supported by pronunciation‐
Summary
Conventional automatic speech recognition (ASR) systems based on hidden Markov models (HMMs) and deep neural networks (DNNs) require a separately prepared lexicon, acoustic models, and language models, which complicates system building [1]. These approaches require linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context dependencies [2]. End‐to‐end ASR has grown to be a popular alternative that simplifies the conventional ASR model building process. The end‐to‐end ASR system directly transcribes an input sequence of acous‐