Armenian Speech Recognition System: Acoustic and Language Models

Varuzhan H Baghdasaryan

doi:10.51542/ijscia.v3i5.7

Abstract

Automatic speech recognition (ASR) is now an important task for machines. Applications such as speech translation, virtual assistants, and voice bots use ASR to understand human speech. Most research and available models target widely used languages such as English, German, French, Chinese, and Spanish. This paper presents an Armenian speech recognition system. The research produced acoustic and language models for the Armenian language; modern ASR systems combine the two to achieve higher accuracy. The acoustic model was trained with Baidu's RNN-based Deep Speech neural network, and the probabilistic language model was trained with the KenLM toolkit. The acoustic model was trained and validated on the ArmSpeech Armenian native speech corpus using transfer learning and data augmentation, and tested on the Common Voice Armenian database. The language model was built from texts scraped from Armenian news websites. The final models are small and can perform real-time speech-to-text on IoT devices. On the Common Voice Armenian database, the model achieved a word error rate (WER) of 0.902565 and a character error rate (CER) of 0.305321 without the language model, and 0.552975 WER and 0.285904 CER with the language model. The paper describes the environment setup, data collection, and the acoustic and language model training processes, as well as the final results and benchmarks.
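
The WER and CER figures reported above are conventionally defined as the Levenshtein edit distance between the reference transcript and the recognized text, counted over words or characters respectively, divided by the length of the reference. The abstract does not say which scoring script was used, so the following is only a minimal sketch of those standard definitions; the function names `edit_distance`, `wer`, and `cer` are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under these definitions a single wrong character makes the whole word count as an error, which is one plausible reason a system can show a much higher WER than CER, as in the results above.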

Full Text