ArmSpeech: Armenian Spoken Language Corpus

Varuzhan H Baghdasaryan

doi:10.51542/ijscia.v3i3.25

Abstract

The Armenian language is an independent branch of the Indo-European language family and the official language of the Republic of Armenia and the Republic of Artsakh. According to various reliable sources, approximately 3 million people in Armenia and 10-12 million people in the Armenian Diaspora speak Armenian as their native language. The largest communities outside Armenia are in the United States of America, Canada, the Russian Federation, the Islamic Republic of Iran, the French Republic, the Syrian Arab Republic and the Lebanese Republic. This paper presents the ArmSpeech speech corpus, a collection of annotated Armenian speech intended for research and development of natural language processing (NLP) technologies. ArmSpeech is designed for speech-to-text and text-to-speech tasks but can also be used in other domains (e.g. language identification). The corpus contains 6,206 high-quality audio samples totalling 11 hours 46 minutes and 26 seconds (11.77 hours) of annotated native Armenian speech from multiple speakers of various ages, genders and accents. According to the results of this research, it is the most extensive Armenian speech corpus in the public domain for speech recognition, speech synthesis and spoken language identification systems.
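
As a quick sanity check on the figures above, the short sketch below converts the stated duration to hours and derives a mean sample length. The mean is not reported in the abstract; it is computed here only from the quoted sample count and total duration, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
NUM_SAMPLES = 6206
HOURS, MINUTES, SECONDS = 11, 46, 26

# Total duration in seconds and in decimal hours.
total_seconds = HOURS * 3600 + MINUTES * 60 + SECONDS
total_hours = total_seconds / 3600

# Derived value (an assumption-free average, but not stated in the paper):
# mean length of one audio sample in seconds.
mean_clip_seconds = total_seconds / NUM_SAMPLES

print(f"total: {total_seconds} s = {total_hours:.2f} h")
print(f"mean sample length: {mean_clip_seconds:.2f} s")
```

The conversion reproduces the 11.77 hours given in the abstract and suggests that a typical sample is a short utterance of roughly seven seconds.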

Full Text