The first ArmSpeech paper presented an annotated native Armenian speech corpus, together with its data collection, preprocessing and annotation processes, corpus structure and statistics. ArmSpeech was created mainly to expand the resources available for Armenian language research because, according to research, there are no free or paid Armenian speech corpora for speech-to-text, text-to-speech and language research. From a natural language processing (NLP) perspective, Armenian is a low-resource language, even though it is an independent branch of the Indo-European language family and the native language of 12-15 million people. The ArmSpeech corpus can be used in NLP research. The first release of the corpus mainly contains audio clips extracted from free-to-use audiobooks: 6206 clips with a total duration of 11.77 hours, recorded by multiple speakers of varying age, gender and accent. This paper presents the extended version of ArmSpeech, which continues the previous work, contains annotated Armenian speech, and was recorded on the principle of voluntary voice donation. The paper also describes the data collection, preprocessing, recording and annotation stages, as well as the final results and statistics of the corpus. The text material for the recordings was collected from articles on Armenian news websites covering lifestyle, culture, sport and politics. The recordings were made by 1 female and 3 male volunteers whose native language is Armenian. The second release adds approximately 4 hours of data, bringing the total duration of the ArmSpeech corpus, together with the first release, to 15.7 hours.