Speech data collection system for Kazakh language

Darkhan Kuanyshbay, Arailym Kuanyshbayeva, Yedilkhan Amirgaliyev, Olimzhon Baimuratov

doi:10.1109/icecco53203.2021.9663771

Abstract

For most low-resource languages, speech data does not exist at all. Producing speech corpora for them is therefore very challenging and requires a tremendous amount of time. Kazakh, owing to its lack of popularity, is considered a low-resource language. This paper provides an overview of several data collection techniques and notes the issues associated with each method. Its main aim is to present a crowdsourcing web-based tool called “Kazakh recorder”, which is accessible on the website and designed to make collecting Kazakh speech data more convenient and faster. The paper also provides statistics on the people who contributed to the speech data (age, gender, number of sentences). Using this tool, we have collected over 50 hours of speech data from 65 different native speakers, each of whom pronounced on average 500 sentences in Kazakh.

Full Text