Abstract

We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants of both genders from different regions and age groups. The corpus was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures and then the database specifications. We also share our experience and the challenges faced during database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results suggest that the quality of the audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable reproducibility and make the corpus easier to use, we also released an ESPnet recipe for our speech recognition models.
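For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the edit distance between the reference transcript and the recognizer output, divided by the reference length. The function names and the toy English strings are illustrative assumptions, not part of the released ESPnet recipe, and the handling of spaces in CER (counted as characters here) is also an assumption; scoring tools differ on this point.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edit errors / total reference units.

    unit="word" gives WER, unit="char" gives CER (spaces counted, by assumption).
    """
    total_errors = 0
    total_units = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            r, h = list(ref), list(hyp)
        total_errors += edit_distance(r, h)
        total_units += len(r)
    return total_errors / total_units


# Toy example (hypothetical transcripts, not taken from the KSC)
refs = ["the cat sat"]
hyps = ["the cat sit"]
wer = error_rate(refs, hyps, unit="word")  # 1 wrong word out of 3
cer = error_rate(refs, hyps, unit="char")  # 1 wrong character out of 11
```

Because a single wrong character makes the whole word wrong, WER is typically higher than CER on the same output, which is consistent with the 8.7% versus 2.8% figures reported.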

Highlights

  • We present an open-source Kazakh speech corpus (KSC) constructed to advance the development of speech and language processing applications for the Kazakh language

  • During the Soviet period, the Kazakh language was overwhelmed by the Russian language, which caused a decline in Kazakh language usage (Dave, 2007)

  • In the 1990s, Kazakh was declared an official language of Kazakhstan, and many initiatives were launched to increase the number of Kazakh speakers

Introduction

We present an open-source Kazakh speech corpus (KSC) constructed to advance the development of speech and language processing applications for the Kazakh language. Kazakh is an agglutinative language with vowel harmony and belongs to the family of Turkic languages. In the 1990s, it was declared an official language of Kazakhstan, and many initiatives were launched to increase the number of Kazakh speakers. Today, it is spoken by over 10 million people in Kazakhstan and by over 3 million people in other countries. By introducing the KSC, we aim to accelerate the adoption of the Kazakh language in Internet of Things (IoT) technologies and to promote research in Kazakh speech processing applications.
