Voicer: A Crowd Sourcing Tool for Speech Data Collection

Darshana Buddhika, Ranula Liyadipita, Sudeepa Nadeeshan, Uthayasanker Thayasivam, Hasini Witharana, Sanath Jayasena

doi:10.1109/icter.2018.8615521

Abstract

Speech corpora do not exist for most low-resource languages, and creating one for such a language is challenging and demands significant time and effort. This paper gives an overview of related data collection strategies and highlights a few issues prevalent in existing approaches. Its objectives are threefold. First, it introduces “Voicer”, an open-source tool accessible from both handheld devices and computers, which can be used to collect speech data for a specific domain in a short span of time, irrespective of the language. Second, it demonstrates the tool by using it to build a Sinhala speech corpus consisting of 10 hours of speech data for 39 different sentences in the banking domain. Finally, it provides a framework for evaluating a speech corpus, along with lessons learned during data collection, with a view to contributing to future research.

Full Text