LOTUS-SOC: A social media speech corpus for Thai LVCSR in noisy environments

Patcharika Chootrakool, Chai Wutiwiwatchai, Vataya Chunwijitra, Sawit Kasuriya, Phuttapong Sertsi

doi:10.1109/icsda.2016.7919017


https://doi.org/10.1109/icsda.2016.7919017


Abstract


A Large vOcabulary Thai continUous Speech SOCial media corpus (LOTUS-SOC) has been under development since 2015. Twitter messages were selected as the source text, and speakers were recorded reading them through a mobile application. To date, 172 hours of speech from 208 speakers have been recorded, and recordings from a further 192 speakers are in progress to reach the target of 400 speakers. The data are designed to be balanced across gender and 8 types of noise condition. This paper describes the corpus design and development process in detail. The corpus is intended for building a Thai large vocabulary continuous speech recognition (LVCSR) system that copes better with spoken-style input speech in various noisy environments. To assess the corpus, several kinds of Thai LVCSR systems were built. Evaluations show that systems trained additionally on LOTUS-SOC are more robust to noisy environments. With the best setting and training method, the GMM-based and DNN-based systems achieve word error rates of 35.2% and 17.1%, respectively.
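For readers unfamiliar with the metric quoted above, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation tooling. The function name `word_error_rate` and the placeholder tokens are hypothetical, and the sketch assumes both transcripts have already been segmented into words, a step that matters for Thai because the written language does not mark word boundaries.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of already-segmented words (an assumption;
    the abstract does not say how the transcripts were tokenized).
    """
    if not reference:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Placeholder tokens: one substitution ("b" -> "x") and one deletion ("d").
reference_words = ["a", "b", "c", "d"]
hypothesis_words = ["a", "x", "c"]
print(word_error_rate(reference_words, hypothesis_words))  # 0.5
```

Under this definition, a reported rate of 17.1% corresponds to roughly 17 word-level edits per 100 reference words. The value can exceed 100% when the hypothesis contains many insertions.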
