Abstract

The Large vOcabulary Thai continUous Speech - SOCial media corpus (LOTUS-SOC) has been under development since 2015. Twitter messages were selected as the source text, and speakers were recorded reading them through a mobile application. To date, 172 hours of speech from 208 speakers have been recorded, and recording of a further 192 speakers is under way to reach a total of 400 speakers. The data are designed to be balanced across gender and eight types of noise condition. This paper describes the corpus design and development process in detail. The corpus is intended for building a Thai large vocabulary continuous speech recognizer (LVCSR) that copes better with spoken-style input speech in various noisy environments. To assess the corpus, several kinds of Thai LVCSR systems were built. Evaluations show that systems additionally trained on LOTUS-SOC are more robust to noisy environments. With the best setting and training method, the GMM-based and DNN-based systems achieve word error rates of 35.2% and 17.1%, respectively.
