Abstract

This paper presents an efficient method for building a corpus to train natural language understanding (NLU) modules. Conventional corpus creation methods follow a common cycle: a subject is given a specific situation in which they operate a device by voice, and then speaks one utterance to execute the task. Building a large-scale corpus this way requires many subjects, which increases lead time and financial cost. To solve this problem, we propose incorporating a “probing question” into the cycle. Specifically, after a subject speaks one utterance, the subject is asked to think of alternative utterances that execute the same task. In this way, we obtain many utterances from a small number of subjects. An evaluation of the proposed method applied to interview-based corpus creation shows that it reduces the number of subjects by 41% while maintaining the morphological diversity of the corpus and its morphological coverage of user utterances spoken to commercial devices. It also shows that the proposed method reduces the total time for interviewing subjects by 36% compared with the conventional method. We conclude that the proposed method can be used to build a useful corpus while reducing lead time and financial cost.
