Abstract

Recently, spoken keyword detection (SKD) systems, which listen to live audio and try to capture a user's utterances containing specific keywords, have been extensively studied in order to realize a truly usable hands-free speech interface in daily life: "Okay Google" in Google products, "Hey Siri" on Apple products, and "Alexa" on Amazon Alexa / Amazon Echo. Since these keyword detectors are typically built from a large number of actually spoken keywords, the keyword is irreplaceable, and users of such systems are forced to speak only the predefined keyword. On the other hand, an SKD method based on a keyword-filler model, using a generic phoneme model and a garbage filler sequence model, is promising in that the acoustic pattern of the keyword can be given as a phoneme sequence; the method is therefore task-independent, and anyone can use his or her own keyword. In this study, an improvement of the latter method is investigated. A recurrent neural network language model (RNNLM) is introduced as a linguistic constraint for both filler-filler and filler-keyword transitions instead of an N-gram, and experimental results on actual spoken data for a spoken dialogue system show that our method improves keyword detection performance.
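The keyword-filler idea described above can be illustrated with a minimal sketch: score the keyword's phoneme sequence against a generic filler (garbage) path and accept when the log-likelihood ratio clears a threshold. The frame posteriors, phoneme labels, and threshold below are toy assumptions, not values from the paper, and the one-phoneme-per-frame alignment is a simplification of the actual keyword-filler decoding.

```python
import math

# Hypothetical per-frame phoneme posteriors (toy values), standing in
# for the output of a generic phoneme acoustic model.
frames = [
    {"h": 0.7, "e": 0.1, "l": 0.1, "o": 0.1},
    {"h": 0.1, "e": 0.7, "l": 0.1, "o": 0.1},
    {"h": 0.1, "e": 0.1, "l": 0.7, "o": 0.1},
    {"h": 0.1, "e": 0.1, "l": 0.1, "o": 0.7},
]

def keyword_score(frames, keyword):
    """Log-likelihood of forcing the keyword's phoneme sequence
    through the frames (one phoneme per frame, for simplicity)."""
    return sum(math.log(f[p]) for f, p in zip(frames, keyword))

def filler_score(frames):
    """Log-likelihood of the best phoneme at each frame -- a stand-in
    for a generic garbage filler model that matches anything."""
    return sum(math.log(max(f.values())) for f in frames)

def detect(frames, keyword, threshold=-0.5):
    # Detect the keyword when the log-likelihood ratio between the
    # keyword path and the filler path exceeds the threshold.
    llr = keyword_score(frames, keyword) - filler_score(frames)
    return llr >= threshold

print(detect(frames, ["h", "e", "l", "o"]))  # toy posteriors match the keyword
print(detect(frames, ["o", "l", "e", "h"]))  # reversed sequence scores poorly
```

Because only the phoneme sequence is needed, swapping in a different keyword requires no keyword-specific training data, which is the task-independence the abstract refers to; the paper's contribution replaces the N-gram constraint on filler and keyword transitions with an RNNLM, which this sketch does not model.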
