Abstract

With the advent of the recurrent neural network transducer (RNN-T) model, the performance of keyword spotting (KWS) systems has greatly improved. However, KWS systems employed for wake-word detection still rely on the availability of keyword-specific training data to achieve reasonable performance on each keyword. To improve KWS performance for low-resource keywords without having to collect additional natural speech data, we explore text-to-speech (TTS) technology to synthetically generate training data for such keywords. Using an RNN-T based KWS model, already well trained on a large keyword-independent natural speech dataset, as a seed model, we run adaptation experiments with the generated keyword-specific TTS data. Besides observing a considerable improvement in overall performance for the low-resource keywords, we find that, as with natural speech data, the improvement from TTS-generated training data depends on speaker diversity, the amount of data per speaker, and data simulation. We obtain additional improvement by selectively adapting specific parts of the RNN-T model, and we gain key insights into the different architectural constructs of the RNN-T model.

