Abstract
With the advent of the recurrent neural network transducer (RNN-T) model, the performance of keyword spotting (KWS) systems has improved greatly. However, KWS systems employed for wake-word detection still rely on keyword-specific training data to achieve reasonable performance on each keyword. To improve KWS performance for keywords that lack such data, without having to collect additional natural speech data, we explore text-to-speech (TTS) technology to synthetically generate training data for them. Using an RNN-T based KWS model, already well trained on a large keyword-independent natural speech dataset, as a seed model, we run adaptation experiments with the generated keyword-specific TTS data. Besides observing a considerable improvement in overall performance for the low-resource keywords, we find that the improvement from TTS-generated training data, as with natural speech data, depends on speaker diversity, the amount of data per speaker, and data simulation. We obtain additional improvement by selectively adapting specific parts of the RNN-T model, and we gain key insights into the different architectural constructs of the RNN-T model.