Abstract

Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

Highlights

  • Interacting with machines via voice is not science fiction anymore

  • While query-by-example (QbE) keyword spotting (KWS) based on recurrent neural network (RNN) feature extraction (which is different from the approach outlined in Section II and requires a careful treatment of its specificities) is out of the scope of this paper, we have considered it pertinent to allude to it for the following twofold reason

  • We review and provide some criticism of the most common metrics considered in the field of KWS

Summary

Introduction

Speech technologies have become ubiquitous in today's society. A distinctive feature of voice assistants is that, in order to be used, they first have to be activated by means of a spoken wake-up word or keyword, thereby avoiding running far more computationally expensive automatic speech recognition (ASR) when it is not required [2]. Voice assistants deploy a technology called spoken keyword spotting (or simply keyword spotting), which can be understood as a subproblem of ASR [3]. Keyword spotting (KWS) can be defined as the task of identifying keywords in audio streams comprising speech. Apart from activating voice assistants, KWS has plenty of applications such as speech data mining, audio indexing and phone call routing [4].
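To make the activation idea concrete, the following is a minimal sketch of how a stream of per-frame keyword posteriors might be turned into wake-up events. It is an illustration under stated assumptions, not the method of any particular system reviewed here: the function name `detect_keyword`, the moving-average smoothing, the window length and the threshold value are all hypothetical choices, and the posteriors are assumed to come from some acoustic model that is not shown.

```python
from collections import deque
from typing import Iterable, Iterator


def detect_keyword(
    posteriors: Iterable[float],
    window: int = 3,        # hypothetical smoothing window, in frames
    threshold: float = 0.8  # hypothetical decision threshold
) -> Iterator[int]:
    """Yield frame indices where the smoothed keyword posterior rises
    to or above the threshold after having been below it."""
    recent = deque(maxlen=window)  # last `window` posteriors
    armed = True                   # re-armed once the score drops again
    for i, p in enumerate(posteriors):
        recent.append(p)
        smoothed = sum(recent) / len(recent)  # moving average
        if smoothed >= threshold:
            if armed:
                armed = False
                yield i
        else:
            armed = True


# Illustrative posteriors (made up for this sketch, not measured data).
frames = [0.1, 0.2, 0.9, 0.95, 0.9, 0.2, 0.1, 0.1]
print(list(detect_keyword(frames)))
```

In a voice assistant, a detection event of this kind would be the point at which the more expensive ASR is started, so that ASR does not have to run while no keyword has been spoken.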
