Abstract
Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices for purposes such as the activation of voice assistants. Prospects suggest sustained growth in the social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who continually seek to improve KWS performance and reduce computational complexity. This context motivates this paper, in which we conduct a literature review of deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview is comprehensive, covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
Highlights
Interacting with machines via voice is not science fiction anymore
While query-by-example (QbE) keyword spotting (KWS) based on recurrent neural network (RNN) feature extraction, which differs from the approach outlined in Section II and requires careful treatment of its specificities, is out of the scope of this paper, we have considered it pertinent to allude to it for the following twofold reason
We review and provide some criticism of the most common metrics considered in the field of KWS
Summary
Speech technologies have become ubiquitous in today's society. A distinctive feature of voice assistants is that, in order to be used, they first have to be activated by means of a spoken wake-up word or keyword, thereby avoiding running far more computationally expensive automatic speech recognition (ASR) when it is not required [2]. Voice assistants deploy a technology called spoken keyword spotting, or simply keyword spotting, which can be understood as a subproblem of ASR [3]. Keyword spotting (KWS) can be defined as the task of identifying keywords in audio streams comprising speech. Apart from activating voice assistants, KWS has plenty of applications such as speech data mining, audio indexing, phone call routing, etc. [4]