Abstract

The problem of voice activation is to find a pre-defined word in the audio stream. Solutions such as keyword spotter “Ok, Google” for Android devices or keyword spotter “Alexa” for Amazon devices use tens of thousands to millions of keyword examples in training. In this paper, we explore the possibility of using pre-trained audio features to build voice activation with a small number of keyword examples. The contribution of this article consists of two parts. First, we investigate the dependence of the quality of the voice activation system on the number of examples in training for English and Russian and show that the use of pre-trained audio features, such as wav2vec, increases the accuracy of the system by up to 10% if only seven examples are available for each keyword during training. At the same time, the benefits of such features become less and disappear as the dataset size increases. Secondly, we prepare and provide for general use a dataset for training and testing voice activation for the Lithuanian language. We also provide training results on this dataset.

Highlights

  • Voice activation systems solve the task of finding predefined keywords or keyphrases in an audio stream [1]

  • Since the task of formulating an algorithm for determining whether a keyphrase has been uttered in an audio stream is difficult to formulate, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem

  • The history of voice activation models has gone through several important stages in parallel with solving a more general problem of automatic speech recognition (ASR)

Read more

Summary

Introduction

Voice activation systems solve the task of finding predefined keywords or keyphrases in an audio stream [1]. This task has attracted both researchers and industry for decades. Since the task of formulating an algorithm for determining whether a keyphrase has been uttered in an audio stream is difficult to formulate, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem. The history of voice activation models has gone through several important stages in parallel with solving a more general problem of automatic speech recognition (ASR). Voice activation systems find applications in various areas: telephony [15], speech spoofing detection [16,17] We would like to highlight the following important moments: the beginning of the use of hidden Markov models back in 1989 [2], the use of neural networks since 1990 [3,4,5], the use of pattern matching approaches, in particular dynamic time wrapping (DTW) [6], building systems of voice activation for non-English languages such as Chinese [7], Japanese [8], and Iranian [9], publications describing voice activation systems in mass products [10,11,12,13], as well as publishing open datasets to compare different approaches [14].

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.