Unsupervised Pre-Training for Voice Activation

Aliaksei Kolesau, Dmitrij Šešok

doi:10.3390/app10238643

Abstract

The problem of voice activation is to find a pre-defined word in an audio stream. Solutions such as the “Ok, Google” keyword spotter for Android devices or the “Alexa” keyword spotter for Amazon devices use tens of thousands to millions of keyword examples in training. In this paper, we explore the possibility of using pre-trained audio features to build voice activation with a small number of keyword examples. The contribution of this article consists of two parts. First, we investigate how the quality of a voice activation system depends on the number of training examples for English and Russian, and show that the use of pre-trained audio features, such as wav2vec, increases the accuracy of the system by up to 10% if only seven examples are available for each keyword during training. At the same time, the benefits of such features diminish and eventually disappear as the dataset size increases. Second, we prepare and provide for general use a dataset for training and testing voice activation for the Lithuanian language. We also provide training results on this dataset.

Highlights

Voice activation systems solve the task of finding predefined keywords or keyphrases in an audio stream [1].
Because it is difficult to formulate an explicit algorithm that determines whether a keyphrase has been uttered in an audio stream, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem.
The history of voice activation models has passed through several important stages, in parallel with progress on the more general problem of automatic speech recognition (ASR).

Summary

Introduction

Voice activation systems solve the task of finding predefined keywords or keyphrases in an audio stream [1]. This task has attracted both researchers and industry for decades. Voice activation systems find applications in various areas: telephony [15], speech spoofing detection [16,17]. Because it is difficult to formulate an explicit algorithm that determines whether a keyphrase has been uttered in an audio stream, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem. The history of voice activation models has passed through several important stages, in parallel with progress on the more general problem of automatic speech recognition (ASR). We would like to highlight the following milestones: the first use of hidden Markov models in 1989 [2]; the use of neural networks since 1990 [3,4,5]; the use of pattern matching approaches, in particular dynamic time warping (DTW) [6]; the construction of voice activation systems for non-English languages such as Chinese [7], Japanese [8], and Iranian [9]; publications describing voice activation systems in mass-market products [10,11,12,13]; and the publication of open datasets for comparing different approaches [14].
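
To make the pattern matching approach mentioned above more concrete, the following is a minimal sketch of a dynamic time warping (DTW) distance between two sequences of feature vectors. It is an illustration only and not the method used in this paper: the function name `dtw_distance`, the Euclidean frame distance, and the toy one-dimensional "features" are all assumptions chosen for brevity.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Accumulated DTW cost between two sequences of feature vectors.

    Hypothetical helper for illustration; uses Euclidean distance between
    frames and the standard step pattern (match, insertion, deletion).
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # a advances, b stays on the same frame
                cost[i][j - 1],      # b advances, a stays on the same frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy one-dimensional "features" (assumed data, not from the paper).
template = [[0.0], [1.0], [2.0], [1.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0]]  # time-stretched template
other = [[2.0], [2.0], [0.0], [0.0]]                       # different pattern
```

In a template-matching keyword spotter of this kind, an incoming audio segment would be compared with stored keyword templates and accepted when the distance falls below a threshold. Here the time-stretched sequence aligns with the template at zero cost, while the unrelated sequence does not.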

Journal: Applied Sciences
Publication Date: Dec 3, 2020
Citations: 7
License type: CC BY 4.0
