Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution

Jingyong Hou, Lei Xie, Shilei Zhang

doi:10.1016/j.neunet.2022.03.003

Abstract

A keyword spotting (KWS) system running on smart devices should accurately detect the occurrences and predict the locations of predefined keywords in audio streams, with a small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method that combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor with a two-stage keyword detection and localization module. The MDTC feature extractor learns multi-scale feature representations efficiently with dilated depthwise temporal convolution, modeling both the temporal context and speech rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions, all of which take the current frame as the end position but have different start positions. These time regions (formally, anchors) indicate rough location candidates for keywords. With frame-level features from the MDTC feature extractor as inputs, the RPN learns to generate keyword region proposals based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we introduce a hard example mining algorithm to select effective negative anchors in RPN training. The keyword region proposals from the first-stage RPN contain keyword location information, which is subsequently used to explicitly extract keyword-related sequential features to train the second-stage KWS. The second-stage system learns to classify each region proposal into a keyword ID and to transform it toward the ground-truth keyword region. Experiments on the Google Speech Command dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors, with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset. Compared to a strong baseline, our proposed two-stage method achieves a 27–32% relative improvement in false rejection rate at one false alarm per hour, while for keyword localization, the two-stage approach achieves a mean intersection-over-union ratio above 0.95, which is clearly better than the one-stage RPN method.
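The anchor design and the intersection-over-union (IoU) measure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `anchors_at_frame` and `temporal_iou`, the inclusive frame-index convention, and the anchor lengths used below are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# A region is (start_frame, end_frame), both inclusive (assumed convention).
Region = Tuple[int, int]


def anchors_at_frame(t: int, lengths: List[int]) -> List[Region]:
    """Return anchors that all end at frame t but start at different frames."""
    anchors = []
    for length in lengths:
        start = t - length + 1
        if start >= 0:  # skip anchors that would begin before the stream starts
            anchors.append((start, t))
    return anchors


def temporal_iou(a: Region, b: Region) -> float:
    """Intersection-over-union of two inclusive frame intervals."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


# Hypothetical anchor lengths (in frames) and a hypothetical ground-truth region.
anchors = anchors_at_frame(99, [20, 40, 60, 80])
truth = (45, 99)
best = max(anchors, key=lambda a: temporal_iou(a, truth))
```

In this example the anchor spanning frames 40 to 99 overlaps the hypothetical ground-truth region most closely, so it would be the natural positive candidate for a proposal network to refine.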


Journal: Neural Networks
Publication Date: Mar 10, 2022
Citations: 6

Similar Papers

Customized Wake-Up Word with Key Word Spotting using Convolutional Neural Network
Tsung-Han Tsai ... Ping-Cheng Hao
06 Oct 2019

Region Proposal Network Based Small-Footprint Keyword Spotting
Jingyong Hou ... Mei-Yuh Hwang
IEEE Signal Processing Letters | VOL. 26
01 Oct 2019

Focal Loss for Region Proposal Network
Chengpeng Chen ... Shuqiang Jiang
01 Jan 2018

A $1.5\mu\mathrm{W}$ End-to-End Keyword Spotting SoC with Content-Adaptive Frame Sub-Sampling and Fast-Settling Analog Frontend
Ji-Hwan Seoi ... Heejin Yang
19 Feb 2023
