Abstract

A keyword spotting (KWS) system running on smart devices should accurately detect the appearances and predict the locations of predefined keywords from audio streams, with small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method which combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor and a two-stage keyword detection and localization module. The MDTC feature extractor learns multi-scale feature representation efficiently with dilated depthwise temporal convolution, modeling both the temporal context and the speech rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions, which all take the current frame as the end position but have different start positions. These time regions (or formally anchors) are used to indicate rough location candidates of keyword. With frame level features from the MDTC feature extractor as inputs, RPN learns to propose keyword region proposals based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we specifically introduce a hard example mining algorithm to select effective negative anchors in RPN training. The keyword region proposals from the first-stage RPN contain keyword location information which is subsequently used to explicitly extract keyword related sequential features to train the second-stage KWS. The second-stage system learns to classify and transform region proposal to keyword IDs and ground-truth keyword region respectively. Experiments on the Google Speech Command dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset. Compared to a strong baseline, our proposed two-stage method achieves relatively 27–32% better false rejection rate at one false alarm per hour, while for keyword localization, the two-stage approach achieves more than 0.95 mean intersection-over-union ratio, which is clearly better than the one-stage RPN method.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call