Keyword Spotting System Research Articles

The mismatches due to pitch, speaking rate, formant dispersion and ambient noise deteriorate the performance of an automatic keyword spotting (KWS) system. The work presented in this paper aims at reducing the aforementioned mismatches through front-end signal processing. In the proposed approach, the short-term magnitude spectra (ST-MS) are firstly computed with a smaller frameshift and then averaged over the adjacent frames to enhance the formant regions. The formant regions in the ST-MS have a higher magnitude than the nearing frequency regions. Consequently, temporal averaging of ST-MS over adjacent frames suppresses the high-frequency variation due to pitch, and formant dispersion. Furthermore, the formant peaks can be more accurately detected from the temporal averaged ST-MS when compared to detection from the original ST-MS. The Mel frequency cepstral coefficients (MFCC) computed from the temporally averaged magnitude spectra (TAS-MFCC) are pitch robust compared to the MFCC, and MFCC extracted from the reported spectral smoothing approaches employing variational mode decomposition (VMD-MFCC), pitch adaptive cepstral truncation (PACT-MFCC) and single-pole filter (SPS-MFCC). Performance of TA-MFCC feature in mismatched test condition is further improved by appending five logarithmically compressed resonant peaks at least separated by 400 Hz, here this feature is termed as TAS-MFCC-ARP. The spectral peaks mostly represent the formants in ST-MS. The performances of the deep neural network-hidden Markov model-based children’s KWS system reported in this work show that the TAS-MFCC-ARP provides a relative performance improvement of 103.83% compared to MFCC. The performance of the KWS system is further improved by data-augmented training through duration modification.

Read full abstract

A keyword spotting (KWS) system running on smart devices should accurately detect the appearances and predict the locations of predefined keywords from audio streams, with small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method which combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor and a two-stage keyword detection and localization module. The MDTC feature extractor learns multi-scale feature representation efficiently with dilated depthwise temporal convolution, modeling both the temporal context and the speech rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions, which all take the current frame as the end position but have different start positions. These time regions (or formally anchors) are used to indicate rough location candidates of keyword. With frame level features from the MDTC feature extractor as inputs, RPN learns to propose keyword region proposals based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we specifically introduce a hard example mining algorithm to select effective negative anchors in RPN training. The keyword region proposals from the first-stage RPN contain keyword location information which is subsequently used to explicitly extract keyword related sequential features to train the second-stage KWS. The second-stage system learns to classify and transform region proposal to keyword IDs and ground-truth keyword region respectively. Experiments on the Google Speech Command dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset. Compared to a strong baseline, our proposed two-stage method achieves relatively 27–32% better false rejection rate at one false alarm per hour, while for keyword localization, the two-stage approach achieves more than 0.95 mean intersection-over-union ratio, which is clearly better than the one-stage RPN method.

Read full abstract

Keyword Spotting System Research Articles

Related Topics

Articles published on Keyword Spotting System

Artificial basilar membrane/hair cell integrated acoustic system for keyword spotting in noisy environments inspired by human cochlea

Advantages and Pitfalls of Dataset Condensation: An Approach to Keyword Spotting with Time-Frequency Representations

Robust Dual-Modal Speech Keyword Spotting for XR Headsets.

Hardware–Software Co-Design of an Audio Feature Extraction Pipeline for Machine Learning Applications

Design Of Emergency Keyword Recognition Using Arduino Nano BLE Sense 33 And Edge Impulse

A Low-Power Keyword Spotting System With High-Order Passive Switched-Capacitor Bandpass Filters for Analog-MFCC Feature Extraction

A Resource-Efficient Keyword Spotting System Based on a One-Dimensional Binary Convolutional Neural Network

A 34.7 µW Speech Keyword Spotting IC Based on Subband Energy Feature Extraction

Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks.

Broadcast Attention Learning for Real Telephone Speech Keyword Spotting

KEYWORD SPOTTING SYSTEM WITH NANO 33 BLE SENSE USING EMBEDDED MACHINE LEARNING APPROACH

Fearless Steps APOLLO: Challenges in keyword spotting and topic detection for naturalistic audio streams

Enhancement of formant regions in magnitude spectra to develop children’s KWS system in zero resource scenario

A Configurable Accelerator for Keyword Spotting Based on Small-Footprint Temporal Efficient Neural Network

Generation of Black-box Audio Adversarial Examples Based on Gradient Approximation and Autoencoders

Quality Driven Systematic Approximation for Binary-Weight Neural Network Deployment

A Fast Precision Tuning Solution for Always-On DNN Accelerators

Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution

Ultralow Power Feature Extractor Using Switched-Capacitor-Based Bandpass Filter, Max Operator, and Neural Network Processor for Keyword Spotting

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Keyword Spotting System Research Articles

Related Topics

Articles published on Keyword Spotting System

Artificial basilar membrane/hair cell integrated acoustic system for keyword spotting in noisy environments inspired by human cochlea

Advantages and Pitfalls of Dataset Condensation: An Approach to Keyword Spotting with Time-Frequency Representations

Robust Dual-Modal Speech Keyword Spotting for XR Headsets.

Hardware–Software Co-Design of an Audio Feature Extraction Pipeline for Machine Learning Applications

Design Of Emergency Keyword Recognition Using Arduino Nano BLE Sense 33 And Edge Impulse

A Low-Power Keyword Spotting System With High-Order Passive Switched-Capacitor Bandpass Filters for Analog-MFCC Feature Extraction

A Resource-Efficient Keyword Spotting System Based on a One-Dimensional Binary Convolutional Neural Network

A 34.7 µW Speech Keyword Spotting IC Based on Subband Energy Feature Extraction

Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks.

Broadcast Attention Learning for Real Telephone Speech Keyword Spotting

KEYWORD SPOTTING SYSTEM WITH NANO 33 BLE SENSE USING EMBEDDED MACHINE LEARNING APPROACH

Fearless Steps APOLLO: Challenges in keyword spotting and topic detection for naturalistic audio streams

Enhancement of formant regions in magnitude spectra to develop children’s KWS system in zero resource scenario

A Configurable Accelerator for Keyword Spotting Based on Small-Footprint Temporal Efficient Neural Network

Generation of Black-box Audio Adversarial Examples Based on Gradient Approximation and Autoencoders

Quality Driven Systematic Approximation for Binary-Weight Neural Network Deployment

A Fast Precision Tuning Solution for Always-On DNN Accelerators

Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution

Ultralow Power Feature Extractor Using Switched-Capacitor-Based Bandpass Filter, Max Operator, and Neural Network Processor for Keyword Spotting