Abstract

Keyword spotting (KWS) is an important technique for freeing users' hands in man-machine communication. Building a system with both a low False Reject Ratio (FRR) and a low False Alarm Ratio (FAR) in real scenarios is challenging, especially when computational resources are limited. In this paper, we propose a two-stage KWS system that balances low computation against high performance. To meet the low-computation requirement, we propose an acoustic model based on a multi-resolution GLU stacked 1D convolutional neural network (MRG-S1D). The high-performance requirement is met by a second-stage classification strategy, in which neural network features are selected as classifier input for the final wake-up word detection. This second stage reduces the FAR without increasing the FRR, while adding only a few network parameters. Experiments on a 10K-hour Mandarin dataset show that the proposed model achieves a 39.8% relative FRR reduction compared to the traditional stacked 1D-CNN. With the second-stage classifier, we further reduce the FAR by about 70% relative. In total, the proposed system achieves a 62.1% relative FRR reduction at 0.1 false alarms per hour.
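
The reductions quoted above are relative, meaning each is expressed as a fraction of the baseline rate rather than as an absolute difference. A minimal sketch of that arithmetic follows; the function name `relative_reduction` and the example rates are hypothetical, chosen only for illustration, and are not values reported in the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction of a rate as a fraction of the baseline rate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical FRR values at a fixed false-alarm operating point.
    # These numbers are illustrative assumptions, not results from the paper.
    baseline_frr = 0.10
    improved_frr = 0.04
    reduction = relative_reduction(baseline_frr, improved_frr)
    print(f"relative FRR reduction: {reduction:.1%}")
```

Under this definition, a relative reduction only supports a comparison when both rates are measured at the same operating point, such as the fixed 0.1 false alarms per hour quoted above.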
