Abstract

Query-by-example spoken term detection (QbE-STD) is a popular keyword detection method in low-resource settings. It can build a keyword query system with decent performance when labeled speech is scarce and pronunciation dictionaries are unavailable. In recent years, neural acoustic word embeddings (NAWEs) have become a common approach to QbE-STD. To make the embedded features extracted by the neural network capture more accurate contextual information, we use wav2vec pre-training to improve the performance of the network. Compared with the Mel-frequency cepstral coefficient (MFCC) system, the average precision (AP) improves by a relative 11.1%. We also find that a system splicing wav2vec and MFCC features achieves a higher AP, demonstrating that wav2vec does not capture all of the spectral information. To accelerate the convergence of the spliced system, we replace the triplet loss with the circle loss, which reduces the number of epochs needed to converge by about 40% on average. The circle loss also yields a relative AP improvement of more than 4.9%. The AP of our best-performing system is 7.7% better than the wav2vec baseline system and 19.7% better than the MFCC baseline system.
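For reference, the circle loss follows the general formulation of Sun et al. (2020); the within-class similarities $s_p$, between-class similarities $s_n$, scale factor $\gamma$, and margin $m$ below describe that general form, not this paper's specific hyperparameter choices, which the abstract does not state:

\[
\mathcal{L}_{\mathrm{circle}} = \log\!\left[ 1 + \sum_{j=1}^{L} \exp\!\big(\gamma\, \alpha_n^j (s_n^j - \Delta_n)\big) \sum_{i=1}^{K} \exp\!\big(-\gamma\, \alpha_p^i (s_p^i - \Delta_p)\big) \right],
\]

where $\alpha_p^i = [O_p - s_p^i]_+$ and $\alpha_n^j = [s_n^j - O_n]_+$ are adaptive weights, with optima and margins set as $O_p = 1 + m$, $O_n = -m$, $\Delta_p = 1 - m$, and $\Delta_n = m$. Unlike the triplet loss, which penalizes $s_n - s_p$ with a single fixed margin, the circle loss re-weights each similarity by its distance from its optimum, which is consistent with the faster convergence reported above.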
