Abstract

Query-by-example spoken term detection (QbE-STD) is a popular keyword detection method in low-resource settings. It can build a keyword query system with decent performance when labeled speech is scarce and pronunciation dictionaries are unavailable. In recent years, neural acoustic word embeddings (NAWEs) have become a common approach to QbE-STD. To make the embedded features extracted by the neural network capture more accurate contextual information, we use wav2vec pre-training to improve the performance of the network. Compared with the Mel-frequency cepstral coefficient (MFCC) system, the average precision (AP) improves by a relative 11.1%. We also find that a system splicing wav2vec and MFCC features achieves a higher AP, demonstrating that wav2vec does not capture all of the spectral information. To accelerate the convergence of the spliced system, we replace the triplet loss with the circle loss, which reduces the number of epochs needed to converge by about 40% on average. The circle loss also yields a relative AP improvement of more than 4.9%. The AP of our best-performing system is 7.7% better than the wav2vec baseline system and 19.7% better than the MFCC baseline system.
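For reference, the circle loss follows the general formulation of Sun et al. (2020); the within-class similarities $s_p$, between-class similarities $s_n$, scale factor $\gamma$, and margin $m$ below describe that general form, not this paper's specific hyperparameter choices, which the abstract does not state:

\[
\mathcal{L}_{\mathrm{circle}} = \log\!\left[ 1 + \sum_{j=1}^{L} \exp\!\big(\gamma\, \alpha_n^j (s_n^j - \Delta_n)\big) \sum_{i=1}^{K} \exp\!\big(-\gamma\, \alpha_p^i (s_p^i - \Delta_p)\big) \right],
\]

where $\alpha_p^i = [O_p - s_p^i]_+$ and $\alpha_n^j = [s_n^j - O_n]_+$ are adaptive weights, with optima and margins set as $O_p = 1 + m$, $O_n = -m$, $\Delta_p = 1 - m$, and $\Delta_n = m$. Unlike the triplet loss, which penalizes $s_n - s_p$ with a single fixed margin, the circle loss re-weights each similarity by its distance from its optimum, which is consistent with the faster convergence reported above.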
