Abstract

Speech keyword search (KWS) is the task of automatically detecting specified keywords in continuous speech; single-keyword detection corresponds to the keyword wake-up task. For many practical applications of such small-vocabulary recognition tasks, building a full large-vocabulary speech recognition system is costly and unnecessary. The main challenge for keyword search remains the scarcity of data resources. Speech pre-training has become an effective technique, showing its superiority across a variety of tasks: the key idea is to learn effective representations from large amounts of unlabeled data so as to improve performance when labeled data for downstream tasks are limited. This research combines unsupervised pre-training with keyword search based on the Keyword-Filler model, introducing unsupervised pre-training into speech keyword search. The pre-trained architecture selected is Wav2vec2.0, including its multilingual variant XLSR. The results show that training on features extracted by the pre-trained model outperforms the baseline. Under low-resource conditions the baseline performance drops significantly, whereas the performance of the fine-tuned pre-trained model does not decrease and even increases slightly in some intervals. This indicates that the pre-trained model can be fine-tuned to achieve good performance on very little data, and it demonstrates the advantage and practical value of keyword search based on unsupervised pre-training.
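The Keyword-Filler model mentioned in the abstract detects a keyword by comparing the best path through a left-to-right keyword model (optionally surrounded by filler) against an all-filler path, thresholding the resulting log-likelihood ratio. The sketch below is an illustrative toy implementation of that scoring idea, not the paper's actual system: the two-state keyword model, the per-frame log-likelihoods, and the function name are all assumptions made for demonstration.

```python
def keyword_filler_score(kw_loglik, filler_loglik):
    """Log-likelihood-ratio detection score for the Keyword-Filler idea (toy sketch).

    kw_loglik[s][t]     : log-likelihood of keyword state s at frame t
                          (left-to-right keyword model with self-loops)
    filler_loglik[t]    : log-likelihood of the filler/background state at frame t
    Returns the Viterbi score of the best "filler* keyword filler*" path
    minus the all-filler path score; a large value suggests the keyword is present.
    """
    T = len(filler_loglik)
    S = len(kw_loglik)
    NEG = float("-inf")
    # dp_kw[s][t]: best score of a path ending in keyword state s at frame t
    dp_kw = [[NEG] * T for _ in range(S)]
    pre = 0.0                      # score of staying in filler for frames 0..t-1
    all_filler = sum(filler_loglik)
    best_ratio = NEG
    for t in range(T):
        for s in range(S):
            stay = dp_kw[s][t - 1] if t > 0 else NEG          # self-loop
            if s == 0:
                enter = pre                                    # enter keyword from leading filler
            else:
                enter = dp_kw[s - 1][t - 1] if t > 0 else NEG  # advance to next keyword state
            dp_kw[s][t] = max(stay, enter) + kw_loglik[s][t]
        # keyword ends at frame t; remaining frames are trailing filler
        tail = sum(filler_loglik[t + 1:])
        best_ratio = max(best_ratio, dp_kw[S - 1][t] + tail - all_filler)
        pre += filler_loglik[t]
    return best_ratio

# Toy example: 4 frames, a 2-state keyword clearly fires at frames 1-2.
filler = [-1.0, -5.0, -5.0, -1.0]
keyword = [[-9.0, -1.0, -9.0, -9.0],   # keyword state 0
           [-9.0, -9.0, -1.0, -9.0]]   # keyword state 1
score = keyword_filler_score(keyword, filler)
```

In a real system the per-frame log-likelihoods would come from an acoustic model (in this research, one trained on features extracted by Wav2vec2.0), and the ratio would be compared against a tuned detection threshold.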
