Abstract
Speech Keyword Retrieval (SKR) has received considerable critical attention. SKR aims to retrieve data from a speech repository given a spoken query, and retrieval accuracy often depends on the performance of the acoustic model. In this paper, we propose a new speech keyword retrieval framework called DCNN-CTC, which uses a Deep Convolutional Neural Network (DCNN) based on Connectionist Temporal Classification (CTC). The proposed method provides new insights into multimedia information retrieval. Features are extracted by the DCNN, and the pre-trained models are fine-tuned with a CTC loss to predict target keywords, so that acoustic model training is completely end-to-end. The approach does not require the data to be aligned and labeled one by one in advance, and CTC directly outputs the probability of the predicted sequence, which greatly improves the processing performance of the speech retrieval system. Our experimental results on benchmark datasets show that our approach leads to stable and robust retrieval performance, and that the precision and recall of DCNN-CTC are much higher than those of the baseline system.
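To illustrate how CTC can assign a probability to a label sequence without frame-level alignment, the following is a minimal sketch in plain Python. It is not the DCNN-CTC implementation described in this paper: the function names, the convention that index 0 is the blank symbol, and the toy per-frame probabilities are all assumptions made for illustration. The per-frame probabilities stand in for the output of an acoustic model such as the DCNN.

```python
def ctc_sequence_probability(probs, target, blank=0):
    """Sum the probability of every frame-level path that collapses to `target`.

    probs  : list of T frames, each a list of per-symbol probabilities
             (hypothetical acoustic-model output).
    target : list of symbol indices without blanks.
    blank  : index of the CTC blank symbol (assumed to be 0 here).
    """
    # Interleave blanks around the target labels: [b, y1, b, y2, ..., b].
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    # Forward variables for the first frame: a path may start with
    # the leading blank or with the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # A valid path ends on the last label or on the trailing blank.
    result = alpha[size - 1]
    if size > 1:
        result += alpha[size - 2]
    return result


def ctc_greedy_decode(probs, blank=0):
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded, previous = [], None
    for symbol in best:
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return decoded


# Toy example: two frames, two symbols (0 = blank, 1 = a keyword unit).
toy_probs = [[0.5, 0.5], [0.5, 0.5]]
p_keyword = ctc_sequence_probability(toy_probs, [1])
p_empty = ctc_sequence_probability(toy_probs, [])
```

In this toy case the three paths that collapse to the single label (label-label, label-blank, blank-label) are summed together, which is why no alignment between frames and labels has to be supplied in advance.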