Query-by-Example spoken content retrieval is a challenging task when large volumes of spoken content accumulate in repositories without annotation. In the absence of annotation, retrieval relies on capturing the similarities between the query and spoken terms directly from their acoustic feature representations. Dynamic Time Warping (DTW)-centric techniques identify the optimal alignment between the acoustic feature representations and thereby capture the similarities between the query and spoken terms. Although feasible, DTW-centric techniques produce many false alarms because of the variabilities present in natural speech, which degrades performance. The proposed approach addresses these variability challenges in two stages. In the first stage, a speaker-independent acoustic feature representation is obtained from deep convolutional neural networks to reduce speaker variability. In the second stage, the similarities between the query and spoken terms are captured using a heuristic search method. The proposed approach was compared with other state-of-the-art methods on the Microsoft Low-Resource Language speech corpus. Across languages, it achieved a 3% improvement in hit ratio and a 32% reduction in false alarm ratio.
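For readers unfamiliar with the DTW baseline mentioned above, the following is a minimal sketch of textbook DTW alignment cost between two sequences of feature vectors. It is not the heuristic search method proposed in this work, whose details are not given here. The function name `dtw_distance`, the Euclidean frame distance, and the toy one-dimensional features are illustrative assumptions.

```python
import math


def dtw_distance(query, utterance):
    """Textbook DTW alignment cost between two sequences of feature vectors.

    Assumption: frames are compared with Euclidean distance; real systems
    may use other local distances and subsequence variants of DTW.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated cost of aligning
    # query[:i] with utterance[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], utterance[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],      # query frame repeated
                                 cost[i][j - 1],      # utterance frame repeated
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


# Toy example: the same pattern spoken at a slower rate aligns at zero cost.
query = [(0.0,), (1.0,)]
slow_copy = [(0.0,), (0.0,), (1.0,)]
print(dtw_distance(query, slow_copy))
```

Because the warping path absorbs differences in speaking rate, a time-stretched copy of the query aligns at zero cost. Speaker-dependent differences in the features themselves are not absorbed in this way, which is one reason a plain DTW match can raise false alarms on natural speech.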