Combining evidences from detection sources for query-by-example spoken term detection

Maulik C Madhavi, Hemant A Patil

doi:10.1109/apsipa.2017.8282106

Abstract

The objective of this paper is to explore various detection cues for Query-by-Example Spoken Term Detection (QbE-STD) systems. Under the template-matching paradigm, Dynamic Time Warping (DTW) has been used extensively for the QbE-STD task. The DTW detection score relies on the alignment between the features of the query and those of the test utterance. To improve performance, we supply additional detection cues alongside DTW. These cues are a pseudo-relevant query derived from the first level of detection, the self-similarity matrix, the depth of the valley along the DTW warping path, the term-frequency bag-of-acoustic-words vector, and the weighted mean representation. We propose to use these cues for both the phonetic posteriorgram and the Gaussian posteriorgram. The proposed approach exploits information from a single detection score rather than the fusion of multiple features. It was evaluated on the MediaEval Spoken Web Search (SWS) 2013 database, with Maximum Term Weighted Value (MTWV) as the performance measure. On the evaluation set, score-level fusion of the detection cues with the posteriorgram improved MTWV on average by 0.015 (i.e., 1.5 %) for the phonetic posteriorgram and by 0.025 (i.e., 2.5 %) for the Gaussian posteriorgram.
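To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch of a DTW-based detection score over posteriorgrams and of score-level fusion. It is an illustration under stated assumptions, not the authors' implementation: the negative-log dot-product frame distance, the path-length normalisation, the allowed DTW steps, and the function names `frame_distance`, `subsequence_dtw_score`, and `fuse_scores` are all hypothetical choices, and the abstract does not specify the fusion weights.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Assumed local distance: -log of the dot product of two posterior vectors."""
    return -math.log(max(sum(a * b for a, b in zip(p, q)), eps))


def subsequence_dtw_score(query, utterance):
    """Length-normalised cost of the best query match inside the utterance.

    Lower is better. The query may start and end at any utterance frame.
    """
    n, m = len(query), len(utterance)
    # cost[i][j]: accumulated cost of the best path ending at (query i, utterance j)
    # steps[i][j]: number of cells on that path, used for normalisation
    cost = [[0.0] * m for _ in range(n)]
    steps = [[1] * m for _ in range(n)]
    for j in range(m):
        # Free start: the first query frame can align to any utterance frame.
        cost[0][j] = frame_distance(query[0], utterance[j])
    for i in range(1, n):
        for j in range(m):
            candidates = [(cost[i - 1][j], steps[i - 1][j])]
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)
            cost[i][j] = best_cost + frame_distance(query[i], utterance[j])
            steps[i][j] = best_steps + 1
    # Free end: take the best normalised cost over all end frames.
    return min(cost[n - 1][j] / steps[n - 1][j] for j in range(m))


def fuse_scores(scores, weights):
    """Score-level fusion as a weighted sum of per-cue detection scores (weights assumed)."""
    return sum(w * s for w, s in zip(weights, scores))


# Toy posteriorgrams with two classes per frame.
query = [[1.0, 0.0], [0.0, 1.0]]
utterance = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dtw_score = subsequence_dtw_score(query, utterance)
fused = fuse_scores([dtw_score, 0.4], [0.5, 0.5])
```

In this sketch each additional cue would contribute one more entry to `scores`, so the fusion operates on detection scores for a single feature representation rather than on multiple feature streams.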

Full Text