Analysis of constraints on segmental DTW for the task of query-by-example spoken term detection

Sri Harsha Dumpala, Suryakanth V Gangashetty, K.N.R.K Raju Alluri, Anil Kumar Vuppala

doi:10.1109/indicon.2015.7443702

Abstract

Query-by-example spoken term detection (QbE-STD) is the task of locating the subsequence of a reference utterance that matches a query, where both the query and the reference are audio. Dynamic time warping (DTW) based techniques are used to match two sequences of different lengths in an unsupervised manner. This paper considers a completely unsupervised approach based on segmental DTW (SDTW), a variant of DTW, for QbE-STD, with both reference and query utterances represented as sequences of Gaussian posteriorgram vectors. SDTW is applied with two types of bands, the Sakoe-Chiba band and the Itakura parallelogram, to compare the Gaussian posteriorgrams of the query and the reference. The effect of varying the local constraints of the DTW algorithm on SDTW performance is also analyzed. Results on the MediaEval 2012 dataset indicate that SDTW using a band that allows a variable speaking rate, as in the Itakura parallelogram, outperforms SDTW using a band with a fixed speaking rate, as in the Sakoe-Chiba band, across all variations in local constraints.
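
The two band types contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made for illustration: the function names, the negative-log inner product used as the frame distance, the symmetric step pattern (horizontal, vertical, diagonal) used as the local constraint, and a simplified segment search that slides a window of the query's length over the reference.

```python
import numpy as np

def sakoe_chiba_mask(n, m, radius):
    """Allow cells within `radius` frames of the main diagonal (fixed-width band)."""
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    return np.abs(i - j) <= radius

def itakura_mask(n, m, max_slope=2.0):
    """Allow cells inside a parallelogram anchored at (0, 0) and (n-1, m-1).

    The path slope is bounded between 1/max_slope and max_slope, so the band
    is narrow at both ends and widest in the middle.
    """
    i = np.arange(n, dtype=float)[:, None]
    j = np.arange(m, dtype=float)[None, :]
    lo, hi = 1.0 / max_slope, max_slope
    from_start = (j <= hi * i) & (j >= lo * i)
    from_end = ((m - 1 - j) <= hi * (n - 1 - i)) & ((m - 1 - j) >= lo * (n - 1 - i))
    return from_start & from_end

def posteriorgram_distance(q, r, eps=1e-10):
    """Assumed frame distance: -log of the inner product of posterior vectors."""
    return -np.log(np.clip(q @ r.T, eps, None))

def constrained_dtw(dist, mask):
    """Accumulated DTW cost from (0, 0) to the last cell, visiting masked cells only."""
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(n):
        for j in range(m):
            if mask[i, j]:
                # Predecessors: diagonal, vertical, horizontal.
                best = min(acc[i, j], acc[i, j + 1], acc[i + 1, j])
                acc[i + 1, j + 1] = dist[i, j] + best
    return acc[n, m]

def best_segment(query, reference, mask_fn):
    """Simplified segment search: score every window of the query's length."""
    n = len(query)
    scores = []
    for start in range(len(reference) - n + 1):
        segment = reference[start:start + n]
        dist = posteriorgram_distance(query, segment)
        scores.append(constrained_dtw(dist, mask_fn(n, n)) / n)
    return int(np.argmin(scores)), scores

# Toy posteriorgrams: one-hot vectors over 4 hypothetical Gaussian components.
query = np.eye(4)[[0, 1, 2, 3]]
reference = np.eye(4)[[3, 3, 0, 1, 2, 3, 0, 0]]

start_sc, _ = best_segment(query, reference, lambda n, m: sakoe_chiba_mask(n, m, radius=1))
start_it, _ = best_segment(query, reference, itakura_mask)
```

In this toy case the query occurs verbatim in the reference, so both bands should select the same starting frame. The bands differ in which warping paths they admit: the Sakoe-Chiba mask keeps a constant distance from the diagonal, while the Itakura mask bounds the path slope, which is what the abstract refers to as a variable speaking rate.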

Full Text