Abstract

Query-by-example spoken term detection (QbE-STD) is the task of finding the subsequence of a reference utterance that matches a query, where both the query and the reference are in audio format. Dynamic time warping (DTW) based techniques are explored to match two sequences of different lengths in an unsupervised manner. In this paper, a completely unsupervised approach based on Segmental DTW (SDTW), a variant of DTW, is considered for QbE-STD, with both the reference and the query utterances represented as sequences of Gaussian posteriorgram vectors. SDTW is applied with two different types of bands, namely the Sakoe-Chiba band and the Itakura parallelogram, to compare the Gaussian posteriorgrams of the query and the reference sequence. The effect of varying the local constraints of the DTW algorithm on the performance of SDTW is also analyzed. Results obtained on the MediaEval 2012 dataset indicate that SDTW using a band with variable speaking rate, as in the Itakura parallelogram, performs better than SDTW using a band with fixed speaking rate, as in the Sakoe-Chiba band, across all variations in local constraints.
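
To make the difference between the two band types concrete, the following is a minimal sketch, not the paper's implementation. It builds a fixed-width Sakoe-Chiba band and a slope-limited Itakura parallelogram as boolean masks and runs plain DTW restricted to the allowed cells. The function names, the `radius` and `max_slope` parameters, and the choice of local steps are assumptions made for illustration. The sketch does not attempt the segmental part of SDTW or the Gaussian posteriorgram distance, and it takes a precomputed frame-to-frame cost matrix as input.

```python
import math


def sakoe_chiba_mask(n, m, radius):
    """Fixed-width band: cell (i, j) is allowed when |i - j| <= radius."""
    return [[abs(i - j) <= radius for j in range(m)] for i in range(n)]


def itakura_mask(n, m, max_slope=2.0):
    """Parallelogram: path slope stays within [1/max_slope, max_slope],
    measured from both the start corner and the end corner."""
    s = max_slope
    mask = []
    for i in range(n):
        row = []
        for j in range(m):
            ri, rj = (n - 1) - i, (m - 1) - j  # remaining steps to the end
            row.append(j <= s * i and i <= s * j and rj <= s * ri and ri <= s * rj)
        mask.append(row)
    return mask


def banded_dtw(cost, mask):
    """Accumulated DTW cost from (0, 0) to (n-1, m-1) using only masked cells.

    Local steps assumed here: vertical, horizontal, and diagonal.
    Returns math.inf when no path lies inside the band.
    """
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if not mask[i][j]:
                continue
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[i - 1][j] if i > 0 else math.inf,
                    acc[i][j - 1] if j > 0 else math.inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                )
            acc[i][j] = cost[i][j] + best
    return acc[n - 1][m - 1]


# Toy usage: scalar "frames" and absolute difference as the local distance.
query = [0, 1, 2, 3, 4]
reference = [0, 1, 2, 3, 4]
cost = [[abs(q - r) for r in reference] for q in query]
d_band = banded_dtw(cost, sakoe_chiba_mask(5, 5, radius=1))
d_para = banded_dtw(cost, itakura_mask(5, 5, max_slope=2.0))
```

In this sketch the Sakoe-Chiba mask allows the same deviation from the diagonal everywhere, whereas the Itakura mask is narrow near the two end points and widest in the middle.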

