Re-ranking spoken term detection with acoustic exemplars of keywords

Van Tung Pham, Haihua Xu, Xiong Xiao, Nancy F. Chen, Eng Siong Chng, Haizhou Li

doi:10.1016/j.specom.2018.09.004

Abstract

Spoken term detection (STD) systems rank hypothesized detections by scores that indicate how confident the system is that each detection is a true instance of the keyword. Many STD systems rely on automatic speech recognition (ASR) to transcribe the speech content into a lattice representation. In such systems, the detection scores are usually estimated as the posterior probabilities of the keyword in the decoding lattices. These scores may be inaccurate, e.g. due to imperfect modeling of speech and noise. To improve the ranking of hypothesized detections, we propose to directly use the acoustic similarity scores between the speech signal of each hypothesized detection and that of keyword exemplars. A keyword exemplar is a true instance of the keyword obtained from an annotated speech corpus. When no exemplar is available, we propose to synthesize exemplars from the annotated speech corpus. Given the acoustic similarity between the hypothesized detections and the keyword exemplars, we propose two re-ranking methods: re-ranking by score fusion and re-ranking by similarity graph. Experimental results on the NIST OpenKWS14 and OpenKWS15 datasets show that the proposed re-ranking framework significantly outperforms ranking based only on ASR confidence scores, as well as other re-ranking methods that do not use keyword exemplars.
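
To make the idea of re-ranking by score fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes that acoustic similarity is derived from a length-normalized dynamic time warping (DTW) distance, that the distance is mapped to a similarity with exp(-distance), and that the fused score is a linear interpolation with a hypothetical `weight` parameter. None of these choices are specified in the abstract, and the function names are illustrative only.

```python
import math


def dtw_distance(a, b):
    """Length-normalized DTW distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def similarity(detection, exemplars):
    """Mean similarity in (0, 1] to the exemplars (exp(-distance) is an assumed mapping)."""
    return sum(math.exp(-dtw_distance(detection, e)) for e in exemplars) / len(exemplars)


def rerank_by_fusion(detections, exemplars, weight=0.5):
    """Re-rank detections by interpolating ASR score and exemplar similarity.

    detections: list of (detection_id, asr_score, feature_sequence) tuples.
    weight: assumed interpolation weight on the acoustic similarity.
    Returns a list of (detection_id, fused_score), highest score first.
    """
    fused = [
        (det_id, (1 - weight) * asr + weight * similarity(feats, exemplars))
        for det_id, asr, feats in detections
    ]
    return sorted(fused, key=lambda item: item[1], reverse=True)
```

With `weight=0` the ranking reduces to the ASR confidence ranking, so the weight controls how much the exemplar evidence is allowed to reorder the hypothesized detections.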

Full Text