Abstract

Spoken term detection (STD) systems rank hypothesized detections by scores that indicate how confident the system is that each hypothesized detection is a true instance of the keyword. Many STD systems rely on automatic speech recognition (ASR) to transcribe the speech content into lattices. In such systems, the detection scores are usually estimated as the posterior probabilities of the keyword in the decoding lattices. These scores may be inaccurate, e.g. due to imperfect modeling of speech and noise. To improve the ranking of hypothesized detections, we propose to directly use the acoustic similarity between the speech signal of each hypothesized detection and that of keyword exemplars. A keyword exemplar is a true instance of the keyword obtained from an annotated speech corpus. When no exemplar is available, we propose to synthesize exemplars from the annotated speech corpus. Given the acoustic similarity between the hypothesized detections and the keyword exemplars, we propose two re-ranking methods: re-ranking by score fusion and re-ranking by similarity graph. Experimental results on the NIST OpenKWS14 and OpenKWS15 datasets show that the proposed re-ranking framework significantly outperforms both the ranking based only on ASR confidence scores and other re-ranking methods that do not use keyword exemplars.
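The two re-ranking ideas can be illustrated with a small sketch. The abstract does not specify the similarity measure or the exact fusion and graph formulas, so everything below is an assumption rather than the paper's method: acoustic similarity is taken as exp(-d) of a length-normalized dynamic time warping (DTW) distance between frame-feature sequences, score fusion is a linear interpolation with a hypothetical weight `alpha`, and graph re-ranking is a PageRank-style random walk with restart over exemplars and detections. All function names are illustrative.

```python
import math


def dtw_distance(x, y):
    """Length-normalized DTW distance between two feature sequences.

    Each sequence is a list of equal-dimension frame vectors (e.g. tuples of floats).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(x[i - 1], y[j - 1])  # Euclidean frame distance
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)  # normalize so long segments are not penalized


def similarity(x, y):
    """Map a DTW distance to a similarity in (0, 1] (assumed mapping)."""
    return math.exp(-dtw_distance(x, y))


def rerank_by_fusion(detections, exemplars, alpha=0.5):
    """Score fusion: interpolate the ASR posterior with mean exemplar similarity.

    detections: list of (det_id, asr_posterior, features); exemplars: list of features.
    alpha is a hypothetical fusion weight (1.0 = ASR scores only).
    """
    fused = []
    for det_id, posterior, feats in detections:
        sim = sum(similarity(feats, e) for e in exemplars) / len(exemplars)
        fused.append((alpha * posterior + (1 - alpha) * sim, det_id))
    fused.sort(reverse=True)
    return [det_id for _, det_id in fused]


def rerank_by_graph(detections, exemplars, damping=0.85, iters=50):
    """Similarity graph: random walk with restart over exemplars + detections.

    Assumed design: exemplars get a restart prior of 1.0, detections get their
    ASR posterior; edge weights are pairwise similarities, column-normalized.
    """
    feats = list(exemplars) + [f for _, _, f in detections]
    n = len(feats)
    W = [[0.0 if i == j else similarity(feats[i], feats[j]) for j in range(n)]
         for i in range(n)]
    col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
    prior = [1.0] * len(exemplars) + [p for _, p, _ in detections]
    total = sum(prior)
    prior = [p / total for p in prior]
    r = prior[:]
    for _ in range(iters):
        # Restart with probability (1 - damping), otherwise follow similarity edges.
        r = [(1 - damping) * prior[i]
             + damping * sum(W[i][j] / col_sums[j] * r[j] for j in range(n))
             for i in range(n)]
    k = len(exemplars)
    scored = sorted(((r[k + idx], det_id) for idx, (det_id, _, _) in enumerate(detections)),
                    reverse=True)
    return [det_id for _, det_id in scored]
```

With `alpha=1.0` the fusion reduces to the ASR-only ranking, which gives a simple baseline. In the graph variant, a detection that closely resembles the exemplars (or other well-supported detections) can gain score even when its lattice posterior is low.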
