Robust Query-by-example Spoken Term Detection for Unknown Words Using Speech Retrieval-oriented E2E ASR Modeling

Takumi Kurokawa, Atsuhiko Kai

doi:10.1109/gcce53005.2021.9621804

Abstract

Query-by-example spoken term detection (STD) systems can make good use of automatic speech recognition (ASR), especially when the error rate is low. However, ASR suffers from the out-of-vocabulary (OOV) problem. The OOV problem in the ASR stage has a significant impact on the performance of STD for speech retrieval and can cause false retrievals for query words. In recent studies, end-to-end (E2E) ASR systems have achieved performance competitive with traditional DNN-HMM ASR systems. It has also been shown that E2E ASR systems can reduce the impact of the OOV problem by using characters or sub-words as the output units during recognition. In this paper, we propose an improved method using E2E ASR modeling adapted to a speech retrieval task, based on the STD method that considers acoustic similarity at the sub-phone level. Experimental results on the NTCIR-12 SpokenQuery&Doc-2 task show that the STD method using E2E ASR improves retrieval performance over the STD method using DNN-HMM ASR. This is attributed to the fact that E2E ASR was able to reduce the OOV problem for spoken documents and spoken queries.

Full Text