Abstract

Query-by-example spoken term detection (STD) systems can make good use of automatic speech recognition (ASR), especially when the recognition error rate is low. However, ASR suffers from the out-of-vocabulary (OOV) problem. OOV words at the ASR stage significantly affect STD performance in speech retrieval and can cause false retrievals for query words. In recent studies, end-to-end (E2E) ASR systems have achieved performance competitive with traditional DNN-HMM ASR systems. It has also been shown that E2E ASR systems can reduce the impact of the OOV problem by using characters or sub-words as the output unit during recognition. In this paper, we propose an improved method that adapts E2E ASR modeling to a speech retrieval task, building on an STD method that considers acoustic similarity at the sub-phone level. Experimental results on the NTCIR-12 SpokenQuery&Doc-2 task show that the STD method using E2E ASR improves retrieval performance over the STD method using DNN-HMM ASR. We attribute this to the fact that E2E ASR reduced the OOV problem for both spoken documents and spoken queries.

