Abstract

This paper describes a Japanese spoken document retrieval system that is robust to out-of-vocabulary (OOV) words. A standard approach to spoken document retrieval is to automatically transcribe spoken documents into word sequences, which can then be matched directly against queries. With this approach, documents containing OOV words, or words misrecognized as other words, cannot be retrieved. To avoid this problem, we propose a novel spoken document retrieval method that takes OOV keywords into account. The method combines two techniques. The first creates an index from the outputs of multiple recognizers in order to handle transcriptions that contain misrecognized words; the index improves when the recognizers have different characteristics from one another. The second uses word-based indexing for in-vocabulary keywords and syllable-based indexing for OOV keywords, switching between them according to whether each query keyword is in-vocabulary or OOV. Evaluation results clearly show that this approach benefits from the advantages of both indexing methods and that the proposed technique is quite effective in robustly retrieving spoken documents. © 2004 Wiley Periodicals, Inc. Syst Comp Jpn, 35(14): 44–53, 2004; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.10697
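The two ideas in the abstract (merging several recognizers' outputs into one index, and switching between word-based and syllable-based lookup depending on whether a query keyword is in-vocabulary) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`build_word_index`, `retrieve`, `SYLLABLE_LEXICON`, the toy documents) is hypothetical, the syllables are romanized for readability, and exact subsequence matching is assumed only because the abstract does not specify how syllable sequences are matched.

```python
from collections import defaultdict

# Toy recognizer vocabulary (assumption for illustration).
VOCABULARY = {"tokyo", "tenki", "denki", "shiba", "ya"}

# Hypothetical keyword-to-syllable lexicon used only for OOV query keywords.
SYLLABLE_LEXICON = {"shibuya": ["shi", "bu", "ya"]}

# Word transcripts: one word list per recognizer for each document.
WORD_TRANSCRIPTS = {
    "doc1": [["tokyo", "tenki"], ["tokyo", "denki"]],
    "doc2": [["shiba", "ya", "denki"], ["shiba", "ya", "tenki"]],
}

# Syllable transcripts: one syllable sequence per document.
SYLLABLE_TRANSCRIPTS = {
    "doc1": ["to", "kyo", "te", "n", "ki"],
    "doc2": ["shi", "bu", "ya", "te", "n", "ki"],
}


def build_word_index(transcripts):
    """Merge all recognizers' word outputs into one inverted index."""
    index = defaultdict(set)
    for doc_id, recognizer_outputs in transcripts.items():
        for words in recognizer_outputs:
            for word in words:
                index[word].add(doc_id)
    return index


def contains_subsequence(sequence, target):
    """Return True if `target` occurs as a contiguous run in `sequence`."""
    n = len(target)
    return n > 0 and any(
        sequence[i:i + n] == target for i in range(len(sequence) - n + 1)
    )


def retrieve(keyword, word_index):
    """Use the word index for in-vocabulary keywords, syllables for OOV."""
    if keyword in VOCABULARY:
        return set(word_index.get(keyword, set()))
    target = SYLLABLE_LEXICON.get(keyword, [])
    return {
        doc_id
        for doc_id, syllables in SYLLABLE_TRANSCRIPTS.items()
        if contains_subsequence(syllables, target)
    }


WORD_INDEX = build_word_index(WORD_TRANSCRIPTS)
```

In this toy data, "tenki" reaches doc2 only through the second recognizer's output, which is the benefit of indexing multiple recognizers. The OOV keyword "shibuya" never appears in any word transcript, so it can only be found through the syllable path.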

