Abstract

Recently, language models have gained importance in the field of information retrieval. In this paper, we propose a generic language model to improve a content-based document retrieval system. In this approach, character images are extracted, clustered, and analyzed to form high-level semantic terms using a statistical document model. This model simulates the long-term relationships between characters. Documents are then indexed according to these terms, and a query document is used to retrieve the relevant documents. The query document can be a single keyword, or it can be synthesized from a text string. The aim is to generate a semantic representation from low-level image pixels through pattern matching and document modeling. The conventional approach to generating semantic terms in document retrieval includes every possible symbol sequence in the feature representation. In contrast, our approach considerably reduces the dimensionality of the feature space while producing retrieval results comparable to those of the conventional and state-of-the-art approaches.
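
The indexing and retrieval steps described above can be illustrated with a minimal sketch. It is not the statistical document model of the paper. It assumes that character images have already been clustered, so that each document is a sequence of integer cluster IDs, and it substitutes a simple document-frequency filter on symbol n-grams for the term-formation step. The function names, the `min_df` threshold, and the tf-idf weighting with cosine similarity are all illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Return all contiguous n-grams of a symbol sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def select_terms(docs, n=2, min_df=2):
    """Keep only n-grams that occur in at least min_df documents.

    Assumption: this frequency filter stands in for the paper's
    statistical term formation, which the abstract does not specify.
    """
    df = Counter()
    for seq in docs:
        df.update(set(ngrams(seq, n)))
    return {term: count for term, count in df.items() if count >= min_df}

def vectorize(seq, term_df, n_docs, n=2):
    """Map a sequence to a sparse tf-idf vector over the selected terms."""
    tf = Counter(t for t in ngrams(seq, n) if t in term_df)
    return {t: c * math.log(1 + n_docs / term_df[t]) for t, c in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, n=2, min_df=2):
    """Rank documents by similarity to the query, best match first."""
    term_df = select_terms(docs, n, min_df)
    index = [vectorize(d, term_df, len(docs), n) for d in docs]
    q = vectorize(query, term_df, len(docs), n)
    scores = [(cosine(q, v), i) for i, v in enumerate(index)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

# Toy corpus: each document is a sequence of hypothetical cluster IDs.
docs = [
    [0, 1, 2, 3],
    [0, 1, 2, 4],
    [5, 6, 7, 8],
    [5, 6, 3, 0],
]
query = [0, 1, 2]  # e.g. a keyword synthesized from a text string
ranking = retrieve(query, docs)
```

In this toy corpus, the filter keeps 3 bigram terms, whereas enumerating every possible bigram over the 9 cluster IDs would give 81 dimensions. The two documents that share the query's symbol sequence rank first.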
