Abstract
Searching and retrieving documents from large historical archives is challenging for the information retrieval (IR) field because historians typically rely on their own knowledge, experience, and intuition. Several works have applied IR to historical documents. However, conventional IR models mostly use a simple bag-of-words (BOW) approach and are usually unable to support precise document retrieval in the domain of history. We propose an ontology-based approach to semantically index and rank rich historical documents. Historical documents relating to the Vietnam War were chosen for this study. Several existing ontologies were reviewed to identify the most suitable concepts and properties, which carry rich information about relevant entities such as events, time, and people. The domain ontology was developed from the existing Simple News and Press (SNaP) ontology and extended with concepts related to the Vietnam War. The ontology was then semantically mapped to concepts found in a collection of 133 documents relating to the Vietnam War. We also propose a simple ontology-based weighting mechanism derived from the classic tf-idf scoring scheme. Finally, 20 SPARQL queries were implemented for the evaluation. The evaluation shows that the proposed ontology-based approach achieved better precision and recall than the baseline BM25 probabilistic retrieval model. These results suggest that ontology-based document retrieval can compete with the keyword-based approach.
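As an illustration of what a tf-idf-style weight over ontology concepts might look like, the following is a minimal sketch and not the weighting scheme defined in the paper, whose exact formula the abstract does not give. It assumes each document has already been annotated with a list of ontology concept labels. The function names (`concept_tfidf`, `rank`), the document identifiers, and the concept labels are all hypothetical.

```python
import math
from collections import Counter


def concept_tfidf(annotations):
    """Compute a tf-idf weight for every (document, concept) pair.

    annotations: dict mapping a document id to the list of ontology
    concept labels found in that document (repeats allowed).
    This is plain tf-idf applied to concepts instead of words; the
    paper's own weighting may differ.
    """
    n_docs = len(annotations)
    # Document frequency: in how many documents each concept occurs.
    df = Counter()
    for concepts in annotations.values():
        df.update(set(concepts))

    weights = {}
    for doc_id, concepts in annotations.items():
        tf = Counter(concepts)
        total = len(concepts)
        weights[doc_id] = {
            concept: (count / total) * math.log(n_docs / df[concept])
            for concept, count in tf.items()
        }
    return weights


def rank(query_concepts, weights):
    """Order document ids by the summed weight of the query concepts."""
    scores = {
        doc_id: sum(w.get(c, 0.0) for c in query_concepts)
        for doc_id, w in weights.items()
    }
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical annotations, invented for this sketch only.
annotations = {
    "doc1": ["TetOffensive", "Saigon", "TetOffensive"],
    "doc2": ["Saigon", "ParisPeaceAccords"],
    "doc3": ["Saigon"],
}
weights = concept_tfidf(annotations)
ranking = rank(["TetOffensive"], weights)
```

In this toy collection, a concept that appears in every document (here `Saigon`) receives a weight of zero, so it does not influence the ranking, while a concept confined to one document dominates the score for queries that mention it.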
More From: International Journal on Advanced Science, Engineering and Information Technology