Abstract

Searching and retrieving documents from large historical archives is challenging for the information retrieval (IR) field because historians typically rely on their knowledge, experience, and intuition. Several works have applied IR to historical documents. However, conventional IR models mostly use a simple Bag-of-Words (BOW) approach and are usually unable to support precise document retrieval in the history domain. We propose an ontology-based approach to semantically index and rank rich historical documents. Historical documents relating to the Vietnam War were chosen for this study. Several existing ontologies were reviewed to identify the most suitable concepts and properties, which contain rich information pertaining to relevant entities such as events, time, and people. The domain ontology was developed by utilizing the existing Simple News and Press (SNaP) ontology and extending it with concepts related to the Vietnam War. The ontology was then semantically mapped to concepts found in a collection of 133 documents relating to the Vietnam War. In this paper, we also propose a simple ontology-based weighting mechanism derived from the classic tf-idf scoring scheme. Finally, 20 SPARQL queries were implemented for the evaluation. The evaluation shows that the proposed ontology-based approach achieved better results than the baseline BM25 probabilistic retrieval model in terms of precision and recall. The ontology-based approach to document retrieval can thus compete with the keyword-based approach.
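To make the idea of a tf-idf-derived weighting over ontology concepts more concrete, the following is a minimal sketch only. It is not the weighting mechanism proposed in the paper, whose details are not given in this abstract. It assumes each document has already been annotated with a list of ontology concept labels, and it applies the classic tf-idf formula to those labels instead of raw words. The function names `concept_tfidf` and `rank`, and the concept labels used in the usage example, are hypothetical.

```python
import math
from collections import Counter


def concept_tfidf(doc_concepts):
    """Compute classic tf-idf weights over concept annotations.

    doc_concepts maps a document id to the list of ontology concept
    labels found in that document (assumed input format).
    """
    n_docs = len(doc_concepts)

    # Document frequency: in how many documents each concept occurs.
    df = Counter()
    for concepts in doc_concepts.values():
        df.update(set(concepts))

    weights = {}
    for doc_id, concepts in doc_concepts.items():
        tf = Counter(concepts)
        total = len(concepts)
        weights[doc_id] = {
            # Relative term frequency times inverse document frequency.
            concept: (count / total) * math.log(n_docs / df[concept])
            for concept, count in tf.items()
        }
    return weights


def rank(query_concepts, weights):
    """Order document ids by the summed weights of the query concepts."""
    scores = {
        doc_id: sum(w.get(c, 0.0) for c in query_concepts)
        for doc_id, w in weights.items()
    }
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

As a usage example with made-up annotations, `rank(["TetOffensive"], concept_tfidf({"d1": ["Battle", "TetOffensive", "TetOffensive"], "d2": ["Battle", "Person"], "d3": ["Person"]}))` places `d1` first, because it is the only document annotated with that concept.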
