Machine learning and ontology-based novel semantic document indexing for information retrieval

Anil Sharma,Suresh Kumar

doi:10.1016/j.cie.2022.108940

Abstract

The goal of information retrieval (IR) systems is to find the contents most closely related to the user's information needs from a pool of information. However, conventional IR methods neglect semantic descriptions of document contents and index documents based on the words that they include. When users and indexing systems use different terms to express the same subject, a vocabulary gap emerges. To overcome this limitation and to enhance the effectiveness of the IR systems, this paper introduced a novel hybrid semantic document indexing employing machine learning and domain ontology. The presented technique uses a skip-gram with negative sampling-based machine learning model and a domain ontology to determine the concepts for annotating unstructured documents. The proposed work also introduced multiple feature based novel concept ranking algorithm where statistical, semantic, and scientific named entity features of the concept were used to assign relevance weight to the annotations. The fuzzy analytical hierarchy process was used to derive the parameters of these feature weights. The final step is to rank the concepts according to their relevance to the document. Five benchmark publicly accessible datasets from the computer science domain were used in a series of experiments to validate the results of presented method. Experiment findings showed that the proposed method performs better than state-of-the-art techniques on these datasets, by improving average accuracy by 29%, while an improvement of 25% was recorded in F-measure. The improvement in average accuracy demonstrates that the performance of the proposed approach is better than the state-of-the-art methods in extracting document concepts accurately even when the same concept is referred to by distinct terms in the document and domain ontologies. The proposed system's ability to find similar concepts when the documents possess no concept from domain ontology is demonstrated by the improvement in F-measure, which is attributed to high recall rates of the proposed indexing scheme while maintaining high accuracy.

Full Text