Document retrieval using clustering-based Aquila hash-Q optimization with query expansion based on pseudo relevance feedback

Bhushan Inje, Kapil Nagwanshi, Radhakrishna Rambola

doi:10.1080/1206212x.2024.2342715

Abstract

A document retrieval system helps users quickly and easily retrieve the documents relevant to their query. In practice, document retrieval is difficult because of high data volumes, unstructured data, and heterogeneous data formats. Although many techniques have been proposed, major problems such as vocabulary mismatch and non-linear matching remain unsolved. In this work, we propose the Aquila hash-q optimizer as a matching technique, combined with clustering, to retrieve documents for a user query in a time-efficient manner and without collisions. First, the documents are preprocessed by removing stop words and stemming, and the documents in each cluster are grouped into a single document using Hierarchical Density-based Sampling Spatial Cluster of Applications with Noise (HDBSSCAN) clustering. This clustering algorithm is powerful, robust to noise, and scalable, and it identifies clusters of related documents. In addition, its sampling technique speeds up clustering by reducing the size of the document, which improves the performance of the document retrieval system. Second, the queries are matched using the Aquila hash-q optimizer, which retrieves the relevant documents. The Aquila hash-q optimization pre-computes a hash table of the terms in a document collection and then uses this table to quickly identify the documents relevant to a given query. This can significantly improve retrieval speed, especially for large document collections, and can also improve the accuracy, efficiency, and scalability of document retrieval systems.
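
The idea of pre-computing a hash table of terms and consulting it at query time can be illustrated with a plain inverted index. The sketch below is not the Aquila hash-q optimizer itself, whose details the abstract does not give; it assumes a Python dictionary as the hash table, a tiny hypothetical stop-word list, and a simple term-overlap score, and it omits stemming and clustering. The names `tokenize`, `build_index`, and `search` are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list used for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "are"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_index(docs):
    """Pre-compute a hash table mapping each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain.

    Ties are broken by document id so the ordering is deterministic.
    """
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy collection (made-up data, not from NPL, LISA, or CACM).
docs = {
    "d1": "Clustering of documents for retrieval",
    "d2": "Hash tables speed up term lookup",
    "d3": "Document retrieval with hash tables",
}
index = build_index(docs)
results = search(index, "hash retrieval")
```

Because each query term is resolved with a single dictionary lookup, the cost of a query depends on the number of query terms and matching documents rather than on the size of the whole collection, which is the speed benefit the abstract describes.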
The effectiveness of the Hierarchical Density-Based Clustering Aquila Optimization approach is evaluated on the NPL, LISA, and CACM datasets in terms of precision@5 (0.497), precision@10 (0.425), and Mean Average Precision (MAP) (0.462), in comparison with various other methods. The results indicate that the proposed Aquila hash-q optimizer retrieves documents for a user query in a time-efficient manner and without collisions.
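
For readers unfamiliar with the reported metrics, the following sketch shows the standard textbook definitions of precision@k and Mean Average Precision. It is an assumption that the paper uses these exact conventions (for example, dividing by k even when fewer than k documents are returned); the abstract does not say. The function names and the toy rankings are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of average_precision over (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two made-up queries with made-up relevance judgments.
runs = [
    (["d3", "d1", "d2", "d5", "d4"], {"d3", "d2"}),
    (["d1", "d2"], {"d2"}),
]
p_at_5 = precision_at_k(runs[0][0], runs[0][1], 5)
map_score = mean_average_precision(runs)
```

In this toy example the first query has relevant documents at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2, and the second has one relevant document at rank 2, giving 1/2.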

Full Text