Abstract

The rapid growth of Web content in different languages increases the demand for Cross-Lingual Information Retrieval (CLIR). Retrieval accuracy suffers from problems such as ambiguity and query drift. Query Expansion (QE) offers a reliable way to obtain documents better suited to user queries. In this paper, we propose an architecture for a Hindi–English CLIR system that uses QE to improve the relevance of retrieved results. Within this architecture, we propose a location-based algorithm that adds expansion term(s) at appropriate position(s) in the query, thereby resolving the query-drift issue in QE. User queries in Hindi are translated into the document language (i.e. English), and translation accuracy is improved using back-translation. A Google search is then performed, and the retrieved documents are ranked with Okapi BM25 in order of decreasing relevance so that the most suitable terms can be selected for QE. We use the term selection value (TSV) for QE, and to retrieve candidate terms we created three test collections: (i) the description and narration fields of the Forum for Information Retrieval Evaluation (FIRE) dataset, (ii) snippets of the documents retrieved for each query and (iii) nearest-neighborhood (NN) words of each query word among the ranked documents. To evaluate the system, 50 Hindi queries were selected from the FIRE-2012 dataset. We performed two experiments: (i) measuring the impact of the proposed location-based algorithm on the proposed CLIR architecture; and (ii) analysing QE over the three test collections, i.e. FIRE, NN and Snippets. In the first experiment, the results show that the relevance of Hindi–English CLIR improves when QE is performed with the location-based algorithm, achieving a 12% improvement over QE without it. In the second experiment, the location-based algorithm is applied to all three test collections.
The Mean Average Precision (MAP) values of the retrieved documents after QE are 0.5379 (NN), 0.6018 (FIRE) and 0.6406 (Snippets), whereas the MAP before QE is 0.37102. This shows a significant improvement in the retrieved results for all three test collections. Among them, QE is most effective with the Snippets collection, which outperforms the FIRE and NN collections by 6.48% and 19.12%, respectively.
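The document-ranking step described above relies on Okapi BM25 scoring. As a minimal illustrative sketch only (not the authors' implementation; the function name, whitespace tokenization and the common defaults k1 = 1.5, b = 0.75 are assumptions), the ranking could look like:

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents by Okapi BM25 score against the query terms.

    Assumes lowercase whitespace tokenization; k1 and b are the
    commonly used default parameters, not values from the paper.
    """
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in tokenized if t in d) for t in query_terms}
    scores = []
    for i, d in enumerate(tokenized):
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            denom = tf[t] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1.0) / denom
        scores.append((i, score))
    # Highest-scoring (most relevant) documents first.
    return sorted(scores, key=lambda x: x[1], reverse=True)
```

Documents ranked this way can then supply the top terms for expansion, e.g. `bm25_rank(["flood", "relief"], retrieved_docs)` places the documents containing both query terms ahead of those containing neither.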
