Information Retrieval for Gujarati Language Using Cosine Similarity Based Vector Space Model

Rajnish M Rakholia, Jatinderkumar R Saini

doi:10.1007/978-981-10-3156-4_1

Abstract

Retrieving the documents most relevant to a user query from the web is a crucial task for Information Retrieval (IR) systems in resource-poor languages. This paper presents a cosine-similarity-based Vector Space Document Model (VSDM) for information retrieval in the Gujarati language. VSDM is widely used in information retrieval and document classification; each document is represented as a vector, and each dimension corresponds to a separate term. The relevance of each document to the user query is measured by cosine similarity in the vector space, where the set of documents is treated as a set of vectors. The present work treats the user query as free-order text, i.e., the word sequence does not affect the results of the IR system. Technically, this is a Natural Language Processing (NLP) application in which stop-word removal, Term Frequency (TF) calculation, Normalized Term Frequency (NF) calculation, and Inverse Document Frequency (IDF) calculation were performed for 1360 files in text and PDF formats; precision and recall of 78 % and 86 %, respectively, were recorded. To the best of our knowledge, this is the first IR task in the Gujarati language using cosine-similarity-based calculations.
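The pipeline named in the abstract (stop-word removal, normalized TF, IDF, cosine similarity) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the choices here are assumptions: whitespace tokenization, NF as count divided by document length, and IDF as 1 + ln(N / df). The stop-word list, the sample documents, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption); a real system would use a
# curated Gujarati stop-word list.
STOP_WORDS = {"છે", "અને", "આ"}


def tokenize(text):
    """Split on whitespace and drop stop-words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def normalized_tf(tokens):
    """Normalized term frequency: term count divided by document length."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


def inverse_document_frequency(docs_tokens):
    """Assumed IDF variant: 1 + ln(N / df) for each term in the collection."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: 1.0 + math.log(n_docs / df) for term, df in doc_freq.items()}


def weight_vector(tokens, idf):
    """Sparse NF * IDF vector; terms unseen in the collection are ignored."""
    return {t: nf * idf[t] for t, nf in normalized_tf(tokens).items() if t in idf}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (document index, score) pairs sorted by decreasing similarity."""
    docs_tokens = [tokenize(doc) for doc in documents]
    idf = inverse_document_frequency(docs_tokens)
    query_vec = weight_vector(tokenize(query), idf)
    scores = [
        (i, cosine_similarity(weight_vector(tokens, idf), query_vec))
        for i, tokens in enumerate(docs_tokens)
    ]
    return sorted(scores, key=lambda pair: -pair[1])


# Hypothetical three-document collection for illustration only.
documents = [
    "ગુજરાત ભારત નું રાજ્ય છે",
    "ક્રિકેટ એક રમત છે",
    "ગુજરાત માં ક્રિકેટ રમત લોકપ્રિય છે",
]
ranking = rank("ક્રિકેટ રમત", documents)
```

Because the query is reduced to a bag of terms before weighting, `rank("રમત ક્રિકેટ", documents)` yields the same ordering as the call above, which matches the free-order treatment of queries described in the abstract.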

Full Text