Abstract

Cross-Language Information Retrieval (CLIR) is the process of retrieving relevant documents when the language of the query differs from the language of the retrieved documents. CLIR systems thus let users search and access documents written in a language other than that of the search query. Based on the languages of the query and the documents, CLIR systems are divided into monolingual, bilingual, and multilingual CLIR. The first step of a CLIR system is text pre-processing, which converts the given text documents into useful representations. Pre-processing is the set of tasks that convert text documents into a format suitable for higher-level text applications. It can reduce computational cost, noisy data, and irrelevant information in the given text documents. This paper discusses in detail the pre-processing techniques of dataset creation, tokenization, noise removal, stop word removal, stemming, lemmatization, and term weighting, applied to a two-language dataset (Tamil and Malayalam) collected manually from the BBC online website. Finally, the study investigates feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF). These techniques will help in designing and modeling high-performance CLIR systems.
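
As a rough illustration of the pipeline summarized above, the following minimal sketch chains tokenization, stop word removal, and TF-IDF term weighting. It is not the paper's implementation: the function names, the stop word list, and the English toy documents are hypothetical placeholders, stemming and lemmatization are omitted, and the unsmoothed weighting tf × log(N / df) is one common variant chosen here as an assumption. Tokens are split on whitespace rather than with a `\w+` pattern, because Python's `\w` does not match the combining vowel signs used in Tamil and Malayalam and would break words apart.

```python
import math
import string
from collections import Counter

# Hypothetical stop word list; a real Tamil or Malayalam list would replace it.
STOP_WORDS = {"the", "a", "of", "in"}


def tokenize(text):
    """Split on whitespace and strip surrounding ASCII punctuation."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def preprocess(text):
    """Tokenize, then drop stop words."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), with df the number of documents containing the term
    (an assumed, unsmoothed variant of TF-IDF).
    """
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Toy placeholder documents, not drawn from the paper's dataset.
docs = [
    "The news of the election.",
    "Election results in the city.",
    "A cricket match in the city.",
]
weights = tf_idf(docs)
```

In this toy corpus, a term that occurs in only one document (such as "news") receives a higher weight than a term shared by two documents (such as "election"), which is the behavior term weighting is meant to capture.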
