Term frequency with average term occurrences for textual information retrieval

O Ali Sadek Ibrahim, D Landa-Silva

doi:10.1007/s00500-015-1935-7

Journal: Soft Computing	Publication Date: Nov 28, 2015
Citations: 25	License type: cc-by

Affiliation: University of Nottingham

Abstract

In the context of information retrieval (IR) from text documents, the term weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model. In this paper, we propose a new TWS that is based on computing the average term occurrences of terms in documents and that also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, as achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring previous knowledge about relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-words removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-words removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that using the proposed discriminative approach is beneficial for improving IR effectiveness and performance with no information on the relevance judgement for the collection.
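
The abstract describes TF-ATO only in words, so the following Python sketch is one possible reading rather than the authors' exact formulation. It assumes that a term's weight is its raw frequency divided by the document's average term occurrence (total term occurrences over the number of distinct terms), that the centroid is the per-term mean weight across all documents, and that the discriminative approach drops any weight that does not exceed its centroid component. The function name `tf_ato_weights` and its parameters are hypothetical.

```python
from collections import Counter


def tf_ato_weights(docs, use_discriminative=True):
    """Sketch of TF-ATO weighting under the assumptions stated above.

    docs: list of token lists, one per document.
    Returns a list of {term: weight} dicts, one per document.
    """
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        if not tf:
            weights.append({})
            continue
        # Assumed definition: average term occurrence of a document is
        # total occurrences divided by the number of distinct terms.
        ato = sum(tf.values()) / len(tf)
        weights.append({term: count / ato for term, count in tf.items()})

    if not use_discriminative:
        return weights

    # Assumed centroid: mean weight of each term over all documents,
    # counting a weight of zero where the term does not occur.
    n_docs = len(docs)
    centroid = Counter()
    for doc_weights in weights:
        for term, value in doc_weights.items():
            centroid[term] += value / n_docs

    # Assumed discriminative rule: keep only weights above the centroid.
    return [
        {term: value for term, value in doc_weights.items()
         if value > centroid[term]}
        for doc_weights in weights
    ]


docs = [["a", "a", "b"], ["a", "c"]]
plain = tf_ato_weights(docs, use_discriminative=False)
filtered = tf_ato_weights(docs)
```

In this toy input, the term `a` occurs in both documents, so its weight in the second document falls below the centroid and is removed, while terms unique to one document are kept. No relevance judgements are needed at any step, which is the property the abstract highlights.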

Highlights

The term-weighting scheme (TWS) is a key component of an information retrieval (IR) system that uses the vector space model (VSM)
The purpose of the first experiment was to compare the average recall precision values achieved by the proposed Term Frequency With Average Term Occurrence (TF-ATO), with and without the discriminative approach, to the ones achieved by term frequency-inverse document frequency (TF-IDF)
The purpose of the second experiment was to compare the average recall precision values achieved by the proposed TF-ATO with the discriminative approach to the ones achieved by TF-IDF but considering the document collection as dynamic

Summary

Introduction

The term-weighting scheme (TWS) is a key component of an information retrieval (IR) system that uses the vector space model (VSM). TWSs evolved with Genetic Programming (GP), as in (Cummins, 2008; Cordan et al, 2003), are based on the characteristics of the test collections and do not generalize to collections with different characteristics. These proposed evolutionary computation (EC) techniques also assume that document collections are static rather than dynamic. We argue that there is a need for heuristic methods that adapt term weights with little computational cost and a pre-determined procedure, in order to achieve better IR system effectiveness and performance even when dealing with dynamic document collections. This is what motivates the work presented in this paper on the development of such a TWS.

General Information Retrieval Approach

IR Architecture

IR Models

IR System Evaluation

A New Term-Weighting Scheme

Related Work on TWS

Limitation of Evolved TWS and Term Weights

TF-ATO TWS

Experimental Results and Analysis

Stop-words Removal and DA Case Studies

Related Work on Stop-word Lists

General Stoplists

Collection-Based Stoplists

Evolving Stoplists

Conclusion and Future Work

Similar Papers

Evaluation of Average Term Occurrences Weighting Technique for Arabic Textual Information Retrieval
Belal Mustafa Abuata ... Lama Ali Al Omari
International Journal on Advanced Science, Engineering and Information Technology | VOL. 12
11 Dec 2022

Analysis of Text Classification with various Term Weighting Schemes in Vector Space Model
Shitanshu Jain ... Dr Santosh K Vishwakarma
International Journal of Innovative Technology and Exploring Engineering | VOL. 9
30 Aug 2020

TF-TDA: A Novel Supervised Term Weighting Scheme for Sentiment Analysis
Arwa Alshehri ... Abdulmohsen Algarni
Electronics | VOL. 12
30 Mar 2023

Meta-scoring
Rong Jin ... Alex G Hauptmann
01 Sep 2001
