Improved TFIDF in big news retrieval: An empirical study

Chien-Hsing Chen

doi:10.1016/j.patrec.2016.11.004

Abstract

Thomson Reuters news articles are an integral data source that has given rise to several inspiring applications of text classification and clustering. The best-known term weighting approach, the term frequency-inverse document frequency (TFIDF) method, is often used to assign the term weights that support such applications. Thomson Reuters reports pertinent incoming news (e.g., the refugee crisis in Europe) over a given period of time, and the most prominent terms (e.g., "refugee") are thus found frequently across a large collection of news stories. When term weights are measured via the TFIDF method, the weights of these terms are heavily compromised once the collection of news is sufficiently large. The TFIDF approach is vulnerable to this bias because the most important terms are typically treated as noise and therefore receive lower weights, and news retrieval without the most important terms is difficult and ineffective. We thus present a new distance-based term weighting method that overcomes this bias by considering a basic characteristic of big news collections: each news article is either similar to or different from the others. Not all news articles should be considered to contribute equally to the weighting of a particular term. In this study, the weight of a particular term is assessed based on its distance in an article to other instances of the same term, and this weight is highly sensitive to whether similar articles cause a term to occur and to whether different articles cause a term to disappear. The most important terms are thus recovered in large news corpora by studying similarities between news stories. In addition, we create a two-stage learning algorithm to refine the term weights, and we develop an intelligent model that applies our term weighting method to Reuters news analyses based upon classification and clustering problems. The experimental results show that our methods perform better than TFIDF in news classification and clustering.
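The bias described in the abstract can be illustrated with a minimal sketch. It assumes the textbook TFIDF variant, tf × log(N / df), since the abstract does not say which variant the study uses. The function name `tfidf` and the toy corpus are hypothetical, and the code does not implement the paper's distance-based method.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: this is the textbook TFIDF variant, not necessarily the
    one used in the paper.
    """
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus: "refugee" is the key topic term in every story.
corpus = [
    "refugee crisis europe border",
    "refugee camps greece aid",
    "refugee policy germany vote",
]
weights = tfidf(corpus)
print(weights[0]["refugee"])  # 0.0, because log(3 / 3) = 0
print(weights[0]["border"])   # positive, because "border" is in one story only
```

In this toy corpus the term that best describes the stories receives a weight of zero because it appears in every document, while an incidental term receives a positive weight.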


Journal: Pattern Recognition Letters
Publication Date: Nov 9, 2016
Citations: 43

Similar Papers

Implementation of Chatbot Artificial Intelligence in a Company Website to Improve Customer Service Automatically Using the TF-IDF Method
Radhiah Hayati ... Suci Ramadani
Journal of Artificial Intelligence and Engineering Applications (JAIEA) | VOL. 4
15 Oct 2024

A Densely Connected GRU Neural Network Based on Coattention Mechanism for Chinese Rice-Related Question Similarity Matching
Haoriqin Wang ... Huarui Wu
Agronomy | VOL. 11
27 Jun 2021

Constructing genetic exchange communities among bacteria and archaea
Yingnan Cong
21 Oct 2016

SISTEM ANALISIS PENYAKIT MATA BERBASIS PHP DAN MYSQL MENGGUNAKAN METODE TF-IDF
Dicky Iskandar Sobari
Jurnal Teknologi Informasi dan Komunikasi | VOL. 15
30 Nov 2023
