Improved Distance Functions for Instance-Based Text Classification.

Khalil El Hindi, Bayan Abu Shawar, Reem Aljulaidan, Hussien Alsalamn, Bruce J. MacLennan

doi:10.1155/2020/4717984

Abstract

Text classification has many applications in text processing and information retrieval. Instance-based learning (IBL) is among the top-performing text classification methods. However, its effectiveness depends on the distance function it uses to determine similar documents. In this study, we evaluate the performance of some popular distance measures and propose new ones that exploit word frequencies and the ordinal relationship between them. In particular, we propose new distance measures that are based on the value difference metric (VDM) and the inverted specific-class distance measure (ISCDM). The proposed measures are suitable for documents represented as vectors of word frequencies. We compare the performance of these measures with that of their original counterparts and with powerful Naïve Bayesian-based text classification algorithms. We evaluate the proposed distance measures using the kNN algorithm on 18 benchmark text classification datasets. Our empirical results reveal that distance metrics for nominal values yield better text classification results than the Euclidean distance measure for numeric values. Furthermore, our results indicate that ISCDM substantially outperforms VDM and is also more amenable than VDM to exploiting the ordinal nature of term frequencies. Thus, we were able to propose more ISCDM-based distance measures for text classification than VDM-based measures. We also compare the proposed distance measures with Naïve Bayesian-based text classifiers, namely, multinomial Naïve Bayes (MNB), complement Naïve Bayes (CNB), and the one-versus-all-but-one (OVA) model. When kNN uses some of the proposed measures, it outperforms the NB-based text classifiers on most datasets.

Highlights

Text classification can be defined as the task of assigning a document to a category such as art, sport, or politics. The proliferation of online documents by the minute has made automatic text classification an essential component of many online systems
As the augmented VDM (AVDM) is applied to a larger space, it increases the computational cost of the value difference metric (VDM) [17], which may hinder its use in text classification, where documents are typically represented using a large number of features
Comparing the first quartile values of the inverted specific-class distance measure (ISCDM) and Euclidean VDM (EVDM) shows that 75% of the classification accuracies of the ISCDM and the EVDM for all datasets are higher than 82.31% and 75.70%, respectively

Summary

Introduction

Text classification can be defined as the task of assigning a document to a category such as art, sport, or politics. The proliferation of online documents by the minute has made automatic text classification an essential component of many online systems.

Related Work

This section discusses the distance measures and powerful Bayesian-based text classification methods that we use in our empirical comparisons. As the AVDM is applied to a larger space, it increases the computational cost of the VDM [17], which may hinder its use in text classification, where documents are typically represented using a large number of features. A decision tree produced using a distance-based attribute measure is used to determine the neighborhood of a query instance. Therefore, the VDM would consider the two nominal values sweet and sour similar, but this could be misleading when comparing a query instance with a sweet taste to a training instance of class lemon with a sour taste.
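The sweet/sour remark can be made concrete with a small sketch. It assumes the commonly cited simplified form of the VDM, in which the distance between two nominal values a and b is the sum over classes c of |P(c|a) - P(c|b)|^q; whether this is exactly the variant used in [17] is an assumption. The function names, the toy fruit data, and the choice q = 1 are illustrative and not taken from the paper.

```python
from collections import Counter, defaultdict


def class_conditionals(values, labels):
    """Estimate P(class | value) for one nominal attribute by counting."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    classes = sorted(set(labels))
    # A Counter returns 0 for classes never seen with a given value.
    return {
        value: {c: per_class[c] / sum(per_class.values()) for c in classes}
        for value, per_class in counts.items()
    }


def vdm(probs, a, b, q=1):
    """Simplified VDM: sum over classes of |P(c|a) - P(c|b)| ** q."""
    return sum(abs(probs[a][c] - probs[b][c]) ** q for c in probs[a])


# Hypothetical training data: one nominal attribute (taste) and a class label.
tastes = ["sweet"] * 4 + ["sour"] * 4 + ["bitter"] * 4
fruits = (
    ["apple", "apple", "orange", "orange"]    # sweet instances
    + ["apple", "apple", "orange", "lemon"]   # sour instances
    + ["grapefruit"] * 4                      # bitter instances
)

probs = class_conditionals(tastes, fruits)
print(vdm(probs, "sweet", "sour"))
print(vdm(probs, "sweet", "bitter"))
```

On this toy data, sweet and sour come out much closer (0.5) than sweet and bitter (2.0), because the VDM compares only the overall class distributions of the two values. It does not take into account the class of the particular training instance being compared, which is the limitation the lemon example points at.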

New VDM-Based and ISCDM-Based Distance Measures for Text Classification

Experimental Results

Conclusion

Journal: Computational Intelligence and Neuroscience
Publication Date: Nov 22, 2020
Citations: 2
License type: CC BY 4.0

Similar Papers

Nonlinear strict distance and similarity measures for intuitionistic fuzzy sets with applications to pattern classification and medical diagnosis
Xinxing Wu ... Miin-Shen Yang
Scientific Reports | VOL. 13
25 Aug 2023

Construction and generation of distance and similarity measures for intuitionistic fuzzy sets and various applications
Brindaban Gohain ... Rituparna Chutia
International Journal of Intelligent Systems | VOL. 36
19 Aug 2021

A new similarity measure for vector space models in text classification and information retrieval
Mete Eminagaoglu
Journal of Information Science | VOL. 48
27 Oct 2020

A distance measure between intuitionistic fuzzy belief functions
Yafei Song ... Hailin Zhang
Knowledge-Based Systems | VOL. 86
25 Jun 2015
