Abstract

Text classification has many applications in text processing and information retrieval. Instance-based learning (IBL) is among the top-performing text classification methods. However, its effectiveness depends on the distance function it uses to determine similar documents. In this study, we evaluate the performance of some popular distance measures and propose new ones that exploit word frequencies and the ordinal relationship between them. In particular, we propose new distance measures that are based on the value difference metric (VDM) and the inverted specific-class distance measure (ISCDM). The proposed measures are suitable for documents represented as vectors of word frequencies. We compare the performance of these measures with that of their original counterparts and with powerful Naïve Bayesian-based text classification algorithms. We evaluate the proposed distance measures using the kNN algorithm on 18 benchmark text classification datasets. Our empirical results reveal that distance metrics designed for nominal values yield better text classification results than the Euclidean distance measure designed for numeric values. Furthermore, our results indicate that ISCDM substantially outperforms VDM and that it is also more amenable than VDM to exploiting the ordinal nature of term frequencies. Thus, we were able to propose more ISCDM-based distance measures for text classification than VDM-based measures. The Naïve Bayesian-based text classifiers in our comparison are multinomial Naïve Bayes (MNB), complement Naïve Bayes (CNB), and the one-versus-all-but-one (OVA) model. When kNN uses some of the proposed measures, it outperforms these NB-based text classifiers on most datasets.

Highlights

  • Text classification can be defined as the task of assigning a document to a category such as art, sport, or politics. The proliferation of online documents by the minute has made automatic text classification an essential component of many online systems

  • As the augmented VDM (AVDM) is applied to a larger space, it increases the computational cost of the value difference metric (VDM) [17], which may hinder its use in text classification, where documents are typically represented using a large number of features

  • Comparing the first quartile values of the inverted specific-class distance measure (ISCDM) and Euclidean VDM (EVDM) shows that 75% of the classification accuracies of the ISCDM and the EVDM for all datasets are higher than 82.31% and 75.70%, respectively

Summary

Introduction

Text classification can be defined as the task of assigning a document to a category such as art, sport, or politics. The proliferation of online documents by the minute has made automatic text classification an essential component of many online systems.

Related Work

This section discusses the distance measures and powerful Bayesian-based text classification methods that we use in our empirical comparisons. As the augmented VDM (AVDM) is applied to a larger space, it increases the computational cost of the VDM [17], which may hinder its use in text classification, where documents are typically represented using a large number of features. A decision tree produced using a distance-based attribute measure is used to determine the neighborhood of a query instance. Therefore, the VDM would consider the two nominal values sweet and sour similar, but this could be misleading when comparing a query instance with a sweet taste with a training instance of class lemon with a sour taste.
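To make the sweet/sour remark concrete, the following is a minimal sketch of the VDM in its commonly cited simplified form, where the distance between two nominal values a and b is the sum over classes c of |P(c|a) - P(c|b)|^q. The function names `fit_vdm` and `vdm`, the exponent parameter `q`, and the toy taste data are assumptions made for illustration only; they are not taken from the paper, and the paper's own formulation may differ (for example, in weighting).

```python
from collections import Counter

def fit_vdm(values, labels):
    """Estimate P(class | value) for one nominal attribute (hypothetical helper)."""
    classes = sorted(set(labels))
    value_counts = Counter(values)
    joint_counts = Counter(zip(values, labels))
    # For each attribute value, store its class-conditional probabilities
    # in a fixed class order so that two values can be compared position-wise.
    return {
        v: [joint_counts[(v, c)] / n for c in classes]
        for v, n in value_counts.items()
    }

def vdm(cond_probs, a, b, q=1):
    """Simplified VDM between nominal values a and b: sum_c |P(c|a) - P(c|b)|^q."""
    return sum(abs(pa - pb) ** q for pa, pb in zip(cond_probs[a], cond_probs[b]))

# Toy data, invented for illustration: one nominal attribute (taste) and a class (fruit).
tastes = ["sweet", "sweet", "sour", "sour", "bitter", "bitter"]
fruits = ["orange", "lemon", "orange", "lemon", "grapefruit", "grapefruit"]

probs = fit_vdm(tastes, fruits)
print(vdm(probs, "sweet", "sour"))    # identical class distributions in this toy data
print(vdm(probs, "sweet", "bitter"))  # disjoint class distributions in this toy data
```

In this toy data, sweet and sour co-occur with the same classes in the same proportions, so the sketch assigns them a distance of zero even though the values themselves differ. That is the kind of behavior the passage above describes as potentially misleading.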

New VDM-Based and ISCDM-Based Distance Measures for Text Classification
Experimental Results
Conclusion
