Abstract

Obtaining meaningful information from data has become a central problem, and data mining techniques have gained importance as a result. Text classification is one of the most commonly studied areas of data mining. Its main difficulty is that large data sizes increase the required time and reduce classification success. The main purpose of this study is to determine the right feature selection methods for text classification. Frequently used feature selection metrics such as Chi-square and Information Gain were applied to different data sets and their performance was measured. In addition, two filter-based feature selection metrics are proposed as alternatives to the current ones. The first is the Relevance Frequency Feature Selection metric, obtained by adding new parameters to the Relevance Frequency method used for term weighting in text classification. The second is the Alternative Accuracy2 metric, obtained by changing the parameters of the Accuracy2 metric. The proposed Relevance Frequency Feature Selection and Alternative Accuracy2 metrics were observed to give results as successful as those of the frequently used current metrics.

Highlights

  • The internet becomes more common as the days pass and in the meantime, smartphone and tablet use increases

  • The first recommended metric is the Relevance Frequency Feature Selection metric, obtained by adding new parameters to the Relevance Frequency method used for term weighting in text classification

  • All metrics except Relevance Frequency (RF) generated a potential feature related to the acq category

Summary

Introduction

The internet becomes more widespread every day, and smartphone and tablet use increases along with it. This growth increases the amount of data created and stored in text form, such as e-books, emails, and content on Facebook and Twitter. Among text classification studies, the most important is the rule-based expert text classification system developed by Carnegie Group on the Reuters data set [4]. As hardware components such as memory and CPUs have become more advanced and cheaper, machine-learning algorithms have come into wider use and have been applied to text classification problems. The main problem in text classification is the excessive size of the data. It is therefore important to select terms with high discriminative power rather than using all terms.
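As an illustration of filter-based feature selection, the sketch below scores terms for a single category from a 2x2 contingency table and keeps the top-scoring ones. It uses the standard textbook forms of Chi-square and Accuracy2 (the absolute difference between true-positive and false-positive rates), not the paper's proposed variants, whose added parameters are not given in this summary. The function names, the `select_top_k` helper, and the toy counts are assumptions made for illustration only.

```python
def chi_square(tp, fp, fn, tn):
    """Chi-square score of a term for one category.

    tp: in-category documents that contain the term
    fp: out-of-category documents that contain the term
    fn: in-category documents that lack the term
    tn: out-of-category documents that lack the term
    """
    n = tp + fp + fn + tn
    denom = (tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (tp * tn - fp * fn) ** 2 / denom


def acc2(tp, fp, fn, tn):
    """Accuracy2 score: |true-positive rate - false-positive rate|."""
    pos = tp + fn
    neg = fp + tn
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return abs(tpr - fpr)


def select_top_k(term_counts, k, metric):
    """Return the k terms with the highest metric score (hypothetical helper)."""
    scores = {term: metric(*counts) for term, counts in term_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy counts (tp, fp, fn, tn) for three made-up terms; not data from the paper.
toy_counts = {
    "merger": (10, 0, 0, 10),
    "said": (5, 5, 5, 5),
    "shares": (8, 2, 2, 8),
}
selected = select_top_k(toy_counts, 2, acc2)
```

A term that appears equally often inside and outside the category, like the toy term "said" here, scores zero under both metrics and is dropped first, which is the behavior a filter method relies on to shrink the feature space before classification.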

Contribution and motivation
Organization
Related works
Document frequency thresholding metric
Chi-Squared metric
Information gain
Acc and Acc2 metrics
Proposed metrics
Used data sets
Reuters data set
Experimental settings
Experimental results
Comparison of features obtained via metrics
Classification successes of metrics
Conclusion and future works