Abstract
Feature selection (FS) is a fundamental task in text classification. Text feature selection aims to represent documents using only the most relevant features. This process reduces the size of datasets and improves the performance of machine learning algorithms. Many researchers have focused on developing efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated on single machines. As the dimensionality of textual data grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach to feature selection based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores the frequency of terms when selecting features. The proposal introduces a distributed FS method, namely, Maximum Term Frequency-Mutual Information (MTF-MI), which combines term frequency and mutual information to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naïve Bayes classifier on three datasets. The results reveal that MTF-MI improves classification results compared with four state-of-the-art methods in terms of macro-F1 and micro-F1 measures.
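For context, the classical MI criterion on which MTF-MI builds is commonly written with document-level contingency counts (a standard formulation; the paper's exact notation may differ):

\[
\mathrm{MI}(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)} \approx \log \frac{A \times N}{(A + B)(A + C)}
\]

where A is the number of documents of category c containing term t, B the number of documents containing t outside c, C the number of documents of c not containing t, and N the total number of documents. Since all four quantities are document counts, how often t occurs inside a document never enters the score, which is precisely the drawback that weighting MI with term frequency is meant to address.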
Highlights
Feature selection (FS) plays a key role in data mining [1], especially in text classification tasks, which suffer from high dimensionality [2] in many application domains such as sentiment analysis [3], emotion identification [4, 5], and spam filtering [6].
Results and Discussion. The macro-F1 and micro-F1 performances of MTF-MI are compared to those of four widely used feature selection techniques using the Naïve Bayes classifier applied on three datasets (20 Newsgroups, Reuters-21578, and WebKB). The four feature selection techniques used for comparison are the classical mutual information (MI), Chi-square (CHI), Term Frequency-Inverse Document Frequency (TF-IDF), and Information Gain (IG).
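As a purely illustrative, single-machine sketch of this evaluation protocol (not the authors' distributed code), the snippet below trains a multinomial Naïve Bayes classifier with scikit-learn and reports macro-F1 and micro-F1 on 20 Newsgroups; the max_features cap is a hypothetical stand-in for the vocabulary that a selection method such as MTF-MI would retain.

# Illustrative evaluation sketch: multinomial Naive Bayes on a
# restricted vocabulary, scored with macro-F1 and micro-F1.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Placeholder for the feature selection step: keep the 2,000 most
# frequent terms. A real comparison would instead keep the top-ranked
# terms under MI, CHI, TF-IDF, IG, or MTF-MI.
vectorizer = CountVectorizer(max_features=2000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = MultinomialNB().fit(X_train, train.target)
pred = clf.predict(X_test)

print("macro-F1:", f1_score(test.target, pred, average="macro"))
print("micro-F1:", f1_score(test.target, pred, average="micro"))

Swapping the placeholder vocabulary for the term rankings produced by each method is what yields the comparison summarized above.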
Conclusion. This paper introduces Maximum Term Frequency-Mutual Information (MTF-MI), a distributed feature selection approach designed upon the MapReduce programming model. The proposed approach, based on the mutual information method, has been implemented using Apache Hadoop and applied to three different large datasets. The performance of the classification models generated by MTF-MI has been systematically evaluated using the Naïve Bayes classifier, implemented in the Hadoop framework, over a cluster of five computers. The experimental study shows that MTF-MI efficiently improves the selection of relevant features while discarding irrelevant ones. The proposed approach achieves the best average F-measure compared to four state-of-the-art methods, namely, CHI, TF-IDF, MI, and IG.
Summary
An algorithm is proposed to calculate the MI value that evaluates the relationship between features and categories under the Hadoop framework. Building on it, the approach uses the term frequency (TF) and the term distribution to score each feature ti in category ck with the MTF-MI criterion; a minimal sketch of the distributed counting step this requires is given below.
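To make the distributed step concrete, here is a minimal sketch, assuming Hadoop Streaming via the mrjob library and a tab-separated "<category>\t<document text>" input format (both assumptions for illustration; the paper implements the job in native Hadoop MapReduce). It aggregates, for each (term, category) pair, the number of documents of that category containing the term, i.e., the A cell of the MI contingency table.

# Hypothetical MapReduce sketch (mrjob): count, per (term, category)
# pair, the documents of that category containing the term.
from mrjob.job import MRJob

class TermCategoryCounts(MRJob):
    def mapper(self, _, line):
        # Assumed input format: "<category>\t<document text>"
        category, _sep, text = line.partition("\t")
        for term in set(text.lower().split()):  # document frequency, not raw counts
            yield (term, category), 1

    def reducer(self, term_category, counts):
        # A follow-up job (not shown) would join these counts with the
        # per-term, per-category, and corpus totals to evaluate
        # MI(t, c) = log(A * N / ((A + B) * (A + C))) and the TF weight.
        yield term_category, sum(counts)

if __name__ == "__main__":
    TermCategoryCounts.run()

The same script runs locally for testing (python term_category_counts.py corpus.tsv) or on a Hadoop cluster via mrjob's -r hadoop option.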