Abstract

Feature selection (FS) is a fundamental task for text classification problems. Text feature selection aims to represent documents using the most relevant features. This process can reduce the size of datasets and improve the performance of machine learning algorithms. Many researchers have focused on developing efficient FS techniques. However, most of the proposed approaches are evaluated on small datasets and validated on a single machine. As the dimensionality of textual data grows, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach for feature selection based on the mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores the frequency of terms when selecting features. The proposal introduces a distributed FS method, namely Maximum Term Frequency-Mutual Information (MTF-MI), which combines term frequency and mutual information to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naïve Bayes classifier on three datasets. The results show that the proposed MTF-MI method improves classification results compared with four state-of-the-art methods in terms of macro-F1 and micro-F1 measures.

Highlights

  • Feature selection (FS) plays a key role in data mining [1], especially in text classification tasks, which suffer from high dimensionality [2] in many application domains such as sentiment analysis [3], emotion identification [4, 5], and spam filtering [6].

  • Results and Discussion. The macro-F1 and micro-F1 performance of MTF-MI is compared with four widely used feature selection techniques using the Naïve Bayes classifier on three datasets (20 Newsgroups, Reuters-21578, and WebKB). The four feature selection techniques used for comparison are classical MI, Chi-square (CHI), Term Frequency-Inverse Document Frequency (TF-IDF), and Information Gain (IG).

  • Conclusion. This paper introduces Maximum Term Frequency-Mutual Information (MTF-MI), a distributed feature selection approach designed on the MapReduce programming model. The proposed approach, based on the mutual information method, has been implemented using Apache Hadoop and applied to three large datasets. The classification models produced with MTF-MI were systematically evaluated using a Naïve Bayes classifier, implemented in the Hadoop framework, on a cluster of five computers. The experimental study shows that MTF-MI improves the selection of relevant features while discarding irrelevant ones. On average F-measure, the proposed approach outperforms four state-of-the-art methods, namely CHI, TF-IDF, MI, and IG.
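The macro-F1 and micro-F1 measures mentioned above weight categories differently: macro-F1 averages the per-category F1 scores, while micro-F1 pools the counts of all categories before computing a single F1. The following is a minimal sketch of that difference; the category names and the true-positive, false-positive, and false-negative counts are invented for illustration and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class):
    """Average of per-category F1 scores: every category counts equally."""
    return sum(f1(*c) for c in per_class.values()) / len(per_class)


def micro_f1(per_class):
    """F1 over pooled counts: large categories dominate the score."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts for one large and one small category.
counts = {"large": (90, 10, 10), "small": (1, 4, 4)}

print(f"macro-F1 = {macro_f1(counts):.3f}")
print(f"micro-F1 = {micro_f1(counts):.3f}")
```

In this toy case the poorly classified small category pulls macro-F1 down much more than micro-F1, which is why both measures are usually reported together on imbalanced text collections.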

Summary

Introduction

Feature selection (FS) plays a key role in data mining [1], especially in text classification tasks, which suffer from high dimensionality [2] in many application domains such as sentiment analysis [3], emotion identification [4, 5], and spam filtering [6]. (3) An algorithm is proposed to calculate the MI value, which evaluates the relationship between features and categories, under the Hadoop framework. We introduce our approach, based on the term frequency (TF) and the term distribution, to evaluate the feature t_i in category c_k as follows: MTF − MI
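The MTF-MI formula itself is not reproduced in this summary, so the sketch below does not attempt it. As background, it illustrates only the classical document-count form of MI between a term and a category, MI(t, c) = log(A · N / ((A + B)(A + C))), where A is the number of documents in c that contain t, A + B the number of documents that contain t, A + C the number of documents in c, and N the total number of documents. The counting is written in a map/reduce style in plain Python to mirror the MapReduce model; the function names and the toy corpus are assumptions made for illustration, not the authors' Hadoop implementation.

```python
import math
from collections import Counter


def map_document(category, tokens):
    """Map step: emit ((term, category), 1) once per distinct term in a document."""
    for term in set(tokens):
        yield (term, category), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def mutual_information(term, category, term_cat_df, cat_docs, n_docs):
    """Classical MI from document counts; -inf if the term never occurs in the category."""
    a = term_cat_df.get((term, category), 0)                            # A
    term_df = sum(v for (t, _), v in term_cat_df.items() if t == term)  # A + B
    if a == 0:
        return float("-inf")
    return math.log((a * n_docs) / (term_df * cat_docs[category]))


# Hypothetical toy corpus: (category, tokens) pairs.
corpus = [
    ("sport", ["ball", "goal", "team"]),
    ("sport", ["ball", "team"]),
    ("tech", ["cpu", "team"]),
    ("tech", ["cpu", "code"]),
]

emitted = (pair for category, tokens in corpus
           for pair in map_document(category, tokens))
term_cat_df = reduce_counts(emitted)
cat_docs = Counter(category for category, _ in corpus)
n_docs = len(corpus)

for term in ("ball", "team"):
    score = mutual_information(term, "sport", term_cat_df, cat_docs, n_docs)
    print(term, round(score, 4))
```

Because this score depends only on whether a term occurs in a document, a term that appears once and a term that appears many times in the same documents receive the same value, which is the limitation of MI that the abstract points to.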

Results
Conclusion
