A Text Classification Approach using Parallel Naive Bayes in Big Data Context

Houda Amazal, Mohamed Kissi, Mohammed Ramdani

doi:10.1145/3289402.3289536

Abstract

Text classification has attracted researchers for many years, and several approaches have been developed to improve its performance. In recent decades, however, technological evolution has made textual data on the web increasingly abundant. Classical classification methods are unable to process this volume of data and consequently cannot produce satisfactory results. New approaches have therefore been explored: to cope with the high dimensionality of the data, it is necessary to reduce the size of the document feature set and to use parallel processing. To this end, we developed a parallel Term Frequency-Inverse Document Frequency (TF-IDF) model that keeps only the most relevant words in each document. We then feed the reduced dataset to a parallel Naive Bayes classifier. Both the parallel TF-IDF model and the parallel Naive Bayes classifier were implemented on Hadoop using the MapReduce architecture. The experimental results demonstrate that the proposed method improves classification accuracy.
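The two-stage pipeline described above (TF-IDF feature reduction followed by Naive Bayes, each expressed as map and reduce steps) can be illustrated with a minimal single-machine sketch. This is not the authors' Hadoop implementation: the `map_reduce` helper, the function names, the per-document top-`k` selection rule, the Laplace smoothing, and the toy documents are all assumptions made for illustration, and the "jobs" run in memory rather than on a cluster.

```python
import math
from collections import Counter, defaultdict


def map_reduce(records, mapper, reducer):
    """In-memory stand-in for a MapReduce job (hypothetical helper, no Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)       # shuffle phase: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def top_tfidf_terms(docs, k):
    """Keep only the k highest-scoring TF-IDF terms of each document (assumed rule)."""
    # Job: document frequency of every term (each document emits a term once).
    df = map_reduce(docs,
                    lambda d: [(w, 1) for w in set(d["words"])],
                    lambda _, ones: sum(ones))
    n_docs = len(docs)
    reduced = []
    for d in docs:
        tf = Counter(d["words"])
        score = {w: (c / len(d["words"])) * math.log(n_docs / df[w])
                 for w, c in tf.items()}
        keep = set(sorted(score, key=score.get, reverse=True)[:k])
        reduced.append({"label": d["label"],
                        "words": [w for w in d["words"] if w in keep]})
    return reduced


def train_naive_bayes(docs):
    """Training reduces to two counting jobs: (label, word) pairs and labels."""
    word_counts = map_reduce(docs,
                             lambda d: [((d["label"], w), 1) for w in d["words"]],
                             lambda _, ones: sum(ones))
    label_counts = map_reduce(docs,
                              lambda d: [(d["label"], 1)],
                              lambda _, ones: sum(ones))
    vocab = {w for (_, w) in word_counts}
    return word_counts, label_counts, vocab


def predict(model, words):
    """Multinomial Naive Bayes with Laplace smoothing; unknown words are skipped."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        total_words = sum(c for (l, _), c in word_counts.items() if l == label)
        score = math.log(n / total_docs)    # log prior of the class
        for w in words:
            if w in vocab:
                count = word_counts.get((label, w), 0)
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy corpus (invented for this sketch, not from the paper's experiments).
docs = [
    {"label": "sport", "words": "the team won the match".split()},
    {"label": "sport", "words": "the player scored a goal".split()},
    {"label": "tech", "words": "the server runs the cluster".split()},
    {"label": "tech", "words": "a cluster stores the data".split()},
]
reduced = top_tfidf_terms(docs, k=3)
model = train_naive_bayes(reduced)
print(predict(model, "the team scored".split()))
```

In this sketch a word that occurs in every document, such as "the", receives a TF-IDF score of zero and is dropped before training, which is the intended effect of the feature-reduction stage. Because both stages only emit and sum counts per key, they map naturally onto independent mappers and reducers.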

Full Text