A CURE Algorithm for Vietnamese Sentiment Classification in a Parallel Environment

Vo Ngoc Phu, Jack Max, Vo Thi Ngoc Tran

doi:10.3844/jcssp.2019.1355.1377

Abstract

Solutions for processing big data are essential and beneficial to numerous fields of research and commercial applications. This paper therefore proposes a new model for sentiment classification of big data sets in the Cloudera parallel network environment. Clustering Using Representatives (CURE), combined with Hadoop MAP (M) / REDUCE (R) in Cloudera, a parallel network system, was applied to a Vietnamese testing data set of 20,000 documents: 10,000 positive and 10,000 negative. On this data set, the model achieved a sentiment classification accuracy of 62.92%. Although our data set is small, the proposed model can process millions of Vietnamese documents, as well as data in other languages, and shortens the execution time in the distributed environment.

Highlights

Solutions for processing big data are essential and beneficial to numerous fields of research and applications.
The average time of sentiment classification with the Clustering Using Representatives (CURE) algorithm in the sequential environment is 21,600 seconds per 20,000 documents. This is longer than the average time of sentiment classification with the CURE Algorithm (CA) in the Cloudera parallel network environment with three nodes, which is 7,198 seconds per 20,000 documents.
The average execution time of sentiment classification with the CA in the Cloudera parallel network environment is shorter with two nodes than with three nodes.
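
The timing figures in the highlights can be turned into a speedup and a per-document cost with simple arithmetic. The short Python sketch below does only that; the function names are illustrative and not taken from the paper, and the only inputs are the two totals reported above (21,600 seconds sequential, 7,198 seconds on three nodes, each for 20,000 documents).

```python
# Totals reported in the highlights (seconds per 20,000 documents).
SEQUENTIAL_SECONDS = 21_600
THREE_NODE_SECONDS = 7_198
DOCUMENTS = 20_000


def speedup(baseline_seconds, parallel_seconds):
    """Ratio of the baseline run time to the parallel run time."""
    return baseline_seconds / parallel_seconds


def seconds_per_document(total_seconds, documents):
    """Average processing time for a single document."""
    return total_seconds / documents


if __name__ == "__main__":
    ratio = speedup(SEQUENTIAL_SECONDS, THREE_NODE_SECONDS)
    print(f"speedup on three nodes: {ratio:.2f}x")
    print(f"sequential: {seconds_per_document(SEQUENTIAL_SECONDS, DOCUMENTS):.4f} s/document")
    print(f"three nodes: {seconds_per_document(THREE_NODE_SECONDS, DOCUMENTS):.4f} s/document")
```

With the reported totals, the three-node run is roughly three times faster than the sequential run.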

Summary

Introduction

Solutions for processing big data are essential and beneficial to numerous fields of research and applications. Clustering can be considered the most significant unsupervised learning problem; like other problems of this kind, it deals with finding structure in a collection of unlabeled data. A cluster includes only objects that share similar characteristics. Clustering Using Representatives (CURE), a hierarchical clustering algorithm proposed by Guha et al. (1998), is an efficient data clustering algorithm for large databases. The objective of this study is to process numerous Vietnamese big data sets by using the CURE algorithm in the Cloudera distributed environment. The results of this study can be used to cross-check sentiment classification for various fields of research and commercial applications.
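
To make the core idea of CURE concrete, the following is a minimal single-machine Python sketch of the general technique: each cluster is summarized by a few well-scattered representative points that are shrunk toward the centroid, and the two clusters with the closest representatives are merged until `k` clusters remain. It is an illustration only, not the authors' Hadoop Map/Reduce implementation. The function names, the parameters `c` (number of representatives) and `alpha` (shrink factor), and the toy 2-D points are all assumptions made for this example; in a sentiment setting the points would be document vectors.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    center = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        candidates = [p for p in points if p not in chosen]
        if not candidates:  # only duplicates of chosen points remain
            break
        chosen.append(max(candidates,
                          key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen


def representatives(points, c=4, alpha=0.5):
    """Shrink the scattered points toward the centroid by a factor alpha."""
    center = centroid(points)
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center))
            for p in scattered_points(points, c)]


def cluster_distance(reps_a, reps_b):
    """Distance between clusters = closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


def cure(points, k, c=4, alpha=0.5):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        reps = [representatives(cl, c, alpha) for cl in clusters]
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(reps[ij[0]], reps[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for cluster in cure(toy_points, k=2):
        print(sorted(cluster))
```

This sketch recomputes representatives on every pass, which is simple but slow; the original CURE design additionally relies on sampling, partitioning, and indexing structures to scale to large databases, none of which are shown here.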

Methods

Results

Conclusion