Analysis of Various Techniques for Solving the Problem of Big Data Classification

Mei Lu

doi:10.15866/iremos.v14i4.20987

Abstract

The purpose of this study is to assess the effectiveness of several algorithms for big data classification, namely partial least squares discriminant analysis (PLS-DA), Naive Bayes (NBC), and K-Nearest Neighbor (KNN), implemented with the Hadoop MapReduce approach. The approaches are compared on the classification of big data sets of average shot lengths (CSV). The results show that as the data set size grows, the PLS-DA classification accuracy increases and reaches 82%, while the computation time rises to 45 seconds. The analysis of the classifiers shows that the high accuracy of PLS-DA results from a high percentage of correctly classified positive and negative cases, whereas the lower accuracy of KNN and Naive Bayes is explained by a high percentage of false positives and false negatives. It is concluded that the optimal classifier is the PLS-DA method, which classifies a large amount of data with high accuracy in a short time.
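The link between accuracy and the four confusion-matrix counts mentioned above can be illustrated with a minimal sketch. It uses the standard definition of accuracy for binary classification; the function name and the counts are hypothetical and are not taken from the study.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard binary-classification accuracy.

    tp, tn: correctly classified positive and negative cases.
    fp, fn: false positives and false negatives.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Only correct decisions (tp + tn) count toward accuracy, so every
    # false positive or false negative lowers the score.
    return (tp + tn) / total


# Hypothetical counts chosen for illustration only (not the paper's data).
few_errors = accuracy(tp=40, tn=35, fp=15, fn=10)
many_errors = accuracy(tp=30, tn=25, fp=25, fn=20)
print(few_errors, many_errors)
```

With the total number of cases held fixed, moving cases from the correct cells to the false-positive or false-negative cells can only reduce the value, which is the pattern the abstract describes for KNN and Naive Bayes relative to PLS-DA.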
