Abstract

Data classification is currently one of the most important approaches to data analysis. However, as data collection, transmission, and storage technologies have developed, the scale of data has increased sharply. In addition, because datasets often contain multiple classes with imbalanced distributions, the class-imbalance problem has become increasingly prominent. Traditional machine learning algorithms lack the ability to handle these issues, so classification efficiency and precision may be significantly degraded. This paper therefore presents an improved artificial neural network that enables high-performance classification of imbalanced, large-volume data. First, the Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, which aims to improve the training of the back propagation neural network (BPNN); zero-mean normalization, batch normalization, and the rectified linear unit (ReLU) are then employed to optimize the input and hidden layers of the BPNN. Finally, the ensemble-learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. Positive conclusions can be drawn from the experimental results. Benefitting from Borderline-SMOTE, the imbalanced training dataset can be balanced, which improves both the training performance and the classification accuracy. The improvements to the input and hidden layers also enhance training convergence. The parallelization and ensemble-learning techniques enable the BPNN to perform high-performance, large-scale data classification. The experimental results show the effectiveness of the presented classification algorithm.
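To make the balancing step concrete, the following is a minimal from-scratch sketch of the Borderline-SMOTE idea described in the abstract: minority samples whose neighbourhoods are majority-dominated (but not entirely majority) are treated as "danger" points, and synthetic minority samples are interpolated between each danger point and its minority-class neighbours. This is an illustrative reconstruction, not the paper's implementation; the function name and the parameters `k`, `n_new`, and `seed` are assumptions.

```python
import numpy as np

def borderline_smote(X, y, minority_label, k=5, n_new=100, seed=0):
    """Sketch of Borderline-SMOTE: oversample only 'danger' minority points.

    A minority sample is in DANGER when at least half (but not all) of its
    k nearest neighbours belong to another class; synthetic samples are then
    interpolated between each danger point and its minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]

    # Distances from each minority sample to every sample in the dataset.
    d = np.linalg.norm(X[None, :, :] - X_min[:, None, :], axis=2)
    # k nearest neighbours (column 0 is the point itself, distance 0).
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    maj_count = (y[nn] != minority_label).sum(axis=1)
    danger = X_min[(maj_count >= k / 2) & (maj_count < k)]

    if len(danger) == 0:          # no borderline points found
        return X, y

    # Minority-class neighbours of each danger point, for interpolation.
    d_min = np.linalg.norm(danger[:, None, :] - X_min[None, :, :], axis=2)
    nn_min = np.argsort(d_min, axis=1)[:, 1:k + 1]

    synth = []
    for _ in range(n_new):
        i = rng.integers(len(danger))
        j = nn_min[i, rng.integers(k)]
        gap = rng.random()        # interpolation factor in [0, 1)
        synth.append(danger[i] + gap * (X_min[j] - danger[i]))

    X_new = np.vstack([X, synth])
    y_new = np.concatenate([y, np.full(n_new, minority_label)])
    return X_new, y_new
```

The key design choice, per the Borderline-SMOTE method, is that only borderline minority samples spawn synthetic data, concentrating the oversampling where the classifier's decision boundary is actually contested.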

Highlights

  • Classification is one of the most effective approaches to analyzing digital data in many academic and research fields, for example, medical research [1,2,3,4,5,6] and power-system research [7,8,9,10,11,12]

  • In order to implement large-scale data classification, the Hadoop framework, based on the MapReduce computing model [51], is employed to parallelize the improved back propagation neural network (BPNN). This paper first separates the entire training dataset into a number of data chunks saved in HDFS (Hadoop Distributed File System), and each participating mapper initializes one sub-BPNN and inputs one data chunk, respectively

  • In order to support large-scale data classification, this paper presents a parallelized, improved BPNN algorithm
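The mapper/reducer workflow in the highlights above can be simulated on a single machine. In the sketch below, a trivial nearest-centroid model stands in for each sub-BPNN (training a real BPNN per chunk works the same way), each "mapper" trains on one data chunk, and the "reducer" combines the sub-models' predictions by weighted voting, with each model's vote weighted by its accuracy. All names here are illustrative assumptions; a real deployment would run the mappers as Hadoop tasks over HDFS chunks.

```python
import numpy as np

class CentroidClassifier:
    """Stand-in for a sub-BPNN: a trivial nearest-centroid model."""
    def fit(self, X, y):
        self.labels = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.labels])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return self.labels[np.argmin(d, axis=1)]

def map_train(chunk):
    """'Mapper': train one sub-model on one data chunk and score it there."""
    X, y = chunk
    model = CentroidClassifier().fit(X, y)
    weight = (model.predict(X) == y).mean()   # accuracy as the vote weight
    return model, weight

def reduce_vote(trained, X_test):
    """'Reducer': accuracy-weighted vote over the sub-models' predictions."""
    labels = trained[0][0].labels
    votes = np.zeros((len(X_test), len(labels)))
    for model, w in trained:
        pred = model.predict(X_test)
        for i, c in enumerate(labels):
            votes[pred == c, i] += w
    return labels[np.argmax(votes, axis=1)]
```

For example, `np.array_split` can play the role of the HDFS chunking step: split `X`/`y` into three chunks, call `map_train` on each, then classify with `reduce_vote`. The weighted vote lets better-trained sub-models dominate the ensemble decision.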


Summary

Research Article

Received December 2019; Revised February 2020; Accepted 5 May 2020; Published 18 May 2020. This paper presents an improved artificial neural network enabling high-performance classification of imbalanced, large-volume data. The Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, which aims to improve the training of the back propagation neural network (BPNN), and zero-mean normalization, batch normalization, and the rectified linear unit (ReLU) are further employed to optimize the input and hidden layers of the BPNN. The ensemble-learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. Benefitting from Borderline-SMOTE, the imbalanced training dataset can be balanced, which improves the training performance and the classification accuracy. The improvements to the input and hidden layers enhance training convergence. The parallelization and ensemble-learning techniques enable the BPNN to implement high-performance, large-scale data classification. The experimental results show the effectiveness of the presented classification algorithm.
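The layer-level improvements mentioned above (zero-mean input, batch normalization, ReLU activation) can be sketched as a single forward pass. This is an illustrative numerical sketch under the standard definitions of these operations, not the paper's code; `gamma` and `beta` are the usual learnable batch-norm parameters, left here at fixed default values.

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations to zero mean and unit variance
    per feature, then apply the learnable scale (gamma) and shift (beta)."""
    mu = h.mean(axis=0)
    var = h.var(axis=0)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def relu(h):
    """Rectified linear unit: max(0, h), applied element-wise."""
    return np.maximum(0.0, h)

def hidden_forward(X, W, b):
    """One improved hidden layer: affine transform -> batch-norm -> ReLU."""
    return relu(batch_norm(X @ W + b))
```

Zero-mean preprocessing of the input layer is then simply `X - X.mean(axis=0)` before the first `hidden_forward` call; keeping activations centred and normalized is what stabilizes gradients and speeds up BPNN convergence.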

Introduction
Scientific Programming

[Full-text figure and table labels: BPNN input and hidden layers with ReLU; weighted voting over data blocks; the class-balancing algorithm; experimental metrics (testing instance number, training error, minimum/maximum accuracy, batch size) comparing a parallelized LSTM, a standalone BPNN, and the parallelized BPNN]

Conclusion
