Abstract

In this paper, we introduce a theoretical basis for a Hadoop-based neural network for parallel and distributed feature selection in Big Data sets. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits well with the Hadoop paradigm. Many feature selectors are described in the literature, each with various strengths and weaknesses. We present the implementation details of five feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop YARN. Hadoop allows parallel and distributed processing: each feature selector can be divided into subtasks that are processed in parallel, and multiple feature selectors can also be processed simultaneously, allowing them to be compared. We identify commonalities among the five feature selectors. All can be processed in the framework using a single representation, and the overall processing can be greatly reduced by processing the common aspects of the feature selectors only once and propagating these aspects across all five feature selectors as necessary. This allows both the best feature selector and the actual features to select to be identified for large, high-dimensional data sets by exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop.
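As an illustrative sketch of the reuse idea described above (not the paper's actual AURA/Hadoop implementation), the feature/class co-occurrence counts are a common aspect that can be computed once and then fed to several feature-selection scores, such as mutual information and chi-square, without re-scanning the data. All function names here are hypothetical:

```python
import math
from collections import Counter

def cooccurrence_counts(feature_values, class_labels):
    """Count joint (feature value, class label) occurrences in one pass.

    These counts are the shared 'common aspect': several feature-selection
    scores can be derived from them without re-reading the data.
    """
    joint = Counter(zip(feature_values, class_labels))
    f_marg = Counter(feature_values)
    c_marg = Counter(class_labels)
    return joint, f_marg, c_marg, len(feature_values)

def mutual_information(joint, f_marg, c_marg, n):
    """MI score for the feature, derived from the shared counts."""
    mi = 0.0
    for (f, c), nfc in joint.items():
        p_fc = nfc / n
        mi += p_fc * math.log2(p_fc / ((f_marg[f] / n) * (c_marg[c] / n)))
    return mi

def chi_square(joint, f_marg, c_marg, n):
    """Chi-square score for the feature, from the same shared counts."""
    chi = 0.0
    for f in f_marg:
        for c in c_marg:
            expected = f_marg[f] * c_marg[c] / n
            observed = joint.get((f, c), 0)
            chi += (observed - expected) ** 2 / expected
    return chi

# One pass over the data feeds both selectors.
feature = [1, 1, 0, 0, 1, 0]
labels = ['a', 'a', 'b', 'b', 'a', 'b']
counts = cooccurrence_counts(feature, labels)
print(mutual_information(*counts))  # perfectly predictive feature -> 1.0 bit
print(chi_square(*counts))          # full association -> equals n = 6.0
```

In a Hadoop setting, the counting step would be the map/reduce stage over the distributed data, with the cheap per-selector scoring applied to the aggregated counts afterwards.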

Highlights

  • The meaning of ‘‘big’’ with respect to data is specific to each application domain and dependent on the computational resources available

  • We showed in Hodge et al. (2006) that using the Advanced Uncertain Reasoning Architecture (AURA) speeds up the Mutual Information (MI) feature selector by over 100 times compared to a standard implementation of MI

  • This requires more coordination in the Hadoop framework, as the data for the feature value may not be stored with the data for the class; they may be in different Correlation Matrix Memory (CMM) stripes

Introduction

The meaning of ‘‘big’’ with respect to data is specific to each application domain and dependent on the computational resources available. We have previously developed a k-NN classification (Hodge & Austin, 2005; Weeks, Hodge, O’Keefe, Austin, & Lees, 2003) and prediction algorithm (Hodge, Krishnan, Austin, & Polak, 2011) using an associative memory (binary) neural network called the Advanced Uncertain Reasoning Architecture (AURA) (Austin, 1995). This multi-faceted k-NN motivated a unified feature selection framework exploiting the speed and storage efficiency of the associative memory neural network. Researchers have parallelised individual feature selection algorithms using MapReduce/Hadoop (Chu et al., 2007; Reggiani, 2013; Singh, Kubica, Larsen, & Sorokina, 2009; Sun, 2014). Data mining libraries such as Mahout (https://mahout.apache.org) and MLlib (https://spark.apache.org/mllib/) and data mining frameworks such as Radoop (https://rapidminer.com/products/radoop/) include a large number of data mining algorithms, including feature selectors. However, they do not explicitly tackle processing reuse with a view to multi-user and multi-task resource allocation.
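The AURA details appear in the sections below. As a rough illustrative sketch of the underlying idea (a toy, not the paper's implementation), a binary Correlation Matrix Memory stores associations between binary input and output vectors by superimposing their outer products into a Boolean weight matrix; recall is a matrix-vector sum thresholded at the number of set input bits:

```python
class CMM:
    """Toy binary Correlation Matrix Memory (CMM) sketch."""

    def __init__(self, in_bits, out_bits):
        # Boolean weight matrix, trained by OR-ing in outer products.
        self.w = [[0] * out_bits for _ in range(in_bits)]

    def train(self, in_vec, out_vec):
        # Hebbian-style binary update: set w[i][j] where both bits are set.
        for i, a in enumerate(in_vec):
            if a:
                for j, b in enumerate(out_vec):
                    if b:
                        self.w[i][j] = 1

    def recall(self, in_vec):
        # Sum the rows selected by set input bits, then threshold at the
        # number of set input bits to recover the stored output pattern.
        sums = [0] * len(self.w[0])
        active = 0
        for i, a in enumerate(in_vec):
            if a:
                active += 1
                for j, wij in enumerate(self.w[i]):
                    sums[j] += wij
        return [1 if s >= active else 0 for s in sums]

cmm = CMM(in_bits=4, out_bits=4)
cmm.train([1, 0, 1, 0], [0, 1, 0, 0])
cmm.train([0, 1, 0, 1], [0, 0, 1, 0])
print(cmm.recall([1, 0, 1, 0]))  # recovers [0, 1, 0, 0]
```

Because both training and recall decompose into independent row and column operations on a binary matrix, the matrix can be split into stripes and processed in parallel, which is what makes this structure a natural fit for Hadoop-style distribution.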

Binary neural networks
AURA recall
Feature selection
Mutual information feature selection
Correlation-based feature subset selection
Gain ratio feature selection
Chi-square algorithm
Odds ratio
Parallel and distributed AURA
Parallel AURA
Distributed AURA
Hadoop feature selection
Analysis of AURA feature selection
Conclusion