Abstract

In this paper, we introduce a theoretical basis for a Hadoop-based neural network for parallel and distributed feature selection in Big Data sets. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits well with the Hadoop paradigm. Many feature selectors are described in the literature, each with various strengths and weaknesses. We present the implementation details of five feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop YARN. Hadoop allows parallel and distributed processing: each feature selector can be divided into subtasks that are processed in parallel, and multiple feature selectors can also be processed simultaneously, allowing them to be compared. We identify commonalities among the five feature selectors. All can be processed in the framework using a single representation, and the overall processing can be greatly reduced by processing the common aspects of the feature selectors only once and propagating these aspects across all five feature selectors as necessary. This allows both the best feature selector and the actual features to select to be identified for large, high-dimensional data sets by exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop.
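As an illustrative sketch of the reuse idea described above (not the paper's actual AURA/Hadoop implementation), the feature/class co-occurrence counts are a common aspect that can be computed once and then fed to several feature-selection scores, such as mutual information and chi-square, without re-scanning the data. All function names here are hypothetical:

```python
import math
from collections import Counter

def cooccurrence_counts(feature_values, class_labels):
    """Count joint (feature value, class label) occurrences in one pass.

    These counts are the shared 'common aspect': several feature-selection
    scores can be derived from them without re-reading the data.
    """
    joint = Counter(zip(feature_values, class_labels))
    f_marg = Counter(feature_values)
    c_marg = Counter(class_labels)
    return joint, f_marg, c_marg, len(feature_values)

def mutual_information(joint, f_marg, c_marg, n):
    """MI score for the feature, derived from the shared counts."""
    mi = 0.0
    for (f, c), nfc in joint.items():
        p_fc = nfc / n
        mi += p_fc * math.log2(p_fc / ((f_marg[f] / n) * (c_marg[c] / n)))
    return mi

def chi_square(joint, f_marg, c_marg, n):
    """Chi-square score for the feature, from the same shared counts."""
    chi = 0.0
    for f in f_marg:
        for c in c_marg:
            expected = f_marg[f] * c_marg[c] / n
            observed = joint.get((f, c), 0)
            chi += (observed - expected) ** 2 / expected
    return chi

# One pass over the data feeds both selectors.
feature = [1, 1, 0, 0, 1, 0]
labels = ['a', 'a', 'b', 'b', 'a', 'b']
counts = cooccurrence_counts(feature, labels)
print(mutual_information(*counts))  # perfectly predictive feature -> 1.0 bit
print(chi_square(*counts))          # full association -> equals n = 6.0
```

In a Hadoop setting, the counting step would be the map/reduce stage over the distributed data, with the cheap per-selector scoring applied to the aggregated counts afterwards.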

Highlights

  • The meaning of ‘‘big’’ with respect to data is specific to each application domain and dependent on the computational resources available

  • We showed in Hodge et al. (2006) that using the Advanced Uncertain Reasoning Architecture (AURA) speeds up the Mutual Information (MI) feature selector by over 100 times compared to a standard implementation of MI

  • This requires more coordination in the Hadoop framework, as the data for the feature value may not be stored with the data for the class; they may be in different Correlation Matrix Memory (CMM) stripes

Introduction

The meaning of ‘‘big’’ with respect to data is specific to each application domain and dependent on the computational resources available. We have previously developed a k-NN classification (Hodge & Austin, 2005; Weeks, Hodge, O’Keefe, Austin, & Lees, 2003) and prediction algorithm (Hodge, Krishnan, Austin, & Polak, 2011) using an associative memory (binary) neural network called the Advanced Uncertain Reasoning Architecture (AURA) (Austin, 1995). This multi-faceted k-NN motivated a unified feature selection framework exploiting the speed and storage efficiency of the associative memory neural network. Researchers have parallelised individual feature selection algorithms using MapReduce/Hadoop (Chu et al., 2007; Reggiani, 2013; Singh, Kubica, Larsen, & Sorokina, 2009; Sun, 2014). Data mining libraries such as Mahout (https://mahout.apache.org) and MLlib (https://spark.apache.org/mllib/) and data mining frameworks such as Radoop (https://rapidminer.com/products/radoop/) include a large number of data mining algorithms, including feature selectors. However, they do not explicitly tackle processing reuse with a view to multi-user and multi-task resource allocation.
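The AURA details appear in the sections below. As a rough illustrative sketch of the underlying idea (a toy, not the paper's implementation), a binary Correlation Matrix Memory stores associations between binary input and output vectors by superimposing their outer products into a Boolean weight matrix; recall is a matrix-vector sum thresholded at the number of set input bits:

```python
class CMM:
    """Toy binary Correlation Matrix Memory (CMM) sketch."""

    def __init__(self, in_bits, out_bits):
        # Boolean weight matrix, trained by OR-ing in outer products.
        self.w = [[0] * out_bits for _ in range(in_bits)]

    def train(self, in_vec, out_vec):
        # Hebbian-style binary update: set w[i][j] where both bits are set.
        for i, a in enumerate(in_vec):
            if a:
                for j, b in enumerate(out_vec):
                    if b:
                        self.w[i][j] = 1

    def recall(self, in_vec):
        # Sum the rows selected by set input bits, then threshold at the
        # number of set input bits to recover the stored output pattern.
        sums = [0] * len(self.w[0])
        active = 0
        for i, a in enumerate(in_vec):
            if a:
                active += 1
                for j, wij in enumerate(self.w[i]):
                    sums[j] += wij
        return [1 if s >= active else 0 for s in sums]

cmm = CMM(in_bits=4, out_bits=4)
cmm.train([1, 0, 1, 0], [0, 1, 0, 0])
cmm.train([0, 1, 0, 1], [0, 0, 1, 0])
print(cmm.recall([1, 0, 1, 0]))  # recovers [0, 1, 0, 0]
```

Because both training and recall decompose into independent row and column operations on a binary matrix, the matrix can be split into stripes and processed in parallel, which is what makes this structure a natural fit for Hadoop-style distribution.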

Binary neural networks
AURA recall
Feature selection
Mutual information feature selection
Correlation-based feature subset selection
Gain ratio feature selection
Chi-square algorithm
Odds ratio
Parallel and distributed AURA
Parallel AURA
Distributed AURA
Hadoop feature selection
Analysis of AURA feature selection
Conclusion