Abstract

Feature selection is one of the most significant steps in machine learning: it reduces the feature space to achieve faster learning and yields simpler models with high accuracy and interpretability. With the rapid development of data-generating technologies, large-scale, high-dimensional datasets are now common, and they degrade the performance of traditional feature selection techniques, which suffer from scalability issues. Parallel feature selection is a natural solution to this problem, and with the advent of many distributed computing frameworks, scalable computation has become a viable strategy for feature selection. The present work proposes a distributed parallel feature selection technique that partitions the dataset vertically (by features) to exploit parallel computation. It uses an information gain filter-based ranking method that evaluates multiple disjoint feature subsets of the dataset in parallel. The key idea is to distribute the evaluation and rank generation of features across several computing nodes. Experiments performed on multiple large-scale, high-dimensional datasets show a significant reduction in overall computation time.
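
The core scheme described above (vertical partitioning of feature columns, per-partition information gain scoring, and a merged global ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes discrete-valued features, substitutes Python's multiprocessing module for a true distributed framework, and every name in it (entropy, ig_scores, rank_features_parallel) is illustrative.

```python
# Minimal sketch of vertically partitioned, information-gain-based feature
# ranking. Feature columns are split into disjoint subsets, each subset is
# scored on a separate worker, and the per-subset scores are merged into one
# global ranking. Assumes discrete features; not the authors' code.
import numpy as np
from multiprocessing import Pool

def entropy(labels):
    """Shannon entropy of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def ig_scores(args):
    """Information gain of each feature column in one vertical partition."""
    X_part, y, col_ids = args
    h_y = entropy(y)
    scores = []
    for j, col in zip(col_ids, X_part.T):
        values, counts = np.unique(col, return_counts=True)
        # Conditional entropy H(y | feature j), weighted over feature values.
        h_y_given_x = sum(
            (c / len(col)) * entropy(y[col == v])
            for v, c in zip(values, counts)
        )
        scores.append((int(j), h_y - h_y_given_x))
    return scores

def rank_features_parallel(X, y, n_workers=4):
    """Split columns into disjoint subsets, score each subset in parallel,
    and merge the (feature, gain) pairs into one descending ranking."""
    splits = np.array_split(np.arange(X.shape[1]), n_workers)
    tasks = [(X[:, idx], y, idx) for idx in splits]
    with Pool(n_workers) as pool:
        merged = [s for part in pool.map(ig_scores, tasks) for s in part]
    return sorted(merged, key=lambda t: t[1], reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(1000, 20))   # discrete features
    y = (X[:, 0] + X[:, 5] > 2).astype(int)   # labels depend on columns 0 and 5
    print(rank_features_parallel(X, y)[:5])   # top-5 (feature, gain) pairs
```

In the distributed setting the abstract describes, each partition would reside on a separate computing node rather than a local worker process, but the merge step remains cheap either way because only (feature index, score) pairs travel back to the coordinator.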
