Abstract

In the last few years, we have witnessed the advent of Big Data and, more specifically, Big Dimensionality, which refers to the unprecedented number of features that is rendering existing machine learning methods inadequate. To cope with these high-dimensional spaces, a common solution is to apply data preprocessing techniques that reduce the dimensionality of the problem. Feature selection is one of the most popular dimensionality reduction techniques. It can be defined as the process of detecting the relevant features and discarding the irrelevant and redundant ones. In addition, discretization can help to reduce the size and complexity of a problem in Big Data settings by mapping data from a large domain of numeric values to a small set of categorical values. This chapter describes these preprocessing techniques in detail and provides examples of new implementations developed to deal with Big Data.
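The two preprocessing steps summarized above can be illustrated with a minimal sketch. The snippet below is not the chapter's Big Data implementation (which targets distributed settings); it only shows, under the assumption of a small in-memory dataset and the scikit-learn library, what feature selection followed by discretization looks like in practice.

```python
# Minimal sketch (assumption: small in-memory data, scikit-learn available).
# It illustrates the two preprocessing steps discussed in the abstract:
# feature selection followed by discretization.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data: 1000 samples, 50 numeric features, only 5 informative.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           n_redundant=10, random_state=0)

# Feature selection: keep the 5 features most associated with the class
# (ANOVA F-test), discarding irrelevant and redundant ones.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

# Discretization: map each remaining numeric feature to a small set of
# categorical values (here, 5 equal-width bins).
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_discrete = discretizer.fit_transform(X_selected)
print("Reduced, discretized data shape:", X_discrete.shape)
```

In a Big Data scenario the same two steps would typically run on a distributed framework over far larger feature spaces, which is the setting the chapter's implementations address.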
