Abstract

There are many transformations that can make real-world datasets more amenable to the learning algorithms discussed in the rest of the book. We first consider methods for attribute selection, which remove attributes that are not useful for the task at hand. Then we look at discretization methods: algorithms for turning numeric attributes into discrete ones. Next we discuss several techniques for projecting data into a space that is more suitable for learning: well-known methods for dimensionality reduction, including unsupervised approaches such as principal component analysis, independent component analysis, and random projections, as well as supervised approaches such as partial least squares regression and linear discriminant analysis. We also consider how to turn textual data into numeric attribute vectors so that standard learning techniques can be applied, and present simple methods for approaching time series data. The last four sections deal with data sampling, data cleansing, generic approaches to multiclass classification, and calibration of class probabilities. Sampling is nontrivial when the data arrives as a stream, and we discuss the “reservoir” method for taking an unbiased sample in this case. Data cleansing can be performed by iteratively applying standard supervised learning algorithms to remove outliers, but dedicated techniques for anomaly detection and so-called “one-class learning” are also applicable. For multiclass classification problems, we consider several ways of decomposing them into a set of two-class problems, e.g., by applying error-correcting output codes. Finally, we describe how to calibrate class probability estimates to improve their accuracy.
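To illustrate the reservoir method mentioned above, here is a minimal Python sketch of the classic algorithm (Vitter's Algorithm R), which maintains a uniform sample of k items from a stream of unknown length. The function name `reservoir_sample` and its parameters are illustrative choices, not taken from the chapter itself.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length.

    After n items have arrived, each of them is in the reservoir with
    probability k/n (Vitter's Algorithm R), so the sample is unbiased.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)        # uniform index in [0, n)
            if j < k:                   # with probability k/n, replace a reservoir slot
                reservoir[j] = item
    return reservoir

# Example: draw 5 items from a "stream" of a million integers in one pass.
print(reservoir_sample(iter(range(1_000_000)), k=5))
```

The key property is that a single pass and O(k) memory suffice, which is what makes this approach suitable for streaming data where the total number of instances is not known in advance.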
