Abstract
The goal of simultaneous feature selection and outlier detection is to determine a sparse linear regression vector by fitting a dataset possibly contaminated by outliers. The problem is well known in the literature, and in its basic version it covers a wide range of tasks in data analysis. Performing feature selection and outlier detection simultaneously strongly improves the application potential of regression models in more general settings where data governance is a concern. To unlock this potential, flexible training models are needed, with more parameters under the control of decision makers. The use of mathematical programming, although pertinent, is scarce in this context and mostly focuses on the least-squares setting. Instead, we consider the least absolute deviation criterion, proposing two mixed-integer linear programs: one adapted from existing studies, the other obtained from a disjunctive programming argument. We show theoretically and computationally that the disjunctive-based formulation is superior in terms of both continuous relaxation quality and integer optimality convergence. We experimentally benchmark against existing methodologies from the literature and identify the contamination patterns under which mathematical programming outperforms state-of-the-art algorithms in combining prediction quality, sparsity, and robustness against outliers. Additionally, the mathematical programming approaches allow the decision maker to directly control parameters such as the number of features to select or the number of outliers to tolerate, with those based on least absolute deviation performing best. On real-world datasets where privacy is a concern, our approach compares well to state-of-the-art methods in terms of accuracy, while being more flexible.
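To make the problem class concrete, the following is a generic big-M sketch of a mixed-integer linear program for least absolute deviation regression with simultaneous feature selection and outlier detection. The notation is ours, not taken from the paper: observations $(x_i, y_i)$ for $i = 1, \dots, n$, coefficients $\beta \in \mathbb{R}^p$, binary feature indicators $s_j$, binary outlier indicators $z_i$, user-chosen budgets $k$ (features) and $q$ (outliers), and a sufficiently large constant $M$.

```latex
\begin{align}
\min_{\beta,\, r,\, s,\, z} \quad & \sum_{i=1}^{n} r_i \\
\text{s.t.} \quad & -r_i - M z_i \;\le\; y_i - x_i^{\top}\beta \;\le\; r_i + M z_i,
  && i = 1, \dots, n, \\
& -M s_j \;\le\; \beta_j \;\le\; M s_j, && j = 1, \dots, p, \\
& \sum_{j=1}^{p} s_j \le k, \qquad \sum_{i=1}^{n} z_i \le q, \\
& r_i \ge 0, \quad s_j \in \{0,1\}, \quad z_i \in \{0,1\}.
\end{align}
```

Here $r_i$ linearizes the absolute residual $|y_i - x_i^{\top}\beta|$; setting $z_i = 1$ relaxes observation $i$'s residual constraints by $M$, so a flagged outlier no longer drives the objective, while $s_j = 0$ forces $\beta_j = 0$, enforcing sparsity. The budgets $k$ and $q$ are exactly the directly controllable parameters the abstract refers to; the disjunctive formulation studied in the paper replaces the big-M device with a tighter construction.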