Abstract

In this paper, we address the problem of best subset selection in logistic regression. In particular, we consider formulations of the problem that adopt information criteria, such as AIC or BIC, as goodness-of-fit measures. Various methods exist to tackle this problem. Heuristic methods are computationally cheap but usually find only low-quality solutions, and methods based on local optimization suffer from similar limitations. On the other hand, methods based on mixed-integer reformulations of the problem are much more effective, at the cost of higher computational requirements that become unsustainable as the problem size grows. We thus propose a new approach that combines mixed-integer programming and decomposition techniques to overcome these scalability issues. We provide a theoretical characterization of the properties of the proposed algorithm. The results of an extensive numerical experiment on widely available datasets show that the proposed method outperforms state-of-the-art techniques.

Highlights

  • In statistics and machine learning, binary classification is one of the most recurring and relevant tasks

  • Logistic regression belongs to the class of Generalized Linear Models and possesses a number of useful properties: it is relatively simple; it is readily interpretable; its outputs are informative, as they have a probabilistic interpretation; statistical confidence measures can quickly be obtained; the model can be updated by simple gradient descent steps when new data become available; and in practice it often has good predictive performance, especially when the training data are too limited to exploit more complex models

  • The rest of the manuscript is organized as follows: in Sect. 2, we formally introduce the problem of best subset selection in logistic regression, state optimality conditions and provide a brief review of a related approach


Introduction

In statistics and machine learning, binary classification is one of the most recurring and relevant tasks. This problem consists of identifying a model, selected from a hypothesis space, that is able to separate samples characterized by a well-defined set of numerical features and belonging to two different classes. We are interested in the problem of best feature subset selection in logistic regression. This variant of standard logistic regression requires finding a model that, in addition to accurately fitting the data, exploits a limited number of features. In this way, the obtained model employs only the most relevant features, with benefits in terms of both performance and interpretability.
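To make the task concrete, the following sketch (not the paper's algorithm) performs best subset selection for logistic regression by exhaustive enumeration, scoring each candidate subset with BIC, i.e. $k \ln n - 2\,\ell(\hat\beta)$ for a subset of size $k$. All function names and the synthetic data are illustrative assumptions; enumeration is feasible only for a handful of features, which is precisely the scalability gap that mixed-integer and decomposition approaches aim to close.

```python
# Illustrative sketch: exhaustive best subset selection for logistic
# regression scored by BIC. NumPy-only; the maximum-likelihood fit uses
# plain gradient ascent, which suffices for this tiny example.
from itertools import combinations
import numpy as np

def fit_logistic(X, y, iters=300, lr=0.5):
    """Fit logistic regression weights by gradient ascent on the log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w += lr * X.T @ (y - p) / len(y)   # average log-likelihood gradient
    return w

def log_likelihood(X, y, w):
    z = X @ w
    return float(np.sum(y * z - np.log1p(np.exp(z))))

def best_subset_bic(X, y, max_size):
    """Enumerate all feature subsets up to max_size; return (BIC, subset) minimizing BIC."""
    n, d = X.shape
    best = (np.inf, ())
    for k in range(1, max_size + 1):
        for S in combinations(range(d), k):
            w = fit_logistic(X[:, S], y)
            bic = k * np.log(n) - 2.0 * log_likelihood(X[:, S], y, w)
            best = min(best, (bic, S))
    return best

# Tiny synthetic example: only feature 0 actually drives the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
bic, subset = best_subset_bic(X, y, max_size=2)
print(subset)  # the informative feature 0 should be selected
```

The number of candidate subsets grows combinatorially in the number of features, so this brute-force scheme illustrates the problem statement rather than a practical solution method.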
