Abstract

High-dimensional classification studies have become widespread across many domains. Large dimensionality, coupled with the possible presence of data contamination, motivates robust, sparse estimation methods that improve model interpretability and ensure that the majority of observations agree with the underlying parametric model. In this study, we propose a robust and sparse estimator for logistic regression models that simultaneously tackles the presence of outliers and/or irrelevant features. Specifically, we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem in a framework that allows one to pursue optimality guarantees. We apply our proposal to investigate the main drivers of honey bee (Apis mellifera) loss through the annual winter loss survey data collected by the Pennsylvania State Beekeepers Association. Previous studies mainly focused on predictive performance; our approach, in contrast, produces a more interpretable classification model and provides evidence of several outlying observations within the survey data. We compare our proposal with existing heuristic methods and non-robust procedures, demonstrating its effectiveness. Beyond the honey bee application, we present a simulation study in which our proposal outperforms competing methods across most performance measures and settings.

Highlights


  • Sensitivity is defined as (# true positives)/(# true positives + # false negatives) and specificity as (# true negatives)/(# true negatives + # false positives). While these are functions of the sparsity level imposed on MIP and MIProb, for enetLTS and Lasso we report mean values across eight repetitions due to the intrinsic randomness induced by cross-validation methods

  • In the following we present the results based on k_p = 8, where the balanced accuracy for both methods is very close to their respective maxima
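The definitions of sensitivity, specificity, and the balanced accuracy used to compare methods translate directly into code. A minimal sketch (the helper name and toy labels are illustrative; both classes are assumed to occur in `y_true`):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative
    rate) from binary labels in {0, 1}; assumes both classes occur."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0])
# Balanced accuracy averages the two rates, so it is not dominated
# by the majority class: (0.5 + 1.0) / 2 = 0.75
balanced_accuracy = (sens + spec) / 2
```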


Introduction

Logistic regression is widely used to solve classification tasks and provides a probabilistic relation between a set of covariates (i.e., features, variables, or predictors) and a binary or multi-class response [1,2]. Since the log-odds ratio depends linearly on the set of covariates included in the model, adversarial contamination of the covariates may create bad leverage points that break down maximum likelihood (ML)-based approaches [5]. This motivates the development of robust estimation techniques. In our analysis, based on a logistic regression model, we are able to exclude redundant features from the fit while accounting for potential data contamination through an estimation approach that simultaneously addresses sparsity and statistical robustness. This provides important insights into the main drivers of honey bee loss during overwintering, such as exposure to pesticides, the average temperature of the driest quarter, and the precipitation level during the warmest quarter.
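The double combinatorial structure described above, choosing which features enter the model while deciding which observations to trust, can be illustrated with a deliberately naive, pure-Python sketch: for each single-feature support (k = 1), fit an unpenalized logistic model by gradient descent, keep the h best-fitting observations (a trimming step), refit, and score supports by trimmed deviance. This brute-force heuristic is only a stand-in for the paper's exact mixed-integer conic search; all function names and the toy data are illustrative, not the authors' implementation.

```python
import math

def sigmoid(z):
    z = max(min(z, 35.0), -35.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(x, y, lr=0.5, iters=500):
    """Unpenalized fit of P(y=1|x) = sigmoid(b0 + b1*x) by gradient descent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            r = sigmoid(b0 + b1 * xi) - yi   # residual p - y
            g0 += r / n
            g1 += r * xi / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

def deviances(x, y, b0, b1):
    """Per-observation deviance contributions, -2 * log-likelihood."""
    eps = 1e-12
    out = []
    for xi, yi in zip(x, y):
        p = min(max(sigmoid(b0 + b1 * xi), eps), 1.0 - eps)
        out.append(-2.0 * (yi * math.log(p) + (1.0 - yi) * math.log(1.0 - p)))
    return out

def robust_single_feature_selection(X, y, h):
    """Enumerate all k = 1 supports; for each, fit on all data, keep the h
    observations with smallest deviance (trimming), refit on them, and
    score the support by its trimmed deviance. Returns (score, feature)."""
    best_score, best_j = float("inf"), None
    for j in range(len(X[0])):
        xj = [row[j] for row in X]
        b0, b1 = fit_logistic_1d(xj, y)
        dev = deviances(xj, y, b0, b1)
        keep = sorted(range(len(y)), key=dev.__getitem__)[:h]
        xk = [xj[i] for i in keep]
        yk = [y[i] for i in keep]
        b0, b1 = fit_logistic_1d(xk, yk)
        score = sum(deviances(xk, yk, b0, b1))
        if score < best_score:
            best_score, best_j = score, j
    return best_score, best_j

# Toy data: feature 0 drives the response, feature 1 is pure noise;
# two labels are flipped at the extremes to mimic bad leverage points.
x0 = [-2.0 + 0.2 * i for i in range(20)]
y = [1 if v > 0 else 0 for v in x0]
y[0], y[19] = 1, 0                      # contaminate the extremes
X = [[x0[i], 1.0 if i % 2 else -1.0] for i in range(20)]
score, j = robust_single_feature_selection(X, y, h=16)
```

Because the trimming step discards the contaminated extremes before refitting, the informative feature (index 0) is selected despite the flipped labels, whereas a non-robust fit would let the bad leverage points distort the comparison.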

Background
Penalized Logistic Regression
Robust Logistic Regression
MIProb
Algorithmic Implementation
Additional Details
Simulation Study
Method
Investigating Overwintering Honey Bee Loss in Pennsylvania
Model Formulation and Data
Results
Discussion