Abstract

In this work, we examine classification methods for positive and unlabeled (PU) data in which the conditional distribution of the true class label given the feature vector follows a logistic regression model. Our first objective is to compute and compare selected metrics for assessing the quality of these methods. In this context, we investigate four methods for estimating the posterior probability by optimizing the risk of the logistic loss function: the naive approach, the weighted likelihood approach, and two recently proposed methods, the joint approach and the LassoJoint method. The evaluations are performed for 13 machine learning schemes on selected low- and high-dimensional datasets. Some of these schemes are taken directly from the literature, while others are obtained by modifying existing procedures. Our second goal is to identify the most stable and efficient approach to posterior probability estimation. In addition, we use the AdaSampling scheme to compare the considered classification methods, and we compare two feature selection procedures: the Mutual Information-based feature selection method and the LassoJoint approach. This article is an extended version of the conference paper by Furmańczyk et al. (2022).
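To make the PU setting concrete, the sketch below illustrates the naive baseline under the SCAR (selected completely at random) assumption: only an observed label s is available (s = 1 for labeled positives, s = 0 for unlabeled examples), and an ordinary logistic regression of s on the features estimates P(s = 1 | x) = c · P(y = 1 | x), where c is the label frequency. This is not the authors' code; all variable names, the synthetic data, and the final rescaling step are illustrative assumptions.

```python
# Minimal sketch of the naive PU baseline (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic PU data: the true class y is hidden; we only observe s,
# which is 1 for a labeled positive and 0 for an unlabeled example.
n, p = 1000, 5
X = rng.normal(size=(n, p))
beta = np.array([1.5, -1.0, 0.5, 0.0, 0.0])
prob_y = 1.0 / (1.0 + np.exp(-X @ beta))   # true posterior P(y = 1 | x)
y = rng.binomial(1, prob_y)                # true (unobserved) class labels
c = 0.3                                    # label frequency P(s = 1 | y = 1) under SCAR
s = y * rng.binomial(1, c, size=n)         # observed PU labels

# Naive approach: treat unlabeled examples as negatives and fit logistic
# regression of s on X. Under SCAR this estimates c * P(y = 1 | x), so the
# fitted posterior is biased downward unless rescaled by an estimate of c.
naive = LogisticRegression().fit(X, s)
posterior_naive = naive.predict_proba(X)[:, 1] / c  # illustrative rescaling step
```

In practice c is unknown and must itself be estimated, which is one motivation for the weighted, joint, and LassoJoint approaches compared in the paper.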
