Abstract

In this paper, we revisit the problem of class prior probability estimation with positive and unlabelled data gathered in a single-sample scenario. The task is important because, in the positive-unlabelled setting, a classifier can be successfully learned if the class prior is available. We show that without additional assumptions the class prior probability is not identifiable, and thus the existing non-parametric estimators are necessarily biased in general. The magnitude of their bias is also investigated. The problem becomes identifiable when the probabilistic structure satisfies mild semi-parametric assumptions. Consequently, we propose a method based on a logistic fit and a concave minorization of its (non-concave) log-likelihood. The experiments conducted on artificial and benchmark datasets, as well as on the large clinical database MIMIC, indicate that the estimation errors for the proposed method are usually lower than for its competitors and that it is robust against departures from logistic settings.

Highlights

  • Positive and unlabelled (PU) learning focuses on the setting where the data contains labelled positive examples and unlabelled ones

  • In this paper, we revisit the problem of class prior probability estimation with positive and unlabelled data gathered in a single-sample scenario

  • When classifying web page preferences, some web pages can be bookmarked as positive (S = 1) by the user whereas all other pages are treated as unlabelled (S = 0)


Summary

Paweł Teisseyre

Positive and unlabelled (PU) learning focuses on the setting where the data contains labelled positive examples and unlabelled ones. The PU setting can be seen as a special case of the more general problem of learning from noisy labels (Natarajan et al 2013; Frenay and Verleysen 2014), in which labels are incorrectly assigned. In such a general scenario, the value of the true class variable Y can be flipped with some probability, i.e. instead of Y we observe S = 1 − Y. The class prior is usually not known (except in situations when, for example, disease prevalence is known or can be learnt from other studies) and the problem of its estimation from PU data has attracted significant attention (Elkan and Noto 2008; Jain et al 2016; Plessis et al 2017; Bekker and Davis 2018). The existing methods; Sect. 5.1 introduces the novel method, Sect. 5.2 compares it with the MLR method (Jaskie et al 2020), Sect. 6 summarizes the results of numerical experiments and Sect. 7 concludes the paper.
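To make the single-sample PU setting concrete, the following is a minimal simulation sketch, not the paper's method. It assumes, purely for illustration, that each positive example is labelled independently with a constant probability; the names `simulate_pu`, `prior` and `label_freq` are hypothetical and do not come from the paper. Under that assumption the share of labelled examples is roughly `prior * label_freq`, so it understates the class prior whenever some positives stay unlabelled.

```python
import random


def simulate_pu(n, prior, label_freq, seed=0):
    """Draw n (y, s) pairs under a single-sample PU scheme.

    y is the true class (hidden in practice), s is the observed label indicator.
    Assumption (illustrative only): positives are labelled independently
    with constant probability label_freq; negatives are never labelled.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        y = 1 if rng.random() < prior else 0
        # Only true positives can receive a label.
        s = 1 if (y == 1 and rng.random() < label_freq) else 0
        pairs.append((y, s))
    return pairs


pairs = simulate_pu(n=100_000, prior=0.3, label_freq=0.5)

# Fraction of true positives (unknown to the analyst in a real PU problem).
true_prior = sum(y for y, _ in pairs) / len(pairs)

# Fraction of labelled examples, the only directly observable proportion.
labelled_share = sum(s for _, s in pairs) / len(pairs)

print(f"true class prior : {true_prior:.3f}")
print(f"labelled share   : {labelled_share:.3f}")
```

In this sketch the labelled share cannot be turned into a class prior estimate without knowing `label_freq`, which illustrates why estimating the prior from PU data requires more than counting labelled examples.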

Notation and assumptions
Identifiability of class prior
Elkan-Noto estimator
TIcE estimator
Partial matching
Estimating the class prior via logistic regression
MLR estimator and its comparison with the JOINT method
Experiments
Simulation models
Benchmark datasets
Experiment on clinical dataset MIMIC
Conclusions and future work