Abstract
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples when one of the training samples is subject to misclassification or mislabeling, e.g. diseased individuals incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those classified, possibly erroneously, as controls. Two examples are analyzed and discussed. A simulation study explores the properties of the maximum likelihood parameter estimates and of the estimates of the number of mislabeled observations.
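A minimal sketch of the estimation step, under the assumption that mislabeling acts on cases only: each true case keeps its case label with an unknown probability lam = 1 - pi and otherwise appears among the controls, so the observed label follows a "defective" logistic model q(x) = lam * p(x) with p(x) the ordinary logistic probability. The parameterization, function names and use of scipy below are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik(params, X, z):
    # params[0] parameterizes lam on the logit scale; params[1:] are the logistic coefficients
    lam = expit(params[0])                 # fraction of true cases that keep the case label
    p = expit(X @ params[1:])              # ordinary logistic probability of being a true case
    q = lam * p                            # "defective" probability of carrying the case label
    eps = 1e-12                            # guard against log(0)
    return -np.sum(z * np.log(q + eps) + (1.0 - z) * np.log(1.0 - q + eps))

def fit_mislabel_logistic(X, z):
    # z is the observed (possibly mislabeled) 0/1 label vector
    X1 = np.column_stack([np.ones(len(z)), X])     # add an intercept column
    start = np.zeros(X1.shape[1] + 1)
    res = minimize(neg_log_lik, start, args=(X1, z), method="BFGS")
    return expit(res.x[0]), res.x[1:]              # (lam, beta)

def prob_true_case_given_control(x_row, lam, beta):
    # P(true case | labeled control, x) = (1 - lam) * p(x) / (1 - lam * p(x))
    p = expit(beta[0] + np.dot(beta[1:], x_row))
    return (1.0 - lam) * p / (1.0 - lam * p)

The last function illustrates how the fitted parameters can be turned into an estimated probability of true group membership for observations labeled as controls, as described in the abstract.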
Highlights
The problem of discrimination and classification has generated an extensive literature both from the biostatistical/epidemiological and from the machine learning communities
A discrimination rule has to be estimated on the basis of a training sample of n = n0+n1 observations, representative of the two underlying populations; in machine learning terminology this is called supervised learning
Solutions range from classical Fisher Linear Discriminant Analysis and Logistic Regression to kernel-based methods, Random Forests and Support Vector Machines
Summary
The problem of discrimination and classification has generated an extensive literature both from the biostatistical/epidemiological and from the machine learning communities. In its simplest form, two populations, identified by a binary (dependent) variable y (y = 0 for controls or non-cases and y = 1 for cases, e.g. individuals with a specific disease), have to be distinguished on the basis of a set of p (e.g. genetic or behavioral) traits or (independent) (co)variables x = (x1, ..., xp)^T. The role of individual components of x is often of interest, as this may provide insight into the mechanisms that generate values of y; for example, the role of a (mutant) gene in the etiology of a disease. A discrimination rule has to be estimated on the basis of a training sample of n = n0 + n1 observations, representative of the two underlying populations; in machine learning terminology this is called supervised learning. A nice overview is given by James et al. [1].
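For concreteness, a minimal illustration of the standard supervised setting just described, i.e. estimating a logistic discrimination rule from a correctly labeled training sample of n = n0 + n1 observations; the use of scikit-learn and the simulated data are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n0, n1, p = 100, 100, 3                    # training sample sizes and number of covariates
X0 = rng.normal(0.0, 1.0, size=(n0, p))    # controls (y = 0)
X1 = rng.normal(0.5, 1.0, size=(n1, p))    # cases (y = 1), shifted mean for some separation
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

clf = LogisticRegression().fit(X, y)       # the estimated discrimination rule
print(clf.intercept_, clf.coef_)           # roles of the individual covariates
print(clf.predict_proba(X[:5])[:, 1])      # estimated P(y = 1 | x)

The contribution of the paper is to extend this standard setting to the case where some of the observations labeled as controls are in fact mislabeled cases, as outlined in the abstract.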