A Two-Stage Classification for Dealing with Unseen Clusters in the Testing Data

Jung Wun Lee, Ofer Harel

doi:10.6339/24-jds1140

Abstract

Classification is an important statistical tool whose importance has grown since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data do not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a testing observation from such a cluster will be misclassified because a classification rule based on the sample cannot capture a cluster not observed in the training data (sample). To ameliorate the unseen-cluster problem, we suggest a two-stage classification method. We also propose a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
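To make the unseen-cluster problem concrete, here is a minimal Python sketch of a generic two-stage rule. It is not the authors' test or classifier, which the abstract does not specify. It assumes a nearest-centroid classifier and a simple distance threshold for the first stage; the names `fit`, `two_stage_predict`, and `slack`, and the toy data, are all hypothetical.

```python
import math
from statistics import mean

def centroid(points):
    # Coordinate-wise mean of a list of equal-length tuples.
    return tuple(mean(coord) for coord in zip(*points))

def fit(train):
    # train maps each observed label to its training points.
    # For each label, store the centroid and the largest distance
    # from the centroid to any training point (a crude cluster radius).
    model = {}
    for label, pts in train.items():
        c = centroid(pts)
        radius = max(math.dist(c, p) for p in pts)
        model[label] = (c, radius)
    return model

def two_stage_predict(model, x, slack=2.0):
    # Find the nearest observed cluster.
    label, (c, radius) = min(
        model.items(), key=lambda kv: math.dist(kv[1][0], x)
    )
    # Stage 1 (assumed rule): flag x as coming from an unseen cluster
    # when it lies far outside the nearest observed cluster.
    if math.dist(c, x) > slack * radius:
        return "unseen"
    # Stage 2: otherwise apply the ordinary nearest-centroid rule.
    return label

# Toy training sample containing only two of the population's clusters.
train = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "B": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)],
}
model = fit(train)

for x in [(0.2, 0.8), (5.5, 6.0), (10.0, 0.0)]:
    print(x, two_stage_predict(model, x))
```

The point (10.0, 0.0) is far from both observed clusters. A single-stage nearest-centroid rule would be forced to assign it to "B", while the threshold in the first stage flags it instead. The fixed `slack` multiplier is only a stand-in for a formal test of whether an unseen cluster is present.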

Journal: Journal of Data Science
Publication Date: Jan 1, 2024
License type: CC BY 4.0
