Maximizing classifier utility when there are data acquisition and modeling costs

Gary M Weiss, Ye Tian

doi:10.1007/s10618-007-0082-x

Abstract

Classification is a well-studied problem in data mining. Classification performance was originally gauged almost exclusively using predictive accuracy, but as work in the field progressed, more sophisticated measures of classifier utility that better represented the value of the induced knowledge were introduced. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost impacts the total utility of the data mining process. In this article we analyze the relationship between the number of acquired training examples and the utility of the data mining process and, given the necessary cost information, we determine the number of training examples that yields the optimum overall performance. We then extend this analysis to include the cost of model induction, measured in terms of the CPU time required to generate the model. While our cost model does not take into account all possible costs, our analysis provides some useful insights and a template for future analyses using more sophisticated cost models. Because our analysis is based on experiments that acquire the full set of training examples, it cannot directly be used to find a classifier with optimal or near-optimal total utility. To address this issue we introduce two progressive sampling strategies that are empirically shown to produce classifiers with near-optimal total utility.
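The trade-off described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' formulation: the cost model (a per-example acquisition cost plus a per-error misclassification cost over a fixed number of scored examples), the geometric sampling schedule, the stopping rule, and the toy learning curve `toy_error` are all hypothetical, as are the function and parameter names. CPU time for model induction is left out.

```python
from typing import Callable, Iterable, Tuple


def total_cost(n_train: int, error_rate: float, cost_per_example: float,
               cost_per_error: float, n_scored: int) -> float:
    """Assumed cost model: data acquisition cost plus expected error cost."""
    acquisition = n_train * cost_per_example
    misclassification = error_rate * n_scored * cost_per_error
    return acquisition + misclassification


def best_training_size(sizes: Iterable[int], error_at: Callable[[int], float],
                       cost_per_example: float, cost_per_error: float,
                       n_scored: int) -> int:
    """Retrospective analysis: with the whole learning curve known,
    pick the training-set size with the lowest total cost."""
    return min(sizes, key=lambda n: total_cost(
        n, error_at(n), cost_per_example, cost_per_error, n_scored))


def progressive_sampling(error_at: Callable[[int], float], start: int,
                         factor: int, max_size: int, cost_per_example: float,
                         cost_per_error: float, n_scored: int) -> Tuple[int, float]:
    """Hypothetical geometric schedule: grow the training set by `factor`
    and stop as soon as the total cost stops decreasing.
    Simplification: returns the cost of the best model found and ignores
    the extra examples acquired in the final, rejected step."""
    best_n = start
    best_cost = total_cost(start, error_at(start), cost_per_example,
                           cost_per_error, n_scored)
    n = start * factor
    while n <= max_size:
        cost = total_cost(n, error_at(n), cost_per_example,
                          cost_per_error, n_scored)
        if cost >= best_cost:
            break  # total cost went up, so stop acquiring data
        best_n, best_cost = n, cost
        n *= factor
    return best_n, best_cost


def toy_error(n: int) -> float:
    """Made-up learning curve: error falls toward 10% as n grows."""
    return 0.10 + 10.0 / n


if __name__ == "__main__":
    sizes = [100, 200, 400, 800, 1600]
    print(best_training_size(sizes, toy_error, 1.0, 10.0, 1000))
    print(progressive_sampling(toy_error, 100, 2, 1600, 1.0, 10.0, 1000))
```

With these toy numbers both functions settle on 400 training examples: beyond that point, each further doubling of the training set costs more in acquisition than it saves in misclassification errors.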

Journal: Data Mining and Knowledge Discovery
Publication Date: Sep 6, 2007
Citations: 66
