Abstract
Training data are essential for learning classification models. Therefore, when only a limited number of labeled subjects are available as training samples while a considerable amount of unlabeled data already exists, it is desirable to enlarge the training set by labeling more subjects in order to improve classification models. When labeling unlabeled subjects is costly in time and money, it is crucial to know how many labeled subjects are necessary to train a satisfactory classification model. Although active learning methods can gradually recruit new unlabeled subjects and disclose their label information to enlarge the training set, there is little discussion in the literature about the appropriate training sample size. Hence, this paper studies when and how to appropriately stop an active learning procedure. Because active learning procedures recruit subjects sequentially, it is natural to adopt ideas from sequential analysis to determine the training sample size dynamically and adaptively. In this study, we propose a stopping criterion for a linear model-based active learning procedure such that the learning process asymptotically achieves its best possible empirical performance, in terms of the area under the receiver operating characteristic (ROC) curve, when the procedure stops. Other statistical properties of the proposed procedure, including estimation consistency and variable selection, are also studied. Numerical results using both synthetic data and a real example are reported.
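To make the general setting concrete, the following is a minimal sketch of a pool-based active learning loop with a data-driven stopping rule. It is not the paper's sequential-analysis criterion: the uncertainty-sampling query rule, the AUC-plateau stopping condition, and the `window` and `tol` parameters are all illustrative assumptions.

```python
# Illustrative sketch only: an active learning loop that stops when the
# validation AUC stops improving. The paper's actual stopping criterion is
# sequential-analysis based; this plateau rule is a stand-in assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_val, y_pool, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

# Seed the labeled set with a few subjects from each class.
labeled = list(np.where(y_pool == 0)[0][:10]) + list(np.where(y_pool == 1)[0][:10])
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

window, tol, history = 5, 1e-3, []  # assumed tuning parameters
while unlabeled:
    model = LogisticRegression(max_iter=1000).fit(X_pool[labeled],
                                                  y_pool[labeled])
    history.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

    # Stop once the AUC gain over the last `window` acquisitions is negligible.
    if len(history) > window and history[-1] - history[-1 - window] < tol:
        break

    # Uncertainty sampling: query the pool subject closest to the boundary.
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(pick)
    unlabeled.remove(pick)

print(f"stopped after {len(labeled)} labels, validation AUC = {history[-1]:.3f}")
```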