Abstract

Given a labeled dataset that contains a rare (or minority) class of of-interest instances, as well as a large class of instances that are not of interest, how can we learn to recognize future of-interest instances over a continuous stream? The setting differs from traditional classification in that instances from novel minority subclasses might continually emerge over time, and hence it is often referred to as continual, life-long, or open-world classification. We introduce RaRecognize, which (i) estimates a general decision boundary between the rare class and the majority class, (ii) learns to recognize the individual rare subclasses that exist within the training data, and (iii) flags instances from previously unseen rare subclasses as newly emerging (i.e., novel). The learner in (i) is general in the sense that by construction it is dissimilar to the specialized learners in (ii), and thus distinguishes the minority from the majority without overly tuning to what is seen only in the training data. Thanks to this generality, RaRecognize ignores all future instances that it labels as majority and recognizes only the recurring and emerging rare subclasses. This saves effort at test time and ensures that the model size grows moderately over time, as it maintains only specialized minority learners. Overall, we build an end-to-end system which consists of (1) a representation learning component that transforms data instances into suitable vector inputs; (2) a continual classifier that labels incoming instances as majority (not of interest), rare recurrent, or rare emerging; and (3) a clustering component that groups the rare emerging instances into novel subclasses for expert vetting and model re-training. Through extensive experiments, we show that RaRecognize outperforms state-of-the-art baselines on three real-world datasets that contain documents related to corporate risk and (natural and man-made) disasters as rare classes.
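
The three-way labeling described above (majority, rare recurrent, rare emerging) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function name `route`, the use of plain linear scorers, the zero decision thresholds, the toy weights, and the subclass names `fire` and `flood`. The abstract does not specify the form of the learners or how they are trained, and the clustering of emerging instances is not shown.

```python
from typing import Dict, Optional, Sequence, Tuple


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Plain dot product between a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def route(
    x: Sequence[float],
    general: Sequence[float],
    specialists: Dict[str, Sequence[float]],
    tau: float = 0.0,
) -> Tuple[str, Optional[str]]:
    """Label one instance as majority, rare-recurrent, or rare-emerging.

    Hypothetical decision rule (an assumption, not the paper's method):
      1. If the general scorer does not fire, the instance is majority
         and no specialist is consulted.
      2. Otherwise, if the best specialist score exceeds tau, the instance
         belongs to that known rare subclass.
      3. Otherwise it is rare but matches no known subclass: emerging.
    """
    if dot(general, x) <= 0.0:
        return ("majority", None)

    best_name: Optional[str] = None
    best_score = float("-inf")
    for name, w in specialists.items():
        score = dot(w, x)
        if score > best_score:
            best_name, best_score = name, score

    if best_score > tau:
        return ("rare-recurrent", best_name)
    return ("rare-emerging", None)


# Toy weights over features (x1, x2, bias); the last feature is a constant 1.
GENERAL = [1.0, 1.0, -1.0]            # rare if x1 + x2 > 1
SPECIALISTS = {
    "fire": [1.0, -1.0, -0.5],        # known subclass: x1 - x2 > 0.5
    "flood": [-1.0, 1.0, -0.5],       # known subclass: x2 - x1 > 0.5
}

if __name__ == "__main__":
    for point in ([0.2, 0.3, 1.0], [2.0, 0.1, 1.0], [1.0, 1.0, 1.0]):
        print(point, route(point, GENERAL, SPECIALISTS))
```

In this toy setup, an instance rejected by the general scorer never reaches the specialists, which mirrors the test-time saving the abstract mentions. An instance that passes the general scorer but matches no specialist is the kind that would be handed to the clustering component for expert vetting.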
