Abstract

Identifying statistically significant anomalies in an unlabeled data set is of key importance in many applications such as financial security and remote sensing. Rare category detection (RCD) helps address this issue by passing candidate data examples to a labeling oracle (e.g., a human expert) for labeling. A challenging task in RCD is to discover all categories without any prior information about the given data set. A few approaches have been proposed to address this issue, which are on quadratic or cubic time complexities w.r.t. the data set size N and require considerable labeling queries involving time-consuming and expensive labeling effort of a human expert. In this paper, aiming at solutions with lower time complexity and less labeling queries, we propose two prior-free (i.e., without any prior information about a given data set) RCD algorithms, namely (1) iFRED which achieves linear time complexity w.r.t. N, and (2) vFRED which substantially reduces the number of labeling queries. This is done by tabulating each dimension of the data set into bins, followed by zooming out to shrink each bin down to a position and conducting wavelet analysis on the data density function to fast locate the position (i.e., a bin) of a rare category, and zooming in the located bin to select candidate data examples for labeling. Theoretical analysis guarantees the effectiveness of our algorithms, and comprehensive experiments on both synthetic and real data sets further verify the effectiveness and efficiency.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.