Abstract

One of the key challenges of machine learning-based anomaly detection lies in the difficulty of obtaining anomaly data for training: such data are usually rare, diversely distributed, and hard to collect. To address this challenge, we formulate anomaly detection as a Positive and Unlabeled (PU) learning problem, where only labeled positive (normal) data and unlabeled (normal and anomaly) data are required to learn an anomaly detector. As a semi-supervised learning method, it does not require labeled anomaly data for training and is therefore easily deployed in various applications. Because the unlabeled data can be extremely imbalanced, we introduce a novel PU learning method that can handle the situation where the unlabeled set is composed mostly of positive instances. We first use a linear model to extract the most reliable negative instances, followed by a self-learning process that adds reliable negative and positive instances at different speeds based on the estimated positive class prior. Furthermore, when feedback is available, we adopt boosting in the self-learning process to exploit the instability of PU learning to our advantage. The classifiers produced during self-learning are combined with weights based on their estimated error rates to build the final classifier. Extensive experiments on six real datasets and one synthetic dataset show that our methods outperform existing methods under a variety of conditions.
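A minimal sketch of the two-stage procedure described above, in Python with scikit-learn. The class-prior value `pi`, the per-round fraction `step`, the number of rounds, and the choice of logistic regression as the linear model are all illustrative assumptions, not the paper's exact settings; the boosting-style weighted combination of the per-round classifiers is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_self_learning(X_pos, X_unl, pi=0.8, step=0.05, n_rounds=10):
    """Sketch of self-learning PU anomaly detection.

    X_pos : labeled positive (normal) instances
    X_unl : unlabeled instances (mixed normal and anomaly)
    pi    : assumed positive class prior of the unlabeled set
    step  : fraction of unlabeled points labeled per round
    """
    P = np.asarray(X_pos, dtype=float)
    U = np.asarray(X_unl, dtype=float)

    # Stage 1: fit P against all of U (treated as negative) and keep
    # the lowest-scoring unlabeled points as reliable negatives.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.vstack([P, U]), np.r_[np.ones(len(P)), np.zeros(len(U))])
    k = max(1, int(step * len(U)))
    idx = np.argsort(clf.predict_proba(U)[:, 1])[:k]  # lowest P(positive)
    N, U = U[idx], np.delete(U, idx, axis=0)

    # Stage 2: self-learning; positives are added faster than negatives
    # when the unlabeled set is mostly positive (pi close to 1).
    for _ in range(n_rounds):
        k_pos = max(1, int(step * pi * len(U)))
        k_neg = max(1, int(step * (1 - pi) * len(U)))
        if len(U) < k_pos + k_neg:
            break
        clf = LogisticRegression(max_iter=1000)
        clf.fit(np.vstack([P, N]), np.r_[np.ones(len(P)), np.zeros(len(N))])
        order = np.argsort(clf.predict_proba(U)[:, 1])
        neg_idx, pos_idx = order[:k_neg], order[-k_pos:]
        P = np.vstack([P, U[pos_idx]])
        N = np.vstack([N, U[neg_idx]])
        U = np.delete(U, np.r_[neg_idx, pos_idx], axis=0)

    # The paper weight-combines the per-round classifiers by estimated
    # error rate; this sketch simply returns the last one.
    return clf
```

In use, `X_pos` would hold the labeled normal instances of a dataset and `X_unl` the remaining mixed data, with `pi` supplied by whatever class-prior estimator is available.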
