Risky Host Detection with Bias Reduced Semi-Supervised Learning

Shuning Wu, Ningwei Liu, Ligang Zhang, Joel Fulton, Charles Feng

doi:10.1145/3349341.3349365

Abstract

To ensure the cyber security of an enterprise, a SIEM (Security Information and Event Management) system is typically in place to flag alerts and assign each of them a severity score based on pre-determined rules. Analysts in the security operations center (SOC) investigate the high-severity alerts to decide whether they are truly malicious. However, the number of alerts is generally overwhelming, far exceeding the SOC's capacity to handle them, and the majority of them are false positives. There is therefore a great need for a machine learning system that accurately detects risky hosts. Traditional supervised learning algorithms cannot be applied directly to this problem, because very few risky hosts (positive labels) are identified and the positive labels are biased: SOC analysts investigate only high-severity alerts. In this paper, we propose a new distance-based Positive Unlabeled (PU) learning approach. It uses four different distances to measure similarity to the positive labels and a Gaussian copula function to capture their correlation structure, combining the four distance measures into one joint probability density that can be used directly to infer new labels. The new approach has the advantage of significantly reducing the bias of the inferred labels, whereas traditional supervised PU learning increases bias. To quantify bias, we also propose a new bias estimation method. We apply the new bias-reducing PU learning system to detect host risk in cyber security. Results on real enterprise data indicate that the proposed PU learning detects risky hosts effectively while greatly reducing label bias. A two-dimensional t-SNE visualization also shows that the labels from distance-based PU learning are more evenly distributed, with higher Kozachenko-Leonenko entropy.
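
The copula step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the four distances or say how the density is turned into labels, so the sketch takes a precomputed distance matrix (one row per host, one column per distance) and stops at the Gaussian copula log-density. The rank-based marginals, the shrinkage toward the identity matrix, and all function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def pseudo_observations(column):
    """Map one distance column into (0, 1) via ranks: u = rank / (n + 1)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def correlation(Z, shrink=0.05):
    """Pearson correlation of the normal scores, shrunk toward the identity
    (an assumption) so highly correlated distances keep R invertible."""
    n, d = len(Z), len(Z[0])
    mean = [sum(row[j] for row in Z) / n for j in range(d)]
    cov = [[sum((r[j] - mean[j]) * (r[k] - mean[k]) for r in Z) / n
            for k in range(d)] for j in range(d)]
    return [[(1 - shrink) * cov[j][k] / math.sqrt(cov[j][j] * cov[k][k])
             + (shrink if j == k else 0.0)
             for k in range(d)] for j in range(d)]


def cholesky(R):
    """Lower-triangular L with L L^T = R, for a positive definite R."""
    d = len(R)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L


def copula_logpdf(z, L):
    """Gaussian copula log-density:
    log c = -0.5 * log|R| - 0.5 * z^T (R^{-1} - I) z, with R = L L^T."""
    y = []
    for i in range(len(z)):  # forward substitution solves L y = z
        y.append((z[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(z)))
    quad = sum(v * v for v in y) - sum(v * v for v in z)
    return -0.5 * log_det - 0.5 * quad


def fit_and_score(D):
    """D[i][j] is distance j from host i to the labeled positives.
    Returns one copula log-density per host."""
    inv_cdf = NormalDist().inv_cdf
    columns = [pseudo_observations([row[j] for row in D])
               for j in range(len(D[0]))]
    Z = [[inv_cdf(columns[j][i]) for j in range(len(columns))]
         for i in range(len(D))]
    L = cholesky(correlation(Z))
    return [copula_logpdf(z, L) for z in Z]


# Hypothetical toy input: 5 hosts, 4 distance measures each.
toy_distances = [
    [0.2, 0.5, 0.1, 0.3],
    [1.4, 2.0, 0.9, 0.8],
    [0.7, 0.6, 0.4, 0.9],
    [2.1, 2.5, 1.8, 1.2],
    [0.9, 1.1, 1.0, 0.4],
]
scores = fit_and_score(toy_distances)
```

The copula density captures only the dependence among the four distances. A full joint density would also multiply in the marginal density of each distance, which this sketch leaves out.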

Full Text