Con2Mix (Contrastive Double Mixup) is a new semi-supervised learning methodology that innovates a triplet mixup data augmentation approach for finding code vulnerabilities in imbalanced, tabular security data sets. Tabular data sets in cybersecurity domains are widely known to pose challenges for machine learning because of their heavily imbalanced data (e.g., a small number of labeled attack samples buried in a sea of mostly benign, unlabeled data). Semi-supervised learning leverages a small subset of labeled data and a large subset of unlabeled data to train a learning model. While semi-supervised methods have been well studied in image and language domains, in security domains they remain underutilized, especially on tabular security data sets which pose especially difficult contextual information loss and balance challenges for machine learning. Experiments applying Con2Mix to collected security data sets show promise for addressing these challenges, achieving state-of-the-art performance on two evaluated data sets compared with other methods.
Read full abstract