Unsupervised Labeling and Extraction of Phrase-based Concepts in Vulnerability Descriptions

Sofonias Yitagesu, Linyi Han, Zhenchang Xing, Zhiyong Feng, Xiaohong Li, Xiaowang Zhang

doi:10.1109/ase51524.2021.9678638

Abstract

People usually describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for the automatic analysis of vulnerabilities. Automatic extraction of key vulnerability aspects is highly desirable but demands significant effort to manually label data for model training. In this paper, we propose an unsupervised approach to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on three types of phrase-based vulnerability concepts (root cause, attack vector, and impact) as they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product, and version). Our approach is based on a key observation: phrases of the same type, no matter how they differ in sentence structure and phrase expression, usually share syntactically similar paths in the sentence parse trees. Therefore, we propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode such syntactic similarities. To address the discrete nature of our paths, we enhance the traditional Variational Auto-encoder (VAE) with the Gumbel-Max trick for categorical data distributions, and thus create a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we further use FIt-TSNE and clustering techniques to generate clusters of concepts of the same type. Our evaluation confirms the effectiveness of our CaVAE for encoding path representations and the accuracy of the vulnerability concepts in the resulting clusters. In a concept classification task, the vulnerability concepts labeled by our unsupervised approach outperform the two manually labeled datasets from previous work.
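The Gumbel-Max trick mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' CaVAE implementation; it only shows the sampling identity the trick rests on: adding independent Gumbel(0, 1) noise to a vector of logits and taking the argmax yields a draw from the corresponding categorical (softmax) distribution. The function names `gumbel_max_sample` and `empirical_frequencies`, and the example probabilities, are hypothetical and chosen for illustration.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one category index from softmax(logits) using the Gumbel-Max trick.

    Hypothetical helper for illustration; not taken from the paper.
    """
    best_index = 0
    best_score = float("-inf")
    for index, logit in enumerate(logits):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        # Inverse-CDF transform of a uniform draw into Gumbel(0, 1) noise.
        gumbel_noise = -math.log(-math.log(u))
        score = logit + gumbel_noise
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def empirical_frequencies(logits, num_samples, seed):
    """Estimate category frequencies by repeated Gumbel-Max sampling."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_samples):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [count / num_samples for count in counts]
```

Calling `empirical_frequencies([math.log(0.7), math.log(0.2), math.log(0.1)], 20000, seed=0)` should return frequencies close to the target probabilities 0.7, 0.2, and 0.1. In a trainable model the hard argmax is typically replaced by a temperature-controlled softmax relaxation so that gradients can flow; whether and how the paper does this is not stated in the abstract.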

Full Text