Software vulnerabilities, once disclosed, can be documented in vulnerability databases, which have great potential to advance vulnerability analysis and security research. People describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for the automatic analysis of the vulnerability knowledge embedded in text. Automatic extraction of key vulnerability aspects is highly desirable, but it demands significant manual data-labeling effort for model training. In this article, we propose unsupervised methods to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on six types of phrase-based vulnerability concepts (vulnerability type, vulnerable component, root cause, attacker type, impact, and attack vector), as they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product, and version). Our approach rests on a key observation: phrases of the same concept type, no matter how they differ in sentence structure and phrasing, usually share syntactically similar paths in the sentence parsing trees. Specifically, we present a source-target neural architecture that learns Part-of-Speech (POS) tagging to identify a token's functional role within TVDs: the source model is trained to capture common features of the TVD corpus, and the target model is trained to identify linguistically malformed words specific to the security domain. Our evaluation confirms that the proposed tagger outperforms taggers designed for general natural language by 4.45%–5.98% and handles a broad range of TVDs and natural-language content. Building on the key observation, we then propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode their syntactic similarities.
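To illustrate the two path representations, the sketch below (hypothetical Python; the toy parse tree, example sentence, and function names are our own, not the paper's implementation) computes an absolute path as the root-to-token label sequence in a constituency parse tree, and a relative path between two tokens as the walk up to their lowest common ancestor and back down.

```python
# Minimal sketch of absolute and relative parse-tree paths.
# A tree node is a (label, child, ...) tuple; a leaf is (POS_tag, word).

def absolute_path(tree, word, path=()):
    """Return the root-to-leaf label sequence ending at `word`, or None."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        # Leaf node: (POS_tag, word).
        return path + (label,) if children[0] == word else None
    for child in children:
        found = absolute_path(child, word, path + (label,))
        if found:
            return found
    return None

def relative_path(tree, w1, w2):
    """Label sequence from w1 up to the lowest common ancestor, then down to w2."""
    p1, p2 = absolute_path(tree, w1), absolute_path(tree, w2)
    k = 0
    while k < min(len(p1), len(p2)) and p1[k] == p2[k]:
        k += 1  # k >= 1 since both paths share the root label
    return tuple(reversed(p1[k:])) + (p1[k - 1],) + p2[k:]

# Toy parse of "remote attacker executes arbitrary code".
sent = ("S",
        ("NP", ("JJ", "remote"), ("NN", "attacker")),
        ("VP", ("VBZ", "executes"),
               ("NP", ("JJ", "arbitrary"), ("NN", "code"))))

print(absolute_path(sent, "attacker"))        # ('S', 'NP', 'NN')
print(absolute_path(sent, "code"))            # ('S', 'VP', 'NP', 'NN')
print(relative_path(sent, "attacker", "code"))
```

Under this representation, an attacker-type phrase and an impact phrase in two differently worded TVDs can still yield near-identical label sequences, which is the syntactic similarity the auto-encoder is meant to capture.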
To address the discrete nature of our paths, we enhance the traditional Variational Auto-Encoder (VAE) with the Gumbel-Max trick for categorical data distributions, creating a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we then apply unsupervised clustering techniques to group concepts of the same type. Our evaluation confirms the effectiveness of the CaVAE: it achieves a small log-likelihood (85.85) for encoding path representations, and the resulting clusters contain vulnerability concepts with 83%–89% accuracy. The clusters thus label the six types of vulnerability concepts from a TVD corpus in an unsupervised way. Furthermore, the labeled concepts can be mapped back to the corresponding phrases in the original TVDs, yielding TVDs annotated with the six concept types, which in turn can be used to train concept-extraction models for other TVD corpora. In this work, we present two concept-extraction methods (a concept-classification model and a sequence-labeling model) to demonstrate the utility of the unsupervisedly labeled concepts. Our study shows that models trained on our unsupervisedly labeled vulnerability concepts outperform those trained on two manually labeled TVD datasets from previous work by 3.9%–5.14%, owing to the consistent concept boundaries and typing produced by our unsupervised labeling method.
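The Gumbel-Max trick underlying the CaVAE can be sketched as follows. This is an illustrative, stdlib-only Python sketch under our own assumptions, not the paper's implementation: `gumbel_max_sample` draws an exact categorical sample as argmax_i (logit_i + g_i) with g_i ~ Gumbel(0, 1), and `gumbel_softmax` is the standard temperature-controlled softmax relaxation commonly used to make such sampling differentiable when training categorical VAEs.

```python
import math
import random

def sample_gumbel(rng):
    # g ~ Gumbel(0, 1) via inverse transform of a uniform draw.
    u = rng.random() or 1e-12  # guard against log(0)
    return -math.log(-math.log(u))

def gumbel_max_sample(logits, rng):
    """Exact categorical sample: argmax_i (logit_i + g_i)."""
    noisy = [l + sample_gumbel(rng) for l in logits]
    return noisy.index(max(noisy))

def gumbel_softmax(logits, tau, rng):
    """Differentiable relaxation: softmax((logits + gumbels) / tau).
    As tau -> 0 the output approaches a one-hot sample."""
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(n - m) for n in noisy]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(42)
logits = [2.0, 0.0, -2.0]          # unnormalized class scores
counts = [0, 0, 0]
for _ in range(2000):
    counts[gumbel_max_sample(logits, rng)] += 1
print(counts)                       # empirical frequencies follow softmax(logits)
print(gumbel_softmax(logits, 0.5, rng))
```

The argmax in Gumbel-Max is non-differentiable, so VAE training typically uses the softmax relaxation during backpropagation and can anneal `tau` toward zero to recover near-discrete samples.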