Security vulnerabilities in software are a root cause of cyberattacks. Because these defects carry substantial costs, they should be proactively detected and resolved before the software ships. Data-driven approaches such as Artificial Intelligence (AI) are widely explored for automatic vulnerability detection, given their potential to leverage large-scale vulnerability data feeds and learn from them. This work introduces a novel Proximal Instance Aggregator (PIA) neural network for accurately capturing insecure C code patterns from the Abstract Syntax Tree (AST). It builds on the concept of Multiple Instance Learning (MIL), which treats the AST representation of the code as a ‘bag’ of tree path ‘instances’. A security vulnerability can manifest in one or more such AST path instances. The PIA model dynamically learns a set of abstract concepts that describe the patterns associated with the AST paths. Specifically, the vulnerable nature of an AST path is characterized by its proximity to these concepts. The model also employs a self-attention mechanism to generate deep representations. By cross-correlating features across the path instances, self-attention robustly weighs the relevance of each AST path to vulnerability classification. The MIL component uses these deep feature sets to construct the concept space. Thus, even without explicit supervision for localizing the defective line, the model automatically learns AST instance classification in a weakly supervised manner. Since the AST-level prediction is formed as an aggregation of instance classifications, the model is inherently explainable. The model outperforms state-of-the-art methods by a fair margin. It achieves 95.63% detection accuracy and 95.65% F1-score on the benchmark NIST SARD and NVD datasets for a range of vulnerabilities.
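The pipeline described in the abstract (self-attention over AST path instances, proximity of each instance to a set of concepts, and aggregation of instance scores into a bag-level prediction) can be illustrated with a minimal sketch. This is not the PIA architecture itself: the identity attention projections, the Gaussian proximity function, the max aggregation, and all names and values (`score_bag`, `concepts`, `concept_weights`, the toy path embeddings) are assumptions chosen for illustration, and nothing here is trained.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(instances):
    # Scaled dot-product self-attention. Assumption: identity query/key/value
    # projections stand in for learned weight matrices.
    scale = math.sqrt(len(instances[0]))
    out = []
    for q in instances:
        weights = softmax([dot(q, k) / scale for k in instances])
        out.append([sum(w * v[d] for w, v in zip(weights, instances))
                    for d in range(len(q))])
    return out

def proximity(h, concept):
    # Assumption: Gaussian-style proximity, equal to 1 when h matches the concept.
    sq_dist = sum((x - c) ** 2 for x, c in zip(h, concept))
    return math.exp(-sq_dist)

def score_bag(instances, concepts, concept_weights, bias=0.0):
    # Instance score: sigmoid of a weighted sum of proximities to each concept.
    # Bag score: max over instance scores (a common MIL aggregation, assumed here).
    feats = self_attention(instances)
    inst_scores = []
    for h in feats:
        z = bias + sum(w * proximity(h, c)
                       for w, c in zip(concept_weights, concepts))
        inst_scores.append(1.0 / (1.0 + math.exp(-z)))
    return max(inst_scores), inst_scores

# Toy bag: three hypothetical 2-D path embeddings; the third lies near the
# concept that is given a positive (vulnerable) weight.
paths = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.0]]
concepts = [[2.0, 2.0], [0.0, 0.0]]
concept_weights = [4.0, -4.0]
bag, per_path = score_bag(paths, concepts, concept_weights)
suspect = per_path.index(max(per_path))  # index of the highest-scoring path
```

Because the bag score is derived from per-path scores, the highest-scoring path (`suspect`) points at the instance that drove the prediction, which is the sense in which an instance-aggregating model of this kind is explainable without line-level labels.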