Snopy: Bridging Sample Denoising with Causal Graph Learning for Effective Vulnerability Detection

Abstract

Deep Learning (DL) has emerged as a promising means for vulnerability detection due to its ability to automatically derive features from vulnerable code. Unfortunately, current solutions struggle to focus on the vulnerability-related parts of vulnerable functions and tend to exploit spurious correlations for prediction, which undermines their effectiveness in practice. In this paper, we propose Snopy, a novel DL-based approach that bridges sample denoising with causal graph learning to capture real vulnerability patterns from noisy vulnerable samples for effective detection. Specifically, Snopy adopts a change-based sample denoising approach to automatically weed out vulnerability-irrelevant code elements in vulnerable functions without sacrificing label accuracy. Then, Snopy constructs a novel Causality-Aware Graph Attention Network (CA-GAT) with a Feature Caching Scheme (FCS) to learn causal vulnerability features while maintaining efficiency. Experiments on three public benchmark datasets show that Snopy outperforms the state-of-the-art baselines by an average of 27.22%, 85.89%, and 75.50% in terms of F1-score, respectively.
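
The abstract does not spell out how the change-based denoising works. As a rough illustration of the general idea only, the sketch below assumes that a fix commit is available for each vulnerable function and treats the lines the fix removed or rewrote as the vulnerability-related part. The function name `changed_lines` and the line-level granularity are assumptions made for this example, not details taken from Snopy.

```python
import difflib


def changed_lines(vulnerable: str, patched: str) -> list[str]:
    """Return lines of the vulnerable version that the fix removed or rewrote.

    Hypothetical helper: a line-level stand-in for change-based denoising,
    not the procedure described in the paper.
    """
    before = vulnerable.splitlines()
    after = patched.splitlines()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    kept = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark spans of the old version touched by the fix.
        if tag in ("replace", "delete"):
            kept.extend(before[i1:i2])
    return kept


vulnerable = """int f(char *s) {
  char buf[8];
  strcpy(buf, s);
  return 0;
}"""

patched = """int f(char *s) {
  char buf[8];
  strncpy(buf, s, sizeof(buf) - 1);
  return 0;
}"""

print(changed_lines(vulnerable, patched))
```

Here only the `strcpy` line survives, while the unchanged declaration and return statement are dropped as noise. A real pipeline would presumably work on richer code elements than raw lines.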

Similar Papers
  • Research Article
  • 10.4172/2329-9002.1000e113
Causal Genomic and Epigenomic Network Analysis emerges as a New Generation of Genetic Studies of Complex Diseases.
  • May 1, 2013
  • Journal of phylogenetics & evolutionary biology
  • Momiao Xiong

  • Research Article
  • Cited by 2
  • 10.1145/3674729
DSHGT: Dual-Supervisors Heterogeneous Graph Transformer—A Pioneer Study of Using Heterogeneous Graph Learning for Detecting Software Vulnerabilities
  • Nov 22, 2024
  • ACM Transactions on Software Engineering and Methodology
  • Tiehua Zhang + 6 more

Vulnerability detection is a critical problem in software security and attracts growing attention from both academia and industry. Traditionally, software security is safeguarded by designated rule-based detectors that rely heavily on empirical expertise, requiring tremendous effort from software experts to generate rule repositories for large code corpora. Recent advances in deep learning, especially Graph Neural Networks (GNN), have uncovered the feasibility of automatic detection of a wide range of software vulnerabilities. However, prior learning-based works only break programs down into a sequence of word tokens for extracting contextual features of code, or apply GNN largely on homogeneous graph representations (e.g., AST) without discerning complex types of underlying program entities (e.g., methods, variables). In this work, we are one of the first to explore heterogeneous graph representation in the form of Code Property Graph and adapt a well-known heterogeneous graph network with a dual-supervisor structure for the corresponding graph learning task. Using the prototype built, we have conducted extensive experiments on both synthetic datasets and real-world projects. Compared with the state-of-the-art baselines, the results demonstrate superior performance in vulnerability detection (average F1 improvements over 10% in real-world projects) and language-agnostic transferability from C/C++ to other programming languages (average F1 improvements over 11%).

  • Research Article
  • Cited by 10
  • 10.1609/aaai.v38i15.29566
Federated Causality Learning with Explainable Adaptive Optimization
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Dezhi Yang + 5 more

Discovering the causality from observational data is a crucial task in various scientific domains. With increasing awareness of privacy, data are not allowed to be exposed, and it is very hard to learn causal graphs from dispersed data, since these data may have different distributions. In this paper, we propose a federated causal discovery strategy (FedCausal) to learn the unified global causal graph from decentralized heterogeneous data. We design a global optimization formula to naturally aggregate the causal graphs from client data and constrain the acyclicity of the global graph without exposing local data. Unlike other federated causal learning algorithms, FedCausal unifies the local and global optimizations into a complete directed acyclic graph (DAG) learning process with a flexible optimization objective. We prove that this optimization objective has a high interpretability and can adaptively handle homogeneous and heterogeneous data. Experimental results on synthetic and real datasets show that FedCausal can effectively deal with non-independently and identically distributed (non-iid) data and has a superior performance.
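
To make the acyclicity requirement concrete, here is a minimal sketch that checks whether a set of directed edges forms a DAG, using Kahn's topological-sort algorithm. It only illustrates the property being constrained. The abstract does not say how FedCausal enforces acyclicity during optimization, and the function name `is_acyclic` is an assumption for this example.

```python
from collections import deque


def is_acyclic(edges: list[tuple[str, str]]) -> bool:
    """Return True if the directed graph given by (source, target) edges has no cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Repeatedly peel off nodes that have no remaining incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes that were never peeled off lie on, or downstream of, a cycle.
    return visited == len(nodes)


print(is_acyclic([("A", "B"), ("B", "C")]))
print(is_acyclic([("A", "B"), ("B", "C"), ("C", "A")]))
```

The first call describes a chain and passes the check. The second adds an edge back to `A` and fails it.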

  • Research Article
  • Cited by 11
  • 10.1186/s12911-024-02510-6
Developing a novel causal inference algorithm for personalized biomedical causal graph learning using meta machine learning
  • May 27, 2024
  • BMC medical informatics and decision making
  • Hang Wu + 2 more

Background: Modeling causality through graphs, referred to as causal graph learning, offers an appropriate description of the dynamics of causality. The majority of current machine learning models in clinical decision support systems only predict associations between variables, whereas causal graph learning models causality dynamics through graphs. However, building personalized causal graphs for each individual is challenging due to the limited amount of data available for each patient. Method: In this study, we present a new algorithmic framework using meta-learning for learning personalized causal graphs in biomedicine. Our framework extracts common patterns from multiple patient graphs and applies this information to develop individualized graphs. In multi-task causal graph learning, the proposed optimized initial guess of shared commonality enables the rapid adoption of knowledge to new tasks for efficient causal graph learning. Results: Experiments on one real-world biomedical causal graph learning benchmark data and four synthetic benchmarks show that our algorithm outperformed the baseline methods. Our algorithm can better understand the underlying patterns in the data, leading to more accurate predictions of the causal graph. Specifically, we reduce the structural hamming distance by 50-75%, indicating an improvement in graph prediction accuracy. Additionally, the false discovery rate is decreased by 20-30%, demonstrating that our algorithm made fewer incorrect predictions compared to the baseline algorithms. Conclusion: To the best of our knowledge, this is the first study to demonstrate the effectiveness of meta-learning in personalized causal graph learning and cause inference modeling for biomedicine. In addition, the proposed algorithm can also be generalized to transnational research areas where integrated analysis is necessary for various distributions of datasets, including different clinical institutions.

  • Research Article
  • Cited by 67
  • 10.3390/s22093581
A Novel Smart Contract Vulnerability Detection Method Based on Information Graph and Ensemble Learning.
  • May 8, 2022
  • Sensors
  • Lejun Zhang + 6 more

Blockchain presents a chance to address the security and privacy issues of the Internet of Things; however, blockchain itself has certain security issues. How to accurately identify smart contract vulnerabilities is one of the key issues at hand. Most existing methods require large-scale data support to avoid overfitting; machine learning (ML) models trained on small-scale vulnerability data often struggle to produce satisfactory results in smart contract vulnerability prediction. However, in the real world, collecting contract vulnerability data requires huge human and time costs. To alleviate these problems, this paper proposes an ensemble learning (EL)-based contract vulnerability prediction method, which is based on seven different neural networks using contract vulnerability data for contract-level vulnerability detection. Seven neural network (NN) models were first pretrained using an information graph (IG) consisting of source datasets, and were then integrated into an ensemble model called Smart Contract Vulnerability Detection method based on Information Graph and Ensemble Learning (SCVDIE). The effectiveness of the SCVDIE model was verified using a target dataset composed of IG, and its performance was then compared with static tools and seven independent data-driven methods. The verification and comparison results show that the proposed SCVDIE method has higher accuracy and robustness than other data-driven methods in the target task of predicting smart contract vulnerabilities.

  • Research Article
  • 10.1145/3777420
FORTIFY: Feature-Oriented Representation and Graph Topology Integration for Path-Level Vulnerability Detection
  • Nov 15, 2025
  • ACM Transactions on Architecture and Code Optimization
  • Pingchuan Ma + 5 more

Source code vulnerability detection via graph learning is one of the most important approaches to maintaining software security, as it enables structural analysis of semantic dependencies within programs. However, it may suffer from problems with vulnerability coverage, semantic sparsity, and trigger path identification, especially when the vulnerabilities do not involve API/library calls. In this paper, we present FORTIFY, a graph learning framework that couples feature representation tightly with program topology to perform path-level vulnerability detection. Beginning with a program dependence graph, FORTIFY reconstructs its Sliced Combined Graph (SCG) using program slicing with diverse edges. The SCG is then generated as a weighted edge hypergraph, enabling the model to capture both local semantic and structural relationships. Through path embeddings, we introduce an adaptive hyperedge-aware strategy to allocate high-capacity vectors reaching security-sensitive nodes. A relation-aware graph convolutional network, equipped with risk-sensitive attention and an Information Noise Contrastive Estimation (InfoNCE) objective, further amplifies the weights of high-risk paths. Experimental results on the publicly available datasets (i.e., SARD, NVD, and FFmpeg-Vul) show that FORTIFY can identify the execution paths of vulnerabilities. We also test it on real-world software such as the PX4 open-source drone, where it finds control-type vulnerabilities, verifying that FORTIFY can be used for the analysis of programs including unmanned agents. The implementation of FORTIFY is publicly available at https://github.com/ACoTAI/FORTIFY.

  • Conference Article
  • Cited by 5
  • 10.1109/icdm.2019.00042
Bi-directional Causal Graph Learning through Weight-Sharing and Low-Rank Neural Network
  • Nov 1, 2019
  • Hao Huang + 2 more

Discovering the causal graph in multivariate time series data is of great importance for industrial society, yet challenging due to the unknown nonlinearity in the data. Existing works only explore the data in chronological order, and rely on pre-assumed kernels or certain distribution assumptions. In this paper, we present a Bi-directional neural network for Causal Graph Learning (Bi-CGL) through weight-sharing and low-rank neural network. It discovers the causal graph by simultaneously exploring input in forward and reverse chronological order. Both directions approach the same causal graph with shared low-rank approximation, which provides robustness and better accuracy against data noise. Experiments on synthetic and real-world datasets show that Bi-CGL outperforms existing baselines.

  • Conference Article
  • Cited by 6
  • 10.1109/acssc.2015.7421273
Causality graph learning on cortical information flow in Parkinson's disease patients during behaviour tests
  • Nov 1, 2015
  • Abdulaziz Almalaq + 5 more

Electroencephalograph (EEG) signals of the human brain represent electrical activity for a number of channels recorded over the scalp. The main purpose of this paper is to investigate the interactions and causality of different parts of the brain using EEG signals recorded while subjects performed verbal fluency tasks. Subjects who have Parkinson's Disease (PD) have difficulties with mental tasks, such as switching between one behavior task and another. The behavior tasks include motor and phonemic fluency. This method uses verbal generation skills, activating different Broca's areas of the Brodmann's areas (BA44 and BA45). Advanced signal processing techniques are used to determine the activated frequency bands in the Granger causality for verbal fluency tasks. The graph learning technique for channel strength is used to characterize the complex graph of Granger causality. Also, the support vector machine (SVM) method is used for training a classifier between two subjects with PD and two healthy controls. Neural data from the study were recorded at the Colorado Neurological Institute (CNI).

  • Research Article
  • Cited by 1
  • 10.1609/aaai.v39i15.33795
DCILP: A Distributed Approach for Large-Scale Causal Structure Learning
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Shuyu Dong + 6 more

Causal learning tackles the computationally demanding task of estimating causal graphs. This paper introduces a new divide-and-conquer approach for causal graph learning, called DCILP. In the divide phase, the Markov blanket MB(Xi) of each variable Xi is identified, and causal learning subproblems associated with each MB(Xi) are independently addressed in parallel. This approach benefits from a more favorable ratio between the number of data samples and the number of variables considered. On the other hand, it can be adversely affected by the presence of hidden confounders, as variables external to MB(Xi) might influence those within it. The reconciliation of the local causal graphs generated during the divide phase is a challenging combinatorial optimization problem, especially in large-scale applications. The main novelty of DCILP is an original formulation of this reconciliation as an integer linear programming (ILP) problem, which can be delegated to and efficiently handled by an ILP solver. Through experiments on medium to large scale graphs, and comparisons with state-of-the-art methods, DCILP demonstrates significant improvements in terms of computational complexity, while preserving the learning accuracy on real-world problems and suffering at most a slight loss of accuracy on synthetic problems.
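
For readers unfamiliar with the term, the Markov blanket of a variable in a DAG consists of its parents, its children, and the other parents of its children. The sketch below reads that set off a known graph purely to illustrate the definition. DCILP has to estimate the blankets from data, which this example does not attempt, and the `parents` mapping and function name are assumptions for the example.

```python
def markov_blanket(parents: dict[str, set[str]], node: str) -> set[str]:
    """Return the Markov blanket of `node` in a DAG given as a child -> parents map."""
    children = {child for child, ps in parents.items() if node in ps}
    # Co-parents ("spouses"): the other parents of each child of `node`.
    spouses = {p for child in children for p in parents[child]}
    return (parents.get(node, set()) | children | spouses) - {node}


# Toy DAG (hypothetical): A -> B -> C <- D, and C -> E.
parents = {
    "A": set(),
    "B": {"A"},
    "C": {"B", "D"},
    "D": set(),
    "E": {"C"},
}

print(sorted(markov_blanket(parents, "B")))
```

For `B`, the blanket contains its parent `A`, its child `C`, and `D`, which shares the child `C` with it. `E` is excluded because it is only a grandchild.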

  • Research Article
  • Cited by 26
  • 10.1016/j.aei.2021.101516
Counterfactual inference to predict causal knowledge graph for relational transfer learning by assimilating expert knowledge --Relational feature transfer learning algorithm
  • Dec 31, 2021
  • Advanced Engineering Informatics
  • Jiarui Li + 2 more

  • Research Article
  • 10.46610/jbdtba.2026.v05i01.001
TradeWarNet: A Causal Attention-based Graph Framework for Analysing Trade War Shocks in Financial Markets
  • Jan 1, 2026
  • Journal of Big Data Technology and Business Analytics
  • Sharad Pandurang Latkar + 1 more

In the prevailing global economic environment, renewed trade tensions, selective tariff measures, and strategic trade interventions have significantly influenced financial market dynamics. These trade war-related shocks generate complex, time-varying, and cross-market effects that are difficult to capture using traditional econometric and correlation-based machine learning approaches. In this paper, TradeWarNet, a causal and attention-based graph learning framework designed to quantify and interpret the impact of trade war shocks on global financial markets is proposed. The proposed framework integrates causal event modelling to isolate trade-induced effects from broader macroeconomic influences and employs a temporal graph attention network to capture dynamic shock transmission across equity, foreign exchange, and commodity markets. Empirical analysis using multi-asset data from both developed and emerging economies demonstrates that TradeWarNet achieves improved volatility forecasting performance, enhanced structural break detection, and greater interpretability compared to benchmark models. The results further indicate that emerging markets exhibit higher sensitivity to trade war shocks under current market conditions, while select safe-haven assets display stabilizing characteristics. The proposed framework offers a policy-relevant and interpretable machine learning approach for analyzing trade-related financial risks.

  • Research Article
  • Cited by 25
  • 10.1016/j.ress.2024.110468
Information-based Gradient enhanced Causal Learning Graph Neural Network for fault diagnosis of complex industrial processes
  • Aug 28, 2024
  • Reliability Engineering and System Safety
  • Ruonan Liu + 4 more

  • Book Chapter
  • Cited by 19
  • 10.1007/978-3-642-04584-4_14
Mining Causal Relationships in Multidimensional Time Series
  • Jan 1, 2010
  • Yasser Mohammad + 1 more

Time series are ubiquitous in all domains of human endeavor. They are generated, stored, and manipulated during any kind of activity. The goal of this chapter is to introduce a novel approach to mine multidimensional time-series data for causal relationships. The main feature of the proposed system is supporting discovery of causal relations based on automatically discovered recurring patterns in the input time series. This is achieved by integrating a variety of data mining techniques. The main insight of the proposed system is that causal relations can be found more easily and robustly by analyzing meaningful events in the time series rather than by analyzing the time series numerical values directly. The RSST (Robust Singular Spectrum Transform) algorithm is used to find interesting points in every time series that is further analyzed by a constrained motif discovery algorithm (if needed) to learn basic events of the time series. The Granger-causality test is extended and applied to the multidimensional time-series describing the occurrences of these basic events rather than to the raw time-series data. The combined algorithm is evaluated using both synthetic and real world data. The real world application is to mine records of activities during a human-robot interaction experiment in which a human subject is guiding a robot to navigate using free hand gesture. The results show that the combined system can provide causality graphs representing the underlying relations between the human’s actions and robot behavior that cannot be recovered using standard causal graph learning procedures.

  • Conference Article
  • Cited by 29
  • 10.1145/3357384.3357864
Scalable Causal Graph Learning through a Deep Neural Network
  • Nov 3, 2019
  • Chenxiao Xu + 2 more

Learning the causal graph in a complex system is crucial for knowledge discovery and decision making, yet it remains a challenging problem because of the unknown nonlinear interaction among system components. Most of the existing methods either rely on predefined kernel or data distribution, or they focus simply on the causality between a single target and the remaining system. This work presents a deep neural network for scalable causal graph learning (SCGL) through low-rank approximation. The SCGL model can explore nonlinearity on both temporal and intervariable relationships without any predefined kernel or distribution assumptions. Through low-rank approximation, the noise influence is reduced, and better accuracy and high scalability are achieved. Experiments using synthetic and real-world datasets show that our SCGL algorithm outperforms existing state-of-the-art methods for causal graph learning.

  • Research Article
  • 10.1016/j.csda.2024.108065
Online graph topology learning from matrix-valued time series
  • Sep 16, 2024
  • Computational Statistics and Data Analysis
  • Yiye Jiang + 2 more
