A VISUAL GRAPHIC BASED MODELING FRAMEWORK OPTI-BLEND FOR INTEGRATED CODE ANALYSIS
Modern software systems are increasingly complex, heterogeneous, and distributed, which makes code analysis difficult. Traditional program analysis tools work in isolation: static analysis, dynamic analysis, dependency inspection, vulnerability scanning, and quality assessment are usually performed separately. This fragmentation leads to a lack of contextual knowledge, reduced explainability, and difficulty in finding the root causes of defects or vulnerabilities. To overcome these shortcomings, this paper proposes Opti-Blend, a visual graph-based modeling framework for integrated code analysis. Opti-Blend converts several program representations, such as Abstract Syntax Trees (AST), Control Flow Graphs (CFG), Data Flow Graphs (DFG), Program Dependence Graphs (PDG), and Call Graphs, into a Hybrid Program Graph (HPG). The framework uses a graph fusion mechanism to combine these multi-view representations into a single semantic model. A visual query layer lets users explain and investigate issues by following graph paths and dependencies. The proposed system supports defect detection, vulnerability detection, code smell identification, and dependency risk assessment in a single visual environment. Experimental validation on open-source repositories shows better detection performance and traceability than individual tools. Opti-Blend contributes a unified, understandable, and extensible modeling paradigm for next-generation integrated code intelligence systems.
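The fusion and query steps described in the abstract above can be illustrated with a minimal sketch. It assumes each view (CFG, DFG, call graph) is already available as a list of directed edges over shared node names, and that fusion means merging them into one multigraph whose edges remember the view they came from. The class `HybridProgramGraph`, its methods, and the node names are hypothetical and are not taken from Opti-Blend itself.

```python
from collections import defaultdict, deque


class HybridProgramGraph:
    """Multigraph whose edges carry the name of the view they came from."""

    def __init__(self):
        # node -> list of (neighbor, view) pairs
        self.edges = defaultdict(list)
        self.nodes = set()

    def add_view(self, view, edge_list):
        """Merge one representation (e.g. 'CFG') into the shared graph."""
        for src, dst in edge_list:
            self.nodes.update((src, dst))
            self.edges[src].append((dst, view))

    def find_path(self, start, goal, views=None):
        """Breadth-first search from start to goal.

        Returns a list of (node, view_used_to_reach_it) pairs, or None.
        If `views` is given, only edges from those views are followed.
        """
        queue = deque([[(start, None)]])
        seen = {start}
        while queue:
            path = queue.popleft()
            node = path[-1][0]
            if node == goal:
                return path
            for nxt, view in self.edges[node]:
                if nxt in seen or (views is not None and view not in views):
                    continue
                seen.add(nxt)
                queue.append(path + [(nxt, view)])
        return None


# Hypothetical program fragments used as node names.
hpg = HybridProgramGraph()
hpg.add_view("CFG", [("read_input", "check_len"), ("check_len", "copy_buf")])
hpg.add_view("DFG", [("read_input", "copy_buf")])
hpg.add_view("CALL", [("copy_buf", "memcpy")])

path = hpg.find_path("read_input", "memcpy")
print(path)
```

Because every edge keeps its view label, the returned path shows which representation justifies each hop, which is one plausible way to support the kind of traceability the abstract mentions. Passing `views={"CFG", "CALL"}` restricts the search to control flow and calls only.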
- Research Article
4
- 10.1016/j.jss.2023.111941
- Dec 27, 2023
- Journal of Systems and Software
On the impact of multiple source code representations on software engineering tasks — An empirical study
- Research Article
2
- 10.4018/ijossp.2017040101
- Apr 1, 2017
- International Journal of Open Source Software and Processes
Code clones are copied fragments that occur at different levels of abstraction and may have different origins in a software system. This article presents an approach which shows the significant parts of source code. Further, by using significant parts of a source code, a control flow graph can be generated. This control flow graph represents the statements of a code/program in the form of basic blocks or nodes and the edges represent the control flow between those basic blocks. A hybrid approach, named the Program Dependence Graph (PDG) is also presented in this article for the detection of non-trivial code clones. The program dependency graph approach consists of two approaches as a control dependency graph and a data dependency graph. The control dependency graph is generated by using a control flow graph. This article proposes an approach which can easily generate control flow graphs and by using control flow graph and reduced flowgraph approach, the trivial software clone, a similar textual structure, can be detected. The proposed approach is based on a tokenization concept.
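As an illustration of a control flow graph whose nodes are basic blocks, the sketch below uses the classic leader-based partition. This is a textbook technique, not the tokenization-based method of the article. The instruction format, a list of `(op, target)` pairs with `op` in `stmt`, `br`, `jmp`, `ret`, is a hypothetical simplification, and jump targets are assumed to be valid indices.

```python
def build_cfg(instrs):
    """Split a linear instruction list into basic blocks and connect them.

    instrs: list of (op, target) pairs. 'jmp' always jumps to index `target`,
    'br' either jumps to `target` or falls through, 'stmt' falls through,
    'ret' ends the function. Returns (blocks, edges), where blocks are
    half-open (start, end) index ranges and edges are (block, block) pairs.
    """
    if not instrs:
        return [], []
    n = len(instrs)

    # A leader is the first instruction of a basic block.
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jmp", "br"):
            leaders.add(target)           # jump targets start a block
        if op in ("jmp", "br", "ret") and i + 1 < n:
            leaders.add(i + 1)            # so does whatever follows a transfer

    starts = sorted(leaders)
    blocks = list(zip(starts, starts[1:] + [n]))
    block_of = {start: idx for idx, (start, _) in enumerate(blocks)}

    edges = set()
    for idx, (_, end) in enumerate(blocks):
        op, target = instrs[end - 1]      # last instruction decides successors
        if op in ("jmp", "br"):
            edges.add((idx, block_of[target]))
        if op in ("stmt", "br") and end < n:
            edges.add((idx, block_of[end]))
    return blocks, sorted(edges)


# A simple loop: init, test, body, jump back, return.
program = [
    ("stmt", None),  # 0: i = 0
    ("br", 4),       # 1: if i >= n goto 4
    ("stmt", None),  # 2: loop body
    ("jmp", 1),      # 3: goto 1
    ("ret", None),   # 4: return
]
blocks, edges = build_cfg(program)
print(blocks)
print(edges)
```

The back edge from the loop body to the test block is what distinguishes the loop in the resulting graph.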
- Conference Article
8
- 10.1109/icsme55016.2022.00042
- Oct 1, 2022
Deep learning models have been successfully applied to a variety of software engineering tasks, such as code classification, summarisation, and bug and vulnerability detection. In order to apply deep learning to these tasks, source code needs to be represented in a format that is suitable for input into the deep learning model. Most approaches to representing source code, such as tokens, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow graphs (CFGs) only focus on the code itself and do not take into account additional context that could be useful for deep learning models. In this paper, we argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed. We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model for two software engineering tasks. We outline our research agenda for adding further contextual information to source code representations for deep learning.
- Research Article
219
- 10.1016/j.infsof.2021.106576
- Mar 20, 2021
- Information and Software Technology
BGNN4VD: Constructing Bidirectional Graph Neural-Network for Vulnerability Detection
- Conference Article
1
- 10.1109/saner56733.2023.00053
- Mar 1, 2023
Automatic code annotation generation aims to generate readable annotations that describe the functionality of source code, which may facilitate software developers and programmers. Previous methods follow the encoder-decoder structures where the encoders are based on the abstract syntax trees (ASTs) to encode syntactic structures of code fragments. However, the AST alone cannot fully express complicated control structures, data flows, or dependencies of source code, leading to sub-optimal annotations. On the other hand, a functionality can be implemented in various ways with possibly different structures and token names. Most methods treat code fragments independently and do not exploit these similarities among code fragments. In this paper, we present HANCode2Seq, an automatic code annotation generation method by utilizing the code heterogeneous representation graph. Specifically, we construct the heterogeneous graph by combining multiple code induced graphs, including abstract syntax trees, control flow graphs, data flow graphs, and program dependency graphs. Then a heterogeneous graph attention network is applied to extract the comprehensive semantic meanings and syntactic structures of the source code fragments. Furthermore, we present a novel adaptive code similarity graph with code fragments being nodes. The representation of a code fragment is enhanced by aggregating information from other similar fragments on the graph, which may reduce the ambiguity of the code. The experimental results on real datasets show that our proposed model outperforms other baselines and produces more fluent and readable code annotations.
- Research Article
1
- 10.1145/3725212
- Mar 26, 2025
- ACM Transactions on Software Engineering and Methodology
Improving the performance of software applications is one of the most important tasks in software evolution and maintenance. In the Intel Microarchitecture, CPUs employ pipelining to utilize resources as effectively as possible. Some types of software patterns or algorithms can have implications on the underlying CPU pipelines and result in inefficiencies. Therefore, analyzing how well the CPU’s pipeline(s) are being utilized while running an application is important in software performance analysis. Existing techniques, such as Intel VTune Profiler, usually detect software performance issues from CPU pipeline metrics after the software enters production and during the running time. These techniques require developers to manually analyze monitoring data and perform additional test runs to obtain relevant information about performance problems. It costs a lot of time and human effort for developers to build, deploy, test, execute, and monitor the software. To alleviate these problems, we propose a novel approach named PGProf to predict the CPU pipeline before execution and provide the profiling feedback during the development process. PGProf exploits the graph neural networks to learn semantic and structural representations for C functions and then predict the fraction of pipeline slots in each category for them during the development process. Given a code snippet, we fuse different types of code structures, e.g., Abstract Syntax Tree (AST), Data Flow Graph (DFG), and Control Flow Graph (CFG) into one program graph. During offline learning, we first leverage the gated graph neural network to capture representations of C functions. PGProf then automatically estimates the final pipeline values according to the learned semantic and structural features. For online prediction, we predict pipeline metrics with four category values by leveraging the offline trained model. 
We build our dataset from C projects on GitHub and use Intel VTune Profiler to obtain profiling information by running them. Extensive experimental results show the promising performance of our model. We achieve absolute results of 49.90% and 79.44% in terms of \(Acc@5\%\) and \(Acc@10\%\), improvements of 8.0%-42.7% and 7.8%-20.1% over a set of baselines.
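The abstract reports \(Acc@5\%\) and \(Acc@10\%\) without defining them. The sketch below assumes the metric is the fraction of predictions whose absolute error against the measured value is within a given tolerance (0.05 or 0.10 when pipeline-slot fractions are expressed in the range 0 to 1). That reading, the function name `acc_at`, and the sample numbers are assumptions for illustration, not the paper's definition or data.

```python
def acc_at(preds, truths, tol):
    """Fraction of predictions with |pred - truth| <= tol (assumed definition)."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and equally long")
    hits = sum(1 for p, t in zip(preds, truths) if abs(p - t) <= tol)
    return hits / len(preds)


# Made-up pipeline-slot fractions for four functions.
predicted = [0.30, 0.52, 0.10, 0.80]
measured = [0.28, 0.45, 0.18, 0.79]

print(acc_at(predicted, measured, 0.05))
print(acc_at(predicted, measured, 0.10))
```

Under this reading, widening the tolerance can only keep or raise the score, which is consistent with the reported \(Acc@10\%\) being higher than \(Acc@5\%\).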
- Book Chapter
1
- 10.4018/979-8-3373-4862-9.ch005
- Jun 27, 2025
Large Language Models (LLMs) have gained traction in domains from software development to cybersecurity, particularly for detecting vulnerabilities in program source code. Their ability to analyze large codebases and identify security weaknesses makes them valuable in software security analysis. However, their effectiveness declines in the absence of intermediate representations such as Abstract Syntax Trees (AST), Control Flow Graphs (CFG), and Data Flow Graphs (DFG), or even tokenized forms of code. In this research study, we assess the performance of LLMs in detecting vulnerabilities directly from raw source code, without structural representations. By designing context-specific prompts, we aim to enhance the model's understanding of code semantics. Our findings show that LLMs can partially identify vulnerabilities from raw code alone, reaching up to 43% accuracy. This indicates both the potential and current limitations of prompt-based LLMs for static vulnerability detection.
- Research Article
4
- 10.1186/s42400-024-00245-5
- Oct 11, 2024
- Cybersecurity
Smart contracts have significant losses due to various types of vulnerabilities. However, traditional vulnerability detection methods rely extensively on expert rules, resulting in low detection accuracy and poor adaptability to novel attacks. To address these problems, in this paper, deep learning methods are combined with smart contract vulnerability code detection approaches. Abstract syntax trees (ASTs), which are special isomorphic graph structures, are an important bridge between source code and graph neural networks. By learning the AST, the model can understand the semantics of the source code. Moreover, graph neural networks have an increasing ability to address complex heterogeneous graphs. Therefore, control flow graphs are fused with data flow graphs on the basis of the ASTs to build heterogeneous graphs with richer code semantics. Furthermore, multigranularity analysis of the vulnerability detection results is performed, including coarse-grained contract-level vulnerability detection and fine-grained line-level vulnerability detection. Through this multigranularity detection approach, vulnerabilities in contracts can be identified and analysed more comprehensively, providing a richer perspective and more solutions for vulnerability detection. The experimental results show that the proposed multigranularity vulnerability detection method based on heterogeneous graphs (MVD-HG) improves both the accuracy and range of the detected vulnerability types in contract-level vulnerability detection tasks; moreover, in the line-level vulnerability detection task, the MVD-HG model achieves significant results and addresses the shortcomings of existing methods. In addition, based on code generation methods used in related fields, a data enhancement method based on the source code is developed, which effectively expands the experimental dataset to address the reduced credibility of the results due to insufficient amounts of data.
- Research Article
- 10.30871/jaic.v9i6.11090
- Dec 5, 2025
- Journal of Applied Informatics and Computing
This study develops and evaluates an automated assessment model using Abstract Syntax Trees (AST) with a view to overcoming the limitations of string-matching techniques in the assessment of Fill-in-the-Blank (FIB) programming answers. Traditional string-matching techniques have a relatively high False Negative Rate (FNR) of 21.5% within the context of detecting semantic equivalence. The current model uses semantic structural triangulation to ascertain the semantic similarity of student answers. Technical assessment shows that the AST approach markedly reduces the FNR to 4.5%. The model demonstrates high reliability (ϰ = 0.83) with high classification accuracy (F1 Score = 0.966) which attests to its inferential validity. From a pedagogical perspective, system implementation leads to substantial learning gains, evidenced by a large effect size (Cohen’s d = 1.82) and a high normalized gain (Normalized Gain = 0.90). Multiple regression analysis confirms that semantic accuracy is the primary causal factor driving improved student comprehension. Ontologically, while AST is valid as a partial representation, its limitations—particularly tree isomorphism in recursive structures—highlight the need for further exploration of graph isomorphism approaches. Control Flow Graphs (CFG) and Data Flow Graphs (DFG) offer more expressive relational models for capturing control and data dependencies. The model demonstrates functional feasibility with a System Usability Scale (SUS) score of 76.47. Overall, the AST Triangulation Model is validated as pedagogically effective, inferentially robust, and supportive of evaluative transparency. Future research recommends validating the model on more complex tasks and releasing it as open-source to support reproducibility.
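To show why comparing ASTs can lower the false negative rate relative to string matching, the sketch below uses Python's standard `ast` module. It is only an illustration of the general idea: it does not reproduce the study's triangulation model, and the helper names and sample answers are hypothetical.

```python
import ast


def same_text(a, b):
    """Naive string matching: only surrounding whitespace is ignored."""
    return a.strip() == b.strip()


def same_ast(a, b):
    """Structural matching: compare dumped syntax trees.

    ast.dump omits line and column attributes by default, so spacing,
    comments, and redundant layout do not affect the comparison.
    """
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


reference = "total = price * (1 + tax)"
answer_ok = "total=price*(1+tax)  # add tax"   # same structure, different text
answer_bad = "total = price * 1 + tax"          # different precedence

print(same_text(reference, answer_ok), same_ast(reference, answer_ok))
print(same_text(reference, answer_bad), same_ast(reference, answer_bad))
```

String matching rejects the correct but differently formatted answer, a false negative, while the tree comparison accepts it and still rejects the answer whose operator grouping differs.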
- Conference Article
11
- 10.1109/wpc.2005.17
- May 15, 2005
Refactoring is an essential and useful practice in developing and maintaining object-oriented software since it improves the design of existing code without changing its external behavior. Therefore, several refactoring tools tend to be integrated into contemporary IDEs. However, these tools represent source code as an abstract syntax tree (AST) and thus their implementations are hard to extend and modify. This paper presents Jrbx, a refactoring tool that uses a fine-grained XML representation of source code and supports stylized manipulations of the representation. Moreover, Jrbx aggressively exploits control flow graphs (CFGs) and program dependence graphs (PDGs) for both precondition checking and change creation. The use of the XML, CFG, and PDG representations makes the implementation of Jrbx more understandable and reusable, and thus facilitates tool developers creating new refactorings and modifying existing ones.
- Research Article
- 10.26599/tst.2024.9010220
- Aug 1, 2026
- Tsinghua Science and Technology
As mobile applications become increasingly complex and privacy regulations continue to evolve, the task of accurately identifying app violations in compliance detection has become a major challenge. Prior works mainly relied on taint analysis and dynamic monitoring to address this issue. However, taint analysis requires specifying sources and sinks for each violation, leading to multiple analysis rounds and inefficiency. Meanwhile, dynamic monitoring suffers from incomplete coverage, resulting in high false negatives. In this paper, we propose a novel graph structure, called Behavior Property Graph (BPG), for detecting non-compliant behaviors in Android applications. BPG integrates the features from various graph representations, including Abstract Syntax Tree (AST), Control Flow Graph (CFG), Call Graph (CG), Program Dependency Graph (PDG), and Pointer Assignment Graph (PAG), enabling comprehensive modeling of complex app behaviors. Violations are identified by querying the BPG using behavioral patterns extracted from real-world apps. We develop a prototype system called BPGᴇɴ to generate BPGs and evaluated its performance by testing seven types of non-compliant behaviors on a dataset of 200 real-world apps. Notably, BPGᴇɴ detects 14 violations within 13 previously unreported non-compliant applications. The results show that BPGᴇɴ can efficiently and effectively detect app compliance violations.
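Querying a property graph with a behavioral pattern can be sketched as matching a sequence of typed edges and node kinds. The data layout, the function `match_pattern`, the node kinds, and the example pattern (a method that calls an identifier-reading API whose result flows to a network API) are all hypothetical. The abstract does not describe the query language or graph schema the prototype actually uses.

```python
def match_pattern(nodes, edges, pattern):
    """Return every node-id path that matches the pattern.

    nodes:   {node_id: {"kind": str}}
    edges:   list of (src, edge_type, dst)
    pattern: [first_kind, (edge_type, kind), (edge_type, kind), ...]
    """
    outgoing = {}
    for src, etype, dst in edges:
        outgoing.setdefault(src, []).append((etype, dst))

    first_kind, steps = pattern[0], pattern[1:]
    # Start with every node of the required first kind.
    partial = [[n] for n, props in nodes.items() if props["kind"] == first_kind]

    # Extend each partial match by one typed edge per pattern step.
    for etype, kind in steps:
        partial = [
            path + [dst]
            for path in partial
            for e, dst in outgoing.get(path[-1], [])
            if e == etype and nodes[dst]["kind"] == kind
        ]
    return partial


nodes = {
    "onCreate": {"kind": "method"},
    "onPause": {"kind": "method"},
    "getDeviceId": {"kind": "api_read_id"},
    "sendHttp": {"kind": "api_network"},
    "writeLog": {"kind": "api_log"},
}
edges = [
    ("onCreate", "CG", "getDeviceId"),    # call graph edge
    ("getDeviceId", "PDG", "sendHttp"),   # dependence edge: value flows on
    ("onPause", "CG", "writeLog"),
]
pattern = ["method", ("CG", "api_read_id"), ("PDG", "api_network")]

matches = match_pattern(nodes, edges, pattern)
print(matches)
```

Because one graph holds call and dependence edges together, a single pattern can span both, which avoids running a separate analysis pass per violation type.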
- Research Article
1
- 10.1186/s13677-024-00629-5
- Apr 1, 2024
- Journal of Cloud Computing
The growth of multimedia applications poses new challenges to software facilities in edge computing. Developers must effectively develop edge computing software to accommodate the rapid expansion of multimedia applications. Code search has become a prevalent practice to enhance the efficiency of the construction of edge software infrastructure. Researchers have proposed lots of approaches for code search, and employed deep learning technology to extract features from program representations, such as token, AST, graphs, method name, and API. Nevertheless, two prominent issues remain: 1) there are only a few studies on the effective use of graph representation for code search (especially in Java language), and 2) there is a lack of empirical study on the contributions of different program representations. To address these issues, we conduct an empirical study to explore program representations, especially program graphs. To the best of our knowledge, this is the first attempt to conduct code search with mixed graphs representation for Java language, containing the control flow graph and the program dependence graph. We also present a hybrid approach to capture and fuse the features of a program with representations of Token, AST, and Mixed Graphs (TAMG). The results of our experiment show that our approach possesses the best ability (R@1 with 37% and R@10 with 67.1%). Our graph representation exhibits a positive effect, and the token and AST also have a significant contribution to the code search. Our findings can aid developers in efficiently searching for the desired code while constructing the software infrastructure for edge computing, which is crucial for the rapid expansion of multimedia applications.
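The R@1 and R@10 figures are recall-at-k scores. A minimal sketch of how such a score is commonly computed for code search follows, assuming exactly one relevant snippet per query. The queries, snippet ids, and function name are made up for illustration and are unrelated to the paper's dataset.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries whose relevant snippet appears in the top k results.

    rankings: {query: [snippet_id, ...]} ordered best first
    relevant: {query: snippet_id} with one correct snippet per query
    """
    if not rankings:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for query, ranked in rankings.items() if relevant[query] in ranked[:k]
    )
    return hits / len(rankings)


rankings = {
    "read file lines": ["s3", "s1", "s9"],
    "parse json": ["s4", "s2", "s7"],
    "sort list": ["s5", "s8", "s6"],
    "open socket": ["s0", "s11", "s12"],
}
relevant = {
    "read file lines": "s1",
    "parse json": "s4",
    "sort list": "s6",
    "open socket": "s10",
}

print(recall_at_k(rankings, relevant, 1))
print(recall_at_k(rankings, relevant, 3))
```

R@1 rewards only a correct top result, while larger k credits any query whose answer appears somewhere in the first k results, which is why R@10 is never lower than R@1.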
- Research Article
2
- 10.1088/1742-6596/1487/1/012031
- Mar 1, 2020
- Journal of Physics: Conference Series
Artificial Intelligence has played an increasingly important role in visual defect detection in recent years, while there are many challenges using deep learning for this application, such as the shortage of data, lack of knowledge of root cause of defects. In this paper, we combine deep learning with traditional AI methods, not only to solve unshaded defect detection but also find root causes of detected defects. First, we propose a taxonomy method called DataonomySM to extend a meta defect dataset with a small number of samples and a deep learning method to detect the image defects. For detected defect images, we use a generalized multi-image matting algorithm to extract common defects automatically. We apply this technology to identify defects that stem from systematic errors in a product line and later extended its use to watermark processing. Experimental results have shown great capability and versatility of our proposed methods.
- Conference Article
5
- 10.1145/3597503.3608136
- Feb 6, 2024
While the majority of existing pre-trained models from code learn source code features such as code tokens and abstract syntax trees, there are some other works that focus on learning from compiler intermediate representations (IRs). Existing IR-based models typically utilize IR features such as instructions, control and data flow graphs (CDFGs), call graphs, etc. However, these methods confuse variable nodes and instruction nodes in a CDFG and fail to distinguish different types of flows, and the neural networks they use fail to capture long-distance dependencies and have over-smoothing and over-squashing problems. To address these weaknesses, we propose FAIR, a Flow type-Aware pre-trained model for IR that involves employing (1) a novel input representation of IR programs; (2) Graph Transformer to address over-smoothing, over-squashing and long-dependencies problems; and (3) five pre-training tasks that we specifically propose to enable FAIR to learn the semantics of IR tokens, flow type information, and the overall representation of IR. Experimental results show that FAIR can achieve state-of-the-art results on four code-related downstream tasks.
- Research Article
- 10.1038/s41598-025-31209-5
- Dec 11, 2025
- Scientific Reports
Current software defect prediction and code quality assessment methods treat these inherently related tasks independently, failing to leverage their complementary information. Existing graph-based approaches lack the ability to jointly model structural dependencies and quality characteristics, limiting their effectiveness in capturing the complex relationships between defect patterns and code quality indicators. This paper proposes a novel integrated model that simultaneously tackles both objectives using graph neural networks to leverage the inherent graph structure of software systems. Our novelty lies in the first-of-its-kind integration of multi-level graph representations (AST, CFG, DFG) with a dual-branch attention-based GNN architecture for simultaneous defect prediction and quality assessment. Our approach constructs multi-level graph representations by integrating abstract syntax trees, control flow graphs, and data flow graphs, capturing both syntactic and semantic relationships in source code. The proposed dual-branch GNN architecture employs shared representation learning with attention mechanisms and multi-task optimization to exploit complementary information between defect prediction and quality assessment tasks. Comprehensive experiments on six real-world software projects demonstrate significant improvements over traditional methods, achieving F1-scores of 0.811 and AUC values of 0.896 for defect prediction, while showing 9.3% average improvement in code quality assessment accuracy across multiple quality dimensions. The integration strategy proves effective in capturing complex structural dependencies and provides actionable insights for software development teams, establishing a foundation for intelligent software engineering tools that deliver comprehensive code analysis capabilities.